[PATCH] Re: [openib-general] Re: IPoIB Failure CQ overrun
Roland Dreier
roland at topspin.com
Tue Dec 21 11:24:22 PST 2004
Michael> I'm a bit ill, expect to work on it tomorrow. Could you
Michael> post the patch with these dumps?
The patch is below.
Testing on PCI Express/Arbel systems (with dual 3.2 GHz Xeons) with FW
4.5.3, I've seen somewhat different behavior. Even after setting
IPOIB_NUM_WC to 1 so that IPoIB never polls more than one CQE at a time
(and so we always increment the consumer index by exactly 1), I still
get the CQ overrun. The debug patch produces this output:
ib_mthca 0000:02:00.0: CQ overrun on CQN 00000082
ib0: timing out; 3 sends 128 receives not completed
divert: no divert_blk to free, ib0 not ethernet
context for CQN 82
cons_index 7f, nfrees 17d47f, ndfrees 0
[ 0] 90040000
[ 4] 00000000
[ 8] 00000000
[ c] e8000001
[10] 00000002
[14] 00000001
[18] 02000000
[1c] 00000023
[20] 0000007d
[24] 7fffffff
[28] 00000080
[2c] 0000007f
[30] f8000082
[34] 002483b4
[38] 00000001
[3c] 00000000
You can see that the HW's consumer index (at offset 0x28) is 0x80, one
more than the driver's cons_index of 0x7f. Even more interesting is
this dump from the other system on the other end of this netpipe run:
context for CQN 82
cons_index 81, nfrees 17d481, ndfrees 0
[ 0] 00040100
[ 4] 00000000
[ 8] 00000000
[ c] e8000001
[10] 00000002
[14] 00000001
[18] 02000000
[1c] 00000023
[20] 00000080
[24] 00000080
[28] 000000a7
[2c] 00000081
[30] f8000082
[34] 002483b4
[38] 00000001
[3c] 00000000
Here the HW CI is 0xa7, which is 0x26 higher than the driver's value
of 0x81. It looks like we avoided the CQ overrun because the CI got
bumped past the correct value more than once.
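For anyone comparing their own dumps, the check I'm doing by eye can be
sketched roughly as below. This is only an illustration: ci_drift() and
CQ_CTX_CI_WORD are hypothetical names that are not part of mthca, the
arrays are just the two dumps above typed in as host-order words, and
the subtraction assumes the raw values can be compared directly without
any masking.

```c
#include <stdint.h>

/* Word index of the HW consumer index in the dumped CQ context.
 * Byte offset 0x28 is taken from the dumps in this mail. */
#define CQ_CTX_CI_WORD (0x28 / 4)

/* First dump (system that hit the CQ overrun), driver cons_index 0x7f. */
static const uint32_t dump_a[16] = {
	0x90040000, 0x00000000, 0x00000000, 0xe8000001,
	0x00000002, 0x00000001, 0x02000000, 0x00000023,
	0x0000007d, 0x7fffffff, 0x00000080, 0x0000007f,
	0xf8000082, 0x002483b4, 0x00000001, 0x00000000,
};

/* Second dump (other end of the netpipe run), driver cons_index 0x81. */
static const uint32_t dump_b[16] = {
	0x00040100, 0x00000000, 0x00000000, 0xe8000001,
	0x00000002, 0x00000001, 0x02000000, 0x00000023,
	0x00000080, 0x00000080, 0x000000a7, 0x00000081,
	0xf8000082, 0x002483b4, 0x00000001, 0x00000000,
};

/* Hypothetical helper: how far the HW consumer index in a dumped
 * context is ahead of the driver's cons_index. Assumes the raw
 * values are directly comparable (no wrap or mask applied). */
static uint32_t ci_drift(const uint32_t ctx[16], uint32_t driver_ci)
{
	return ctx[CQ_CTX_CI_WORD] - driver_ci;
}
```

With the values above, ci_drift(dump_a, 0x7f) gives 1 and
ci_drift(dump_b, 0x81) gives 0x26, matching the two cases described.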
Thanks,
Roland
Index: hw/mthca/mthca_provider.h
===================================================================
--- hw/mthca/mthca_provider.h (revision 1370)
+++ hw/mthca/mthca_provider.h (working copy)
@@ -138,6 +138,8 @@
int cqn;
int cons_index;
int is_direct;
+ int nfrees;
+ int ndfrees;
union {
struct mthca_buf_list direct;
struct mthca_buf_list *page_list;
Index: hw/mthca/mthca_cq.c
===================================================================
--- hw/mthca/mthca_cq.c (revision 1370)
+++ hw/mthca/mthca_cq.c (working copy)
@@ -535,6 +535,8 @@
}
if (freed) {
+ cq->nfrees += freed;
+ cq->ndfrees += freed - 1;
wmb();
inc_cons_index(dev, cq, freed);
}
@@ -706,6 +708,7 @@
spin_unlock_irq(&dev->cq_table.lock);
cq->cons_index = 0;
+ cq->nfrees = cq->ndfrees = 0;
kfree(dma_list);
kfree(mailbox);
@@ -764,11 +767,13 @@
mthca_warn(dev, "HW2SW_CQ returned status 0x%02x\n",
status);
- if (0) {
+ if (1) {
u32 *ctx = MAILBOX_ALIGN(mailbox);
int j;
printk(KERN_ERR "context for CQN %x\n", cq->cqn);
+ printk(KERN_ERR "cons_index %x, nfrees %x, ndfrees %x\n",
+ cq->cons_index, cq->nfrees, cq->ndfrees);
for (j = 0; j < 16; ++j)
printk(KERN_ERR "[%2x] %08x\n", j * 4, be32_to_cpu(ctx[j]));
}