[openib-general] Race in mthca_cmd_post()
John Partridge
johnip at sgi.com
Fri Oct 13 12:17:30 PDT 2006
Roland,
I have been testing OFED-1.1 on SGI Altix (ia64), and on some of our machines
we see a kernel panic caused by a PIO read error coming from mthca_cmd_post().
Here is the stack trace:
0xe000006040f28000 4957 4913 0 4 R 0xe000006040f283a0 *modprobe
0xa00000010000caa0 ia64_leave_kernel
0xa00000010041a720 sn_dma_flush+0x20
args (0xc000018010780698, 0xe000016003ad8280, 0xa00000010013dc90, 0x308, 0x286)
0xa00000010040f940 ___sn_readl+0x40
args (0xc000018010780698, 0xfff7fff2, 0xa00000020628d2c0, 0x60d, 0xe000026003caa500)
0xa00000020628d2c0 [ib_mthca]mthca_cmd_post+0x520
args (0xe000026003caa000, 0x0, 0x8000000000000000, 0x0, 0x0)
0xa00000020628daa0 [ib_mthca]mthca_cmd_poll+0xa0
args (0xe000026003caa000, 0x0, 0xe000006040f2fa10, 0x1, 0x0)
0xa00000020628def0 [ib_mthca]mthca_cmd_imm+0xd0
args (0xe000026003caa000, 0x0, 0xe000006040f2fa10, 0x0, 0x0)
0xa00000020628e0b0 [ib_mthca]mthca_SYS_EN+0x50
args (0xe000026003caa408, 0xe000006040f2fa22, 0xe000026079df5880, 0xa00000020628a680, 0x40e)
0xa00000020628a680 [ib_mthca]mthca_init_hca+0x1720
args (0xe000026003caa000, 0xa00000020628b7f0, 0xe000006040f2fa22, 0xe000026003caa418, 0xa0000001008ee010)
0xa00000020628bd00 [ib_mthca]__mthca_init_one+0xe60
args (0xe00003607a234800, 0x0, 0xe000026003caa408, 0xe000026003caa000, 0x0)
0xa00000020628cd40 [ib_mthca]mthca_init_one+0x100
args (0xe00003607a234800, 0xa0000002062cc0a8, 0xa0000002062ed450, 0xffffffffffffffed, 0xa0000001002b7520)
0xa0000001002b7520 pci_device_probe+0x260
args (0xa000000100720e80, 0xa000000100720ea8, 0xe00003607a234800, 0xa0000002062cc340, 0x0)
0xa0000001003a9d40 driver_probe_device+0x100
args (0xa0000002062cc398, 0xe00003607a234870, 0x205, 0xe00003607a234a18, 0xa0000001003aa040)
0xa0000001003aa040 __driver_attach+0xc0
args (0xe00003607a234870, 0xa0000002062cc398, 0xe00003607a2349f0, 0xa0000001003a9020, 0x38a)
0xa0000001003a9020 bus_for_each_dev+0x80
args (0x0, 0x0, 0xa0000002062cc398, 0xa00000010061ac40, 0xa0000001003a9b60)
0xa0000001003a9b60 driver_attach+0x40
args (0xa0000002062cc398, 0xa0000001003a87a0, 0x40b, 0x40b)
I put a PCI-X analyzer on the bus with the HCA and saw a Memory Read to register 698
but no evidence of the SYS_EN command making it down to the card. We were trying to read the go bit before
the DOORBELL had completed. I think this can only happen on multi-CPU machines with fast CPUs and an
architecture that does not strictly order PIO accesses.
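The ordering problem described above can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions, not driver code: every name in it (toy_dev, toy_post_write, toy_barrier, toy_read, toy_go_bit_seen, TOY_GO_BIT) is hypothetical, and the "fabric" is just an array of posted writes that a read is allowed to overtake. toy_barrier() stands in for a write barrier such as mmiowb() and is issued before the go-bit read, as in the patch below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: a toy model, not the mthca driver. */
#define TOY_GO_BIT    (1u << 0)   /* stand-in for a "command pending" flag */
#define TOY_QUEUE_LEN 4

struct toy_dev {
    uint32_t status;                  /* register value as the device sees it */
    uint32_t pending[TOY_QUEUE_LEN];  /* posted writes still in flight */
    size_t   npending;
};

/* Post a write: it is queued in the fabric and not yet visible at the device. */
static void toy_post_write(struct toy_dev *d, uint32_t val)
{
    assert(d->npending < TOY_QUEUE_LEN);
    d->pending[d->npending++] = val;
}

/* Stand-in for a write barrier: drain every posted write, in order. */
static void toy_barrier(struct toy_dev *d)
{
    for (size_t i = 0; i < d->npending; i++)
        d->status = d->pending[i];
    d->npending = 0;
}

/* A read that overtakes posted writes (worst case for weak PIO ordering). */
static uint32_t toy_read(const struct toy_dev *d)
{
    return d->status;
}

/*
 * An earlier write sets the go bit; the next post optionally issues a
 * barrier and then checks the go bit. Returns true if the bit is seen.
 */
static bool toy_go_bit_seen(bool use_barrier)
{
    struct toy_dev d = {0};

    toy_post_write(&d, TOY_GO_BIT);
    if (use_barrier)
        toy_barrier(&d);
    return (toy_read(&d) & TOY_GO_BIT) != 0;
}
```

In this model toy_go_bit_seen(false) returns false because the read overtakes the queued write, while toy_go_bit_seen(true) returns true because the barrier drains the queue first. Real hardware ordering is more subtle than this; the model only shows why a barrier between the write and the read changes what the read observes.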
To fix this I have the following patch. Could you please look at it and let me know what you think?
--- drivers/infiniband/hw/mthca/mthca_cmd.c 2006-10-05 08:07:01.000000000 -0500
+++ fix/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-10-13 14:01:09.104455038 -0500
@@ -282,12 +282,15 @@
 	mutex_lock(&dev->cmd.hcr_mutex);
-	if (event && dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS && fw_cmd_doorbell)
+	if (event && dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS && fw_cmd_doorbell) {
+		mmiowb();
 		mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier,
 					   op_modifier, op, token);
-	else
+	} else {
+		mmiowb();
 		err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
 					 op_modifier, op, token, event);
+	}
 	mutex_unlock(&dev->cmd.hcr_mutex);
 	return err;
Thanks
John
--
John Partridge
Silicon Graphics Inc
Tel: 651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com