[openib-general] Race in mthca_cmd_post()

John Partridge johnip at sgi.com
Fri Oct 13 12:17:30 PDT 2006


Roland,

I have been testing OFED-1.1 on SGI Altix (ia64), and on some of our machines
we see a kernel panic caused by a PIO read error coming from mthca_cmd_post().
Here is the stack trace:

0xe000006040f28000     4957     4913  0    4   R  0xe000006040f283a0 *modprobe
0xa00000010000caa0 ia64_leave_kernel
0xa00000010041a720 sn_dma_flush+0x20
         args (0xc000018010780698, 0xe000016003ad8280, 0xa00000010013dc90, 0x308, 0x286)
0xa00000010040f940 ___sn_readl+0x40
         args (0xc000018010780698, 0xfff7fff2, 0xa00000020628d2c0, 0x60d, 0xe000026003caa500)
0xa00000020628d2c0 [ib_mthca]mthca_cmd_post+0x520
         args (0xe000026003caa000, 0x0, 0x8000000000000000, 0x0, 0x0)
0xa00000020628daa0 [ib_mthca]mthca_cmd_poll+0xa0
         args (0xe000026003caa000, 0x0, 0xe000006040f2fa10, 0x1, 0x0)
0xa00000020628def0 [ib_mthca]mthca_cmd_imm+0xd0
         args (0xe000026003caa000, 0x0, 0xe000006040f2fa10, 0x0, 0x0)
0xa00000020628e0b0 [ib_mthca]mthca_SYS_EN+0x50
         args (0xe000026003caa408, 0xe000006040f2fa22, 0xe000026079df5880, 0xa00000020628a680, 0x40e)
0xa00000020628a680 [ib_mthca]mthca_init_hca+0x1720
         args (0xe000026003caa000, 0xa00000020628b7f0, 0xe000006040f2fa22, 0xe000026003caa418, 0xa0000001008ee010)
0xa00000020628bd00 [ib_mthca]__mthca_init_one+0xe60
         args (0xe00003607a234800, 0x0, 0xe000026003caa408, 0xe000026003caa000, 0x0)
0xa00000020628cd40 [ib_mthca]mthca_init_one+0x100
         args (0xe00003607a234800, 0xa0000002062cc0a8, 0xa0000002062ed450, 0xffffffffffffffed, 0xa0000001002b7520)
0xa0000001002b7520 pci_device_probe+0x260
         args (0xa000000100720e80, 0xa000000100720ea8, 0xe00003607a234800, 0xa0000002062cc340, 0x0)
0xa0000001003a9d40 driver_probe_device+0x100
         args (0xa0000002062cc398, 0xe00003607a234870, 0x205, 0xe00003607a234a18, 0xa0000001003aa040)
0xa0000001003aa040 __driver_attach+0xc0
         args (0xe00003607a234870, 0xa0000002062cc398, 0xe00003607a2349f0, 0xa0000001003a9020, 0x38a)
0xa0000001003a9020 bus_for_each_dev+0x80
         args (0x0, 0x0, 0xa0000002062cc398, 0xa00000010061ac40, 0xa0000001003a9b60)
0xa0000001003a9b60 driver_attach+0x40
         args (0xa0000002062cc398, 0xa0000001003a87a0, 0x40b, 0x40b)

I put a PCI-X analyzer on the bus along with the HCA and saw a Memory Read to register 698
but no evidence of the SYS_EN command making it down to the card. We were trying to read the go bit before
the DOORBELL had completed. I think this can only happen on multi-CPU machines with fast CPUs and an
architecture that does not order PIO very strictly.
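
To illustrate the ordering hazard described above, here is a toy user-space model. It is not driver code and does not describe how the Altix or PCI-X fabric actually behaves; it only shows how a read can observe device state before an earlier posted write has arrived. All names in it (toy_dev, toy_writel, toy_flush, toy_readl, go_bit_visible) and the GO_BIT position are made up for the sketch. In the driver, the role of toy_flush() would be played by a barrier such as mmiowb().

```c
#include <assert.h>
#include <stdint.h>

#define GO_BIT          (1u << 23)  /* hypothetical bit position */
#define TOY_QUEUE_DEPTH 4

/*
 * Toy model: writes are "posted" into a queue and reach the device
 * register only when the queue is flushed. Reads bypass the queue
 * and return whatever the device currently holds.
 */
struct toy_dev {
    uint32_t status;                    /* device-side register */
    uint32_t pending[TOY_QUEUE_DEPTH];  /* posted, not yet delivered */
    int npending;
};

/* Post a write; it is not visible to the device until toy_flush(). */
static void toy_writel(struct toy_dev *d, uint32_t val)
{
    assert(d->npending < TOY_QUEUE_DEPTH);
    d->pending[d->npending++] = val;
}

/* Deliver all posted writes in order (stands in for a barrier). */
static void toy_flush(struct toy_dev *d)
{
    for (int i = 0; i < d->npending; i++)
        d->status = d->pending[i];
    d->npending = 0;
}

/* Read the device register directly, ignoring posted writes. */
static uint32_t toy_readl(const struct toy_dev *d)
{
    return d->status;
}

/* Returns 1 if the device currently shows the go bit, else 0. */
static int go_bit_visible(const struct toy_dev *d)
{
    return (toy_readl(d) & GO_BIT) != 0;
}
```

In this model, calling toy_writel(&d, GO_BIT) and then go_bit_visible(&d) returns 0 until toy_flush(&d) has run, which mirrors reading the go bit before the command write has reached the card.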

To fix this I have the following patch. Could you please look at it and let me know what you think?


--- drivers/infiniband/hw/mthca/mthca_cmd.c     2006-10-05 08:07:01.000000000 -0500
+++ fix/drivers/infiniband/hw/mthca/mthca_cmd.c 2006-10-13 14:01:09.104455038 -0500
@@ -282,12 +282,15 @@

         mutex_lock(&dev->cmd.hcr_mutex);

-       if (event && dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS && fw_cmd_doorbell)
+       if (event && dev->cmd.flags & MTHCA_CMD_POST_DOORBELLS && fw_cmd_doorbell) {
+               mmiowb();
                 mthca_cmd_post_dbell(dev, in_param, out_param, in_modifier,
                                            op_modifier, op, token);
-       else
+       } else {
+               mmiowb();
                 err = mthca_cmd_post_hcr(dev, in_param, out_param, in_modifier,
                                          op_modifier, op, token, event);
+       }

         mutex_unlock(&dev->cmd.hcr_mutex);
         return err;


Thanks
John
-- 
John Partridge

Silicon Graphics Inc
Tel:  651-683-3428
Vnet: 233-3428
E-Mail: johnip at sgi.com



