[ofa-general] Oops in mthca

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Thu Dec 20 16:11:50 PST 2007


I discovered the following Oops while developing a patch to enable SRQ on HCAs with fewer than
16 SG elements.

The root cause appears to be that ib_query_device(priv->ca, &attr)
reports an incorrect value for attr.max_srq_sge: it returns 28
instead of the 16 I expected.
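
As a minimal userspace sketch of the kind of sanity check that would flag this before any receive is posted: the struct and the helper name below are hypothetical stand-ins (only the max_srq_sge field name comes from the report above), and whether 28 is really wrong for this HCA is exactly the open question.

```c
#include <stdio.h>

/* Hypothetical stand-in for the single device attribute discussed above. */
struct device_attr_sketch {
    int max_srq_sge;   /* value reported by the device query */
};

/*
 * Return the number of SG elements to use for the SRQ: the smaller of the
 * reported limit and the limit the caller expects. Sets *mismatch to 1 when
 * the device reports more than expected, so the caller can log it.
 */
static int srq_sge_to_use(const struct device_attr_sketch *attr,
                          int expected_max, int *mismatch)
{
    *mismatch = attr->max_srq_sge > expected_max;
    if (*mismatch)
        fprintf(stderr, "max_srq_sge reported as %d, expected at most %d\n",
                attr->max_srq_sge, expected_max);
    return *mismatch ? expected_max : attr->max_srq_sge;
}
```

With the values from this report (reported 28, expected 16) the helper would fall back to 16 and set the mismatch flag.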


Dec 20 13:19:47 elm3b39 kernel: Oops: Kernel access of bad area, sig: 11 [#2]
Dec 20 13:19:47 elm3b39 kernel: SMP NR_CPUS=128 NUMA pSeries
Dec 20 13:19:47 elm3b39 kernel: Modules linked in: ib_ipoib autofs4 rdma_ucm rdma_cm ib_addr iw_cm ib_uverbs ib_umad ib_mthca ib_cm ib_sa ib_mad ib_core ipv6 binfmt_misc parport_pc lp parport sg e1000 dm_snapshot dm_zero dm_mirror dm_mod ipr libata firmware_class sd_mod scsi_mod ehci_hcd ohci_hcd usbcore
Dec 20 13:19:47 elm3b39 kernel: NIP: d0000000002ffb60 LR: d0000000002ffb08 CTR: c00000000043a9b0
Dec 20 13:19:47 elm3b39 kernel: REGS: c0000001d05ff2e0 TRAP: 0300   Tainted: G      D  (2.6.24-rc5)
Dec 20 13:19:47 elm3b39 kernel: MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24024424  XER: 00000010
Dec 20 13:19:47 elm3b39 kernel: DAR: 0000000060bf0008, DSISR: 0000000040000000
Dec 20 13:19:47 elm3b39 kernel: TASK = c0000001d2e4a000[8233] 'modprobe' THREAD: c0000001d05fc000 CPU: 4
Dec 20 13:19:47 elm3b39 kernel: GPR00: 0000000000000001 c0000001d05ff560 d000000000320308 c0000001d2e54010
Dec 20 13:19:47 elm3b39 kernel: GPR04: 0000000000000000 0000000000000001 c0000001d0654000 0000000000000001
Dec 20 13:19:47 elm3b39 kernel: GPR08: 0000000000000000 000000000000001c 0000000060bf0000 0000000060bf0000
Dec 20 13:19:47 elm3b39 kernel: GPR12: d000000000301fc8 c00000000057f600 d0000000005a2090 d0000000005a20d0
Dec 20 13:19:47 elm3b39 kernel: GPR16: 0000000000000000 00000000000001e3 00000000000001e3 d00000000032eba0
Dec 20 13:19:47 elm3b39 kernel: GPR20: 0000000000000000 0000000000000034 c0000001d05ff690 0000000000000001
Dec 20 13:19:47 elm3b39 kernel: GPR24: c0000000e482b000 0000000000000000 0000000000000000 0000000000000000
Dec 20 13:19:47 elm3b39 kernel: GPR28: c0000001d2972c00 0000000000000000 d00000000031f190 c0000001d020ee78
Dec 20 13:19:47 elm3b39 kernel: NIP [d0000000002ffb60] .mthca_tavor_post_srq_recv+0xe0/0x2e0 [ib_mthca]
Dec 20 13:19:47 elm3b39 kernel: LR [d0000000002ffb08] .mthca_tavor_post_srq_recv+0x88/0x2e0 [ib_mthca]
Dec 20 13:19:47 elm3b39 kernel: Call Trace:
Dec 20 13:19:47 elm3b39 kernel: [c0000001d05ff560] [d0000000002ffad4] .mthca_tavor_post_srq_recv+0x54/0x2e0 [ib_mthca] (unreliable)
Dec 20 13:19:47 elm3b39 kernel: [c0000001d05ff620] [d0000000003239fc] .ipoib_cm_post_receive_srq+0xbc/0x150 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c0000001d05ff6d0] [d000000000325984] .ipoib_cm_dev_init+0x2f4/0x560 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c0000001d05ff870] [d000000000322c74] .ipoib_transport_dev_init+0xd4/0x330 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c0000001d05ff970] [d00000000031f90c] .ipoib_ib_dev_init+0x3c/0xc0 [ib_ipoib]
Dec 20 13:19:47 elm3b39 kernel: [c0000001d05ffa00] [d00000000031aaac] .ipoib_dev_init+0x9c/0x160 [ib_ipoib]
Dec 20 13:19:48 elm3b39 kernel: [c0000001d05ffaa0] [d00000000031ad98] .ipoib_add_one+0x228/0x3b0 [ib_ipoib]
Dec 20 13:19:48 elm3b39 kernel: [c0000001d05ffb60] [d0000000001bf6ec] .ib_register_client+0xcc/0x110 [ib_core]
Dec 20 13:19:48 elm3b39 kernel: [c0000001d05ffc00] [d000000000328484] .ipoib_init_module+0x174/0x2288 [ib_ipoib]
Dec 20 13:19:48 elm3b39 kernel: [c0000001d05ffc90] [c00000000008eeec] .sys_init_module+0x20c/0x1aa0
Dec 20 13:19:48 elm3b39 kernel: [c0000001d05ffe30] [c0000000000086ac] syscall_exit+0x0/0x40
Dec 20 13:19:48 elm3b39 kernel: Instruction dump:
Dec 20 13:19:48 elm3b39 kernel: 419c0204 2f890000 38630010 38e00000 409d0060 38e00000 39000000 60000000
Dec 20 13:19:48 elm3b39 kernel: e95f0010 38070001 7c0707b4 7d6a4214 <800b0008> 90030000 60000000 60000000
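
For anyone reading the dump: the bracketed word <800b0008> in the instruction dump is the faulting instruction. The sketch below is a small standalone decoder for PowerPC D-form loads, not code from the driver; the struct and function names are made up for illustration. Under that decoding the word is a 32-bit load (primary opcode 32, lwz) into r0 from offset 8 off r11. GPR11 in the register dump is 0000000060bf0000, so the access lands at 0x60bf0008, which matches the DAR value.

```c
#include <stdint.h>

/* Fields of a PowerPC D-form load/store instruction word. */
struct dform {
    unsigned opcode;  /* primary opcode, top 6 bits (32 is lwz) */
    unsigned rt;      /* target register number */
    unsigned ra;      /* base register number */
    int16_t  d;       /* signed 16-bit displacement */
};

/* Split a 32-bit instruction word into its D-form fields. */
static struct dform decode_dform(uint32_t insn)
{
    struct dform f;
    f.opcode = insn >> 26;
    f.rt = (insn >> 21) & 0x1f;
    f.ra = (insn >> 16) & 0x1f;
    f.d = (int16_t)(insn & 0xffff);
    return f;
}

/* Effective address for the RA != 0 case: base register plus displacement. */
static uint64_t effective_address(uint64_t base, int16_t d)
{
    return base + (uint64_t)(int64_t)d;
}
```

Feeding it 0x800b0008 and the GPR11 value from the dump reproduces the DAR, which suggests the bad pointer was already in r11 before the load.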


lspci -v gives me the following:

0002:d8:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, medium devsel, latency 144
        Bus: primary=d8, secondary=d9, subordinate=d9, sec-latency=128
        Memory behind bridge: c0000000-c88fffff
        Capabilities: [70] PCI-X bridge device

0002:d9:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
        Subsystem: Mellanox Technologies MT23108 InfiniHost
        Flags: bus master, 66MHz, medium devsel, latency 144, IRQ 121
        Memory at 400c8800000 (64-bit, non-prefetchable) [size=1M]
        Memory at 400c8000000 (64-bit, prefetchable) [size=8M]
        Memory at 400c0000000 (64-bit, prefetchable) [size=128M]
        Capabilities: [40] MSI-X: Enable- Mask- TabSize=32
        Capabilities: [50] Vital Product Data
        Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
        Capabilities: [70] PCI-X non-bridge device

Pradeep




