[ofa-general] Oops with today's OFED 1.3

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Mon Feb 4 12:24:14 PST 2008


I pulled today's (Feb 4th) OFED build and saw the following Oops while touch testing
on ehca1 on a 2.6.24 kernel.

Modules linked in: ib_ipoib ib_cm ib_sa ib_uverbs ib_umad ib_ehca ib_mthca ib_mad ib_core joydev st ide_cd ipv6 sg pdc202xx_new e1000 ibmveth dm_mod ipr libata firmware_class sr_mod cdrom sd_mod scsi_mod
NIP: d000000000299ca8 LR: d000000000299a70 CTR: d00000000015ec04
REGS: c0000001cc85f3b0 TRAP: 0300   Not tainted  (2.6.23-ppc64)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 24022424  XER: 00000020
DAR: 000000000000002c, DSISR: 0000000042000000
TASK = c0000001d883d4a0[17052] 'modprobe' THREAD: c0000001cc85c000 CPU: 2
GPR00: 0000000000000000 c0000001cc85f630 d0000000002b5cf0 ffffffffffffffda 
GPR04: c0000001cc85f760 ffffffffffffffda d0000000002a7eb0 0000000000000000 
GPR08: 0000000000000000 0000000000000000 0000000000000001 00000000001b4800 
GPR12: d00000000029ef30 c0000000005a8280 c0000001d895aa20 0000000000000000 
GPR16: 0000000000000008 0000000000000000 0000000000000000 d00000000040f27e 
GPR20: 0000000000000211 0000000000000000 0000000000000000 c0000001cd1e0000 
GPR24: 0000000000000000 d0000000002ad9d8 d0000000002a7eb0 0000000000000001 
GPR28: c0000001cc85f760 0000000000000000 d0000000002b4ce0 c0000001cd1e0780 
NIP [d000000000299ca8] .ipoib_cm_dev_init+0x440/0x63c [ib_ipoib]
LR [d000000000299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib]
Call Trace:
[c0000001cc85f630] [d000000000299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib] (unreliable)
[c0000001cc85f7d0] [d000000000297f4c] .ipoib_transport_dev_init+0x120/0x458 [ib_ipoib]
[c0000001cc85f930] [d00000000029463c] .ipoib_ib_dev_init+0x44/0xb8 [ib_ipoib]
[c0000001cc85f9c0] [d0000000002902ec] .ipoib_dev_init+0xe0/0x138 [ib_ipoib]
[c0000001cc85fa60] [d000000000290544] .ipoib_add_one+0x200/0x424 [ib_ipoib]
[c0000001cc85fb20] [d0000000001610e4] .ib_register_client+0x94/0xf4 [ib_core]
[c0000001cc85fbb0] [d00000000029dcac] .ipoib_init_module+0x1f8/0x246c [ib_ipoib]
[c0000001cc85fc70] [c0000000000905f0] .sys_init_module+0x176c/0x187c
[c0000001cc85fe30] [c00000000000852c] syscall_exit+0x0/0x40
Instruction dump:
801f0f20 3b600000 2f800000 409d0040 e81f0f30 e97f04f0 7b6926e4 395b0001 
7d5b07b4 7c080214 816b0018 7d290214 <9169002c> 60000000 60000000 60000000 


I tracked this down to the following area of code:
+       for (j = 0; j < ipoib_recvq_size; ++j) {
+               for (i = 0; i < priv->cm.num_frags; ++i)
+                       priv->cm.rx_wr_arr[j].rx_sge[i].lkey = priv->mr->lkey;


in ipoib_0230_srq_post_n.patch.
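
One reading consistent with the dump above: the faulting instruction <9169002c> decodes to a 32-bit store of r11 at offset 0x2c from r9, GPR09 is zero, and DAR is 0x2c. The base address computed for the rx_sge entry was therefore zero, which is what would happen if priv->cm.rx_wr_arr had not been allocated on this path. That last step is an assumption and has not been checked against the patch. The sketch below is a user-space illustration only: struct rx_wr, its 32-byte header, MAX_FRAGS, lkey_offset() and fill_lkeys() are hypothetical names and sizes chosen so the arithmetic can be followed. Only the field order of struct sge mirrors the kernel's struct ib_sge.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field order as the kernel's struct ib_sge: addr, length, lkey. */
struct sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

#define MAX_FRAGS 4 /* hypothetical upper bound, for illustration only */

/*
 * Hypothetical stand-in for one element of rx_wr_arr. The 32-byte header
 * is an assumed size picked for this example, not taken from the patch.
 */
struct rx_wr {
    unsigned char wr[32];
    struct sge rx_sge[MAX_FRAGS];
};

/*
 * Byte offset of rx_sge[frag].lkey from the start of element 0. With a
 * NULL array pointer this offset is also the address the store would hit.
 */
size_t lkey_offset(size_t frag)
{
    return offsetof(struct rx_wr, rx_sge)
         + frag * sizeof(struct sge)
         + offsetof(struct sge, lkey);
}

/*
 * Same loop shape as the excerpt above, with a guard in front of it.
 * Returns 0 on success and -1 if the array is missing or num_frags is
 * larger than the element can hold.
 */
int fill_lkeys(struct rx_wr *arr, size_t n, size_t num_frags, uint32_t lkey)
{
    size_t i, j;

    if (arr == NULL || num_frags > MAX_FRAGS)
        return -1;

    for (j = 0; j < n; ++j)
        for (i = 0; i < num_frags; ++i)
            arr[j].rx_sge[i].lkey = lkey;

    return 0;
}
```

Under these assumed sizes, lkey_offset(0) comes out as 0x2c, and fill_lkeys() returns -1 for a NULL array instead of storing through it.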

After removing this patch, the same touch test no longer hits the Oops, so removing it seems to solve the problem.

Pradeep



