[ofa-general] Oops with today's OFED 1.3

Pradeep Satyanarayana pradeeps at linux.vnet.ibm.com
Tue Feb 5 22:18:18 PST 2008


Pradeep Satyanarayana wrote:
> Eli Cohen wrote:
>> Pradeep,
>> Can you check if this is resolved?
>>
>> On 2/4/08, Pradeep Satyanarayana <pradeeps at linux.vnet.ibm.com> wrote:
>>> I pulled today's (Feb 4th) OFED build and saw the following Oops while touch testing
>>> on ehca1 on a 2.6.24 kernel.
>>>
> 
> <snip>
> 
> 
>>> NIP [d000000000299ca8] .ipoib_cm_dev_init+0x440/0x63c [ib_ipoib]
>>> LR [d000000000299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib]
>>> Call Trace:
>>> [c0000001cc85f630] [d000000000299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib] (unreliable)
>>> [c0000001cc85f7d0] [d000000000297f4c] .ipoib_transport_dev_init+0x120/0x458 [ib_ipoib]
>>> [c0000001cc85f930] [d00000000029463c] .ipoib_ib_dev_init+0x44/0xb8 [ib_ipoib]
>>> [c0000001cc85f9c0] [d0000000002902ec] .ipoib_dev_init+0xe0/0x138 [ib_ipoib]
>>> [c0000001cc85fa60] [d000000000290544] .ipoib_add_one+0x200/0x424 [ib_ipoib]
>>> [c0000001cc85fb20] [d0000000001610e4] .ib_register_client+0x94/0xf4 [ib_core]
>>> [c0000001cc85fbb0] [d00000000029dcac] .ipoib_init_module+0x1f8/0x246c [ib_ipoib]
>>> [c0000001cc85fc70] [c0000000000905f0] .sys_init_module+0x176c/0x187c
>>> [c0000001cc85fe30] [c00000000000852c] syscall_exit+0x0/0x40
>>> Instruction dump:
>>> 801f0f20 3b600000 2f800000 409d0040 e81f0f30 e97f04f0 7b6926e4 395b0001
>>> 7d5b07b4 7c080214 816b0018 7d290214 <9169002c> 60000000 60000000 60000000
> 
> Hello Eli,
> 
> Yes, this particular issue has been solved. However, I do see some other issues.
> 
> I am seeing some new messages (not seen previously) in dmesg relating to
> ib_cq_destroy() (on ehca):
> 
> ib0: ib_cq_destroy failed
> ib_destroy_srq failed: -16
> ib_dealloc_pd failed
> 
> This happens after some network tests and an rmmod of ib_ehca.
> 
> At this point my guess is that this has to do with the split CQ patch. I have not 
> had enough cycles to state that with absolute certainty. Can you please take a look too?
> 
> Pradeep
> 

I looked at this some more. The error occurs because ib_cq_destroy() for the rcq failed.
After that there is a series of cascading failures.

Pradeep



