[openib-general] Re: ipoib oops

Michael S. Tsirkin mst at mellanox.co.il
Mon Nov 7 07:57:56 PST 2005


> Quoting Michael S. Tsirkin <mst at mellanox.co.il>:
> Subject: ipoib oops
> 
> Hi!
> I saw this in /var/log/messages recently.
> Unfortunately I can't say exactly what I did to trigger this problem.

Oops, I left out part of the log.
Here it is in full.
Actually, I had opensm running on the same node, and it currently
appears stuck in the defunct state - I wonder whether
we have the umad module crashing and corrupting the ipoib data
structures, or the reverse.

------------------------------

Unable to handle kernel NULL pointer dereference at 0000000000000488 RIP:
<ffffffff88040154>{:ib_ipoib:ipoib_mcast_join_finish+100}
PGD 1775cc067 PUD 177a21067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core
Pid: 12176, comm: ib_mad1 Not tainted 2.6.14 #4
RIP: 0010:[<ffffffff88040154>] <ffffffff88040154>{:ib_ipoib:ipoib_mcast_join_finish+100}
RSP: 0000:ffff810178d59c58  EFLAGS: 00010282
RAX: 0000000052010000 RBX: 0000000000000000 RCX: 0000000000000010
RDX: ffff8101536b77c0 RSI: ffff8101536b77c0 RDI: ffff8101536b77c0
RBP: ffff8101536b77c0 R08: 0000000000000000 R09: ffff810178d59d38
R10: ffff810178d59df8 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000480 R14: 0000000000000000 R15: ffff810152a32298
FS:  0000000000000000(0000) GS:ffffffff805ff800(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000488 CR3: 0000000154c92000 CR4: 00000000000006e0
Process ib_mad1 (pid: 12176, threadinfo ffff810178d58000, task ffff81017c587580)
Stack: ffff810178d59c90 0000000000000282 ffff810178ae6840 0000000000000286
       ffff81017ca7b400 0000000000000286 ffff81017e825e10 ffff81017560a3c0
       ffff81017e825e10 ffff810176c01000
Call Trace:<ffffffff880410c8>{:ib_ipoib:ipoib_mcast_join_complete+56}
       <ffffffff880021d8>{:ib_core:ib_unpack+200} <ffffffff88037c4c>{:ib_sa:ib_sa_mcmember_rec_callback+76}
       <ffffffff88037472>{:ib_sa:recv_handler+66} <ffffffff8801014d>{:ib_mad:ib_mad_completion_handler+957}
       <ffffffff8800fd90>{:ib_mad:ib_mad_completion_handler+0}
       <ffffffff80146fac>{worker_thread+476} <ffffffff80131150>{default_wake_function+0}
       <ffffffff8012df43>{__wake_up_common+67} <ffffffff80131150>{default_wake_function+0}
       <ffffffff8014b4b0>{keventd_create_kthread+0} <ffffffff80146dd0>{worker_thread+0}
       <ffffffff8014b4b0>{keventd_create_kthread+0} <ffffffff8014b609>{kthread+217}
       <ffffffff8010e7a6>{child_rip+8} <ffffffff8014b4b0>{keventd_create_kthread+0}
       <ffffffff8014b530>{kthread+0} <ffffffff8010e79e>{child_rip+0}

Code: 49 8b 7d 08 48 81 c7 cc 01 00 00 f3 a6 75 17 49 8b 45 70 8b
RIP <ffffffff88040154>{:ib_ipoib:ipoib_mcast_join_finish+100} RSP
<ffff810178d59c58>
CR2: 0000000000000488
 <1>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff88032669>{:ib_umad:send_handler+41}
PGD 0
Oops: 0000 [2] SMP
CPU 1
Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core
Pid: 12203, comm: opensm Not tainted 2.6.14 #4
RIP: 0010:[<ffffffff88032669>] <ffffffff88032669>{:ib_umad:send_handler+41}
RSP: 0018:ffff81017bae7bb8  EFLAGS: 00010296
RAX: ffff810178970d98 RBX: ffff81017bae7c78 RCX: ffff810178970d68
RDX: ffff81017bae7c68 RSI: ffff81017bae7c78 RDI: ffff81017e825c10
RBP: 0000000000000000 R08: ffff81017bae6000 R09: 0000000000000100
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81017e825c10
R13: ffff810176c01000 R14: ffff81017bae7c68 R15: ffff81017bae7ef8
FS:  0000000000000000(0000) GS:ffffffff805ff880(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process opensm (pid: 12203, threadinfo ffff81017bae6000, task ffff81017bbad0c0)
Stack: ffff810178e780e0 ffff81017bae7c50 ffff81017e825c10 ffff81017e825c00
       ffff81017bae7c78 ffffffff880109c6 0100000000000000 0000000000000435
       0000000000000000 ffff810152d71000
Call Trace:<ffffffff880109c6>{:ib_mad:ib_unregister_mad_agent+406}
       <ffffffff88019658>{:ib_mthca:mthca_cmd_box+72}
<ffffffff88032256>{:ib_umad:ib_umad_close+70}
       <ffffffff8017d7a2>{__fput+178} <ffffffff8017a59e>{filp_close+110}
       <ffffffff80137983>{put_files_struct+115} <ffffffff801391bb>{do_exit+507}
       <ffffffff80140c85>{__dequeue_signal+501}
<ffffffff80139cac>{do_group_exit+236}
       <ffffffff801436c7>{get_signal_to_deliver+1431}
<ffffffff8010cd8f>{do_signal+159}
       <ffffffff801427e5>{kill_proc_info+85} <ffffffff80142afc>{sys_kill+348}
       <ffffffff8010d9a7>{sysret_signal+28}
<ffffffff8010dc8f>{ptregscall_common+103}


Code: 48 8b 45 00 48 8b 78 18 e8 9a 01 fd ff 48 8b 7d 00 e8 51 ce
RIP <ffffffff88032669>{:ib_umad:send_handler+41} RSP <ffff81017bae7bb8>
CR2: 0000000000000000
 <1>Fixing recursive fault but reboot is needed!
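
As a rough sanity check on the two faults, the first instruction in each
Code: line can be decoded by hand and compared against CR2. The sketch below
assumes the Code: bytes start at the faulting RIP, and it only handles the one
encoding that appears here (a 64-bit MOV load with an 8-bit displacement and
no SIB byte). The helper names are made up for illustration; the register and
CR2 values are copied from the log above.

```python
# Names of the 16 general-purpose registers, indexed by their x86-64 encoding.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mov_load(code_line):
    """Decode the leading 'REX.W 8B /r' instruction (disp8 form, no SIB).

    Returns (destination register, base register, displacement).
    """
    rex, opcode, modrm, disp = bytes.fromhex(code_line.replace(" ", ""))[:4]
    if rex & 0xF8 != 0x48 or opcode != 0x8B:
        raise ValueError("not a 64-bit MOV load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the disp8, no-SIB form is handled")
    dest = REG64[reg | ((rex & 0x4) << 1)]  # REX.R extends the reg field
    base = REG64[rm | ((rex & 0x1) << 3)]   # REX.B extends the r/m field
    if disp >= 0x80:                        # disp8 is sign-extended
        disp -= 0x100
    return dest, base, disp


def fault_address(code_line, regs):
    """Effective address the leading instruction would read from."""
    _, base, disp = decode_mov_load(code_line)
    return (regs[base] + disp) & 0xFFFFFFFFFFFFFFFF


# (symbol, Code: line, relevant register value, CR2), all taken from the log.
OOPSES = [
    ("ipoib_mcast_join_finish+100",
     "49 8b 7d 08 48 81 c7 cc 01 00 00 f3 a6 75 17 49 8b 45 70 8b",
     {"r13": 0x0000000000000480}, 0x0000000000000488),
    ("send_handler+41",
     "48 8b 45 00 48 8b 78 18 e8 9a 01 fd ff 48 8b 7d 00 e8 51 ce",
     {"rbp": 0x0000000000000000}, 0x0000000000000000),
]

for symbol, code, regs, cr2 in OOPSES:
    dest, base, disp = decode_mov_load(code)
    addr = fault_address(code, regs)
    print(f"{symbol}: mov {disp:#x}(%{base}),%{dest} reads {addr:#x}, "
          f"CR2 {cr2:#x}, match={addr == cr2}")
```

Under those assumptions, both computed addresses agree with the CR2 values
in the log: the ipoib fault is a read at offset 8 from R13 (0x480), and the
umad fault is a read through a NULL RBP.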


-- 
MST