[ofa-general] soft lockup in the kernel mad layer
Or Gerlitz
ogerlitz at voltaire.com
Sun Jun 29 05:10:27 PDT 2008
While running some tests against nodes with new HCA firmware (ConnectX FW 2.5),
which seems to respond very slowly to node info queries, I think I have hit one
or more bugs in the kernel mad code. For example, after the following error report
from the smpquery diag, the process goes into the D state with the trace below:
# smpquery pkeys 11
ibwarn: [403] _do_madrpc: recv failed: Connection timed out
smpquery: iberror: failed: operation pkeys: node info query failed
smpquery D ffff8100010096f0 0 5017 4694 5004 (L-TLB)
ffff81007ce0dd68 0000000000000046 0000000000000000 0000000000000009
ffff81007e9c29d8 ffff81007e9c2790 ffff810031332080 0000004e66fa4c1b
0000000000046368 000000003d97ed54
Call Trace: <ffffffff8023c476>{n_tty_receive_buf+3441}
<ffffffff802d923d>{wait_for_completion+135}
<ffffffff8012aee7>{default_wake_function+0}
<ffffffff881f048b>{:ib_mad:ib_cancel_rmpp_recvs+131}
<ffffffff881ed5d6>{:ib_mad:ib_unregister_mad_agent+866}
<ffffffff882c310f>{:ib_umad:ib_umad_close+185}
<ffffffff8018244e>{__fput+174}
<ffffffff8017fb8b>{filp_close+89}
<ffffffff80134384>{put_files_struct+108}
<ffffffff8013542f>{do_exit+641}
<ffffffff80135ade>{sys_exit_group+0}
<ffffffff8010ad3e>{system_call+126}
Does anyone have any insight here?
The IB bits used on this node are not the mainline kernel ones but rather
git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 564e9e9383272f4311fd87ff4e5447cfcebad73a
Or.
After this lockup happens to one process, any other process that attempts to close
its handle to the mad layer, or to open a new one, hangs as well. Is this a second
issue, or is it expected under this scheme? For example:
opensm D ffff810040025c80 0 395 6389 388 (L-TLB)
ffff810075187c28 0000000000000046 00000001ffffffff 0000000000000009
ffff81007eecea98 ffff81007eece850 ffff81007ef6d100 00000a6ca71f1ef1
00000000000585bb 0000000300000282
Call Trace: <ffffffff802da0ee>{__down_write+130}
<ffffffff882c307b>{:ib_umad:ib_umad_close+37}
<ffffffff8018244e>{__fput+174} <ffffffff8017fb8b>{filp_close+89}
<ffffffff80134384>{put_files_struct+108}
<ffffffff8013542f>{do_exit+641}
<ffffffff80135ade>{sys_exit_group+0}
<ffffffff8013e4bd>{get_signal_to_deliver+1374}
<ffffffff8010a12f>{do_signal+109}
<ffffffff8012aee7>{default_wake_function+0}
<ffffffff8010adc7>{sysret_signal+28}
<ffffffff8010b04b>{ptregscall_common+103}
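
To compare the two traces above, it can help to pull out just the frames that belong to a given module. The following is a minimal sketch, not part of the original report: the helper names parse_frames and module_frames are hypothetical, and the regular expression assumes the <address>{:module:symbol+offset} frame format seen in these traces.

```python
import re

# One frame looks like <ffffffff881f048b>{:ib_mad:ib_cancel_rmpp_recvs+131}
# or, for core kernel symbols, <ffffffff8018244e>{__fput+174}.
# This pattern is an assumption based on the traces in this post.
FRAME_RE = re.compile(
    r"<(?P<addr>[0-9a-f]{16})>"
    r"\{(?::(?P<module>\w+):)?(?P<symbol>\w+)\+(?P<offset>\d+)\}"
)


def parse_frames(text):
    """Return (module, symbol, offset) for every frame found in text.

    module is None for symbols that carry no module prefix.
    """
    return [
        (m.group("module"), m.group("symbol"), int(m.group("offset")))
        for m in FRAME_RE.finditer(text)
    ]


def module_frames(text, module):
    """Return (symbol, offset) pairs for frames belonging to module."""
    return [(s, o) for (mod, s, o) in parse_frames(text) if mod == module]


# Excerpts copied from the smpquery and opensm traces in this post.
first_trace = """
<ffffffff802d923d>{wait_for_completion+135}
<ffffffff881f048b>{:ib_mad:ib_cancel_rmpp_recvs+131}
<ffffffff881ed5d6>{:ib_mad:ib_unregister_mad_agent+866}
<ffffffff882c310f>{:ib_umad:ib_umad_close+185}
"""

second_trace = """
<ffffffff802da0ee>{__down_write+130}
<ffffffff882c307b>{:ib_umad:ib_umad_close+37}
<ffffffff8018244e>{__fput+174} <ffffffff8017fb8b>{filp_close+89}
"""

if __name__ == "__main__":
    print("smpquery:", module_frames(first_trace, "ib_umad"))
    print("opensm:  ", module_frames(second_trace, "ib_umad"))
```

Both tasks turn out to be inside ib_umad_close, at different offsets: smpquery at +185 on its way into ib_unregister_mad_agent, opensm at +37 on its way into __down_write. That would be consistent with the second task waiting on a semaphore held by the first, though the traces alone do not prove it.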