[ofa-general] soft lockup in the kernel mad layer

Or Gerlitz ogerlitz at voltaire.com
Sun Jun 29 05:10:27 PDT 2008


While running some tests against nodes with new HCA firmware (ConnectX FW 2.5),
which seems to be very slow in responding to node info queries, I think I have
hit one or more bugs in the kernel MAD code. For example, after the following
error report from the smpquery diagnostic tool, the process enters the D state
with the trace below:

# smpquery pkeys 11
ibwarn: [403] _do_madrpc: recv failed: Connection timed out
smpquery: iberror: failed: operation pkeys: node info query failed


smpquery      D ffff8100010096f0     0  5017   4694                5004
(L-TLB)ffff81007ce0dd68 0000000000000046 0000000000000000 0000000000000009
       ffff81007e9c29d8 ffff81007e9c2790 ffff810031332080 0000004e66fa4c1b
       0000000000046368 000000003d97ed54
Call Trace: <ffffffff8023c476>{n_tty_receive_buf+3441}
	<ffffffff802d923d>{wait_for_completion+135}
	<ffffffff8012aee7>{default_wake_function+0}
	<ffffffff881f048b>{:ib_mad:ib_cancel_rmpp_recvs+131}
	<ffffffff881ed5d6>{:ib_mad:ib_unregister_mad_agent+866}
	<ffffffff882c310f>{:ib_umad:ib_umad_close+185}
	<ffffffff8018244e>{__fput+174}
	<ffffffff8017fb8b>{filp_close+89}
	<ffffffff80134384>{put_files_struct+108}
	<ffffffff8013542f>{do_exit+641}
	<ffffffff80135ade>{sys_exit_group+0}
	<ffffffff8010ad3e>{system_call+126}

Does anyone have any insight here?

The IB bits used on this node are not the mainline kernel ones but rather
git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 564e9e9383272f4311fd87ff4e5447cfcebad73a

Or.

After this lockup happens to one process, any other process that attempts to
close its handle to the MAD layer, or to open a new one, hangs. Is this a second
issue, or is it expected under this scheme? Here is the trace of one such process:

opensm        D ffff810040025c80     0   395   6389           388
(L-TLB)ffff810075187c28 0000000000000046 00000001ffffffff 0000000000000009
       ffff81007eecea98 ffff81007eece850 ffff81007ef6d100 00000a6ca71f1ef1
       00000000000585bb 0000000300000282
Call Trace: <ffffffff802da0ee>{__down_write+130}
	<ffffffff882c307b>{:ib_umad:ib_umad_close+37}
	<ffffffff8018244e>{__fput+174}
	<ffffffff8017fb8b>{filp_close+89}
	<ffffffff80134384>{put_files_struct+108}
	<ffffffff8013542f>{do_exit+641}
	<ffffffff80135ade>{sys_exit_group+0}
	<ffffffff8013e4bd>{get_signal_to_deliver+1374}
	<ffffffff8010a12f>{do_signal+109}
	<ffffffff8012aee7>{default_wake_function+0}
	<ffffffff8010adc7>{sysret_signal+28}
	<ffffffff8010b04b>{ptregscall_common+103}
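
For anyone trying to reproduce this, a quick way to see which tasks are stuck in
the D state is to scan /proc. The following is only a minimal sketch, not part
of the original test setup: the helper names parse_stat and d_state_tasks are
made up for illustration, and it assumes the usual Linux /proc/<pid>/stat layout
(pid, then the command name in parentheses, then the state letter).

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar])
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_tasks(proc="/proc"):
    """List (pid, comm) for every task currently in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited (or was unreadable) while scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in d_state_tasks():
        print(pid, comm)
```

Running it before and after the failing smpquery call should show whether
further processes (opensm here) pile up behind the first stuck one.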



