[openib-general] Re: IBM eHCA testing..

Troy Benjegerdes troy at scl.ameslab.gov
Thu Oct 13 15:46:47 PDT 2005


On Wed, Oct 12, 2005 at 01:04:37PM +0200, IBMEHCA DD wrote:
> I just released the ehca2_0028 which uses svn 3615 on 
> https://sourceforge.net/projects/ibmehcad/
> As you might notice the license already has changed to the openib.org 
> license.
> 
> With 2.6.13 we had the non-issue that our main focus was on 2.6.5-7.191 
> and we're only now moving to the latest kernel.

I just built against svn 3774 and kernel 2.6.13.3, with the timeout set to
120 seconds. There's some bad interaction going on with OpenSM.

p5l2:~# modprobe hcad_mod ehca_nr_ports=1
[ 6186.855237] eBus Device Driver
[ 6186.907578] eHCA Infiniband Device Driver (Rel.: EHCA2_0028)
[ 6186.912203] xics_enable_irq: irq=36868: ibm_int_on returned fffffffd
p5l2:~# modprobe ib_ipoib
*** hangs for a while.. entries appear in osm.log ***
[ 6309.683651] PU0003 00060103:ehca_parse_ec  EHCA port 1 is available.
[ 6310.253303] kernel BUG in dma_map_single at arch/ppc64/kernel/dma.c:86!
[ 6310.253320] Oops: Exception in kernel mode, sig: 5 [#1]
[ 6310.253339] SMP NR_CPUS=8 NUMA PSERIES LPAR
[ 6310.253364] Modules linked in: ib_mad hcad_mod ib_core ebus
[ 6310.253383] NIP: C00000000000FA10 XER: 00000020 LR: C00000000000F9B0 CTR: C00000000000F980
[ 6310.253400] REGS: c00000000f3bb770 TRAP: 0700   Not tainted (2.6.13.3-power5)
[ 6310.253421] MSR: 8000000000029032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 24002444
[ 6310.253436] DAR: 0000000000000000 DSISR: 0000000000000000
[ 6310.253471] TASK: c00000000209f060[1874] 'modprobe' THREAD: c00000000f3b8000CPU: 7
[ 6310.253492] GPR00: C0000000004B3660 C00000000F3BB9F0 C0000000005EE948 C0000001DBEC5C18
[ 6310.253513] GPR04: C0000003CB5B1D0C 0000000000000128 0000000000000002 0000000000000008
[ 6310.253532] GPR08: C0000003CBD5EEE8 0000000000000000 C00000000F67FC00 C00000000000F980
[ 6310.253553] GPR12: D0000000000621D0 C0000000004B7800 0000000010017078 0000000000000000
[ 6310.253609] GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000001
[ 6310.253665] GPR20: C000000008DE7800 0000000000000002 0000000000000001 C00000000F67FDC8
[ 6310.253688] GPR24: C00000000F67FD40 0000000000000002 C0000001DBEC5C18 0000000000000002
[ 6310.253708] GPR28: 0000000000000128 C0000003CB5B1D0C D00000000006EB00 C0000003CB5B1C80
[ 6310.253731] NIP [c00000000000fa10] .dma_map_single+0x90/0xc0
[ 6310.253753] LR [c00000000000f9b0] .dma_map_single+0x30/0xc0
[ 6310.253778] Call Trace:
[ 6310.253797] [c00000000f3bb9f0] [c000000008de7800] 0xc000000008de7800 (unreliable)
[ 6310.253838] [c00000000f3bba90] [d00000000005aee8] .ib_mad_post_receive_mads+0xb8/0x270 [ib_mad]
[ 6310.253880] [c00000000f3bbb80] [d00000000005c840] .ib_mad_init_device+0x350/0x660 [ib_mad]
[ 6310.253905] [c00000000f3bbc70] [d00000000004d0bc] .ib_register_client+0xdc/0x150 [ib_core]
[ 6310.253936] [c00000000f3bbd00] [d000000000061e6c] .ib_mad_init_module+0x8c/0xf0 [ib_mad]
[ 6310.253999] [c00000000f3bbd90] [c000000000070720] .sys_init_module+0x1e0/0x4d0
[ 6310.254030] [c00000000f3bbe30] [c00000000000d300] syscall_exit+0x0/0x18
[ 6310.254045] Instruction dump:
[ 6310.254053] 4e800421 e8410028 382100a0 e8010010 eb41ffd0 eb61ffd8 eb81ffe0 eba1ffe8
[ 6310.254089] 7c0803a6 4e800020 60000000 60000000 <0fe00000> 382100a0 38600000e8010010
[ 6310.254206]  Segmentation fault

I'm also attaching part of an opensm log file.

(the full copy is at http://scl.ameslab.gov/~troy/osm-ehca.log )

The IBM galaxy adapters are at:
	Initial path: [0][1][16]
	Initial path: [0][1][13]
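
For anyone sifting through the full osm.log, the timed-out SMPs aimed at
one of these adapters can be pulled out with a short script. The following
is a minimal Python sketch, not part of OpenSM: the function names are made
up for illustration, the regular expressions are fitted only to the dump
layout seen in the attached excerpt, and the attr_mod split assumes the
usual P_KeyTable convention (upper 16 bits select a port, lower 16 bits a
P_Key block), which should be checked against the InfiniBand spec.

```python
import re
import sys

# Dotted key/value lines of an "SMP dump", e.g. "attr_mod................0x3E0000".
FIELD_RE = re.compile(r"^\s*([A-Za-z_ ]+?)\.{2,}(0x[0-9A-Fa-f]+)")
# Directed-route line, e.g. "Initial path: [0][1][16]".
PATH_RE = re.compile(r"^\s*Initial path:\s*((?:\[\d+\])+)")


def parse_smp_dumps(lines):
    """Yield one dict per 'SMP dump:' block (hypothetical helper)."""
    current = None
    for line in lines:
        if "SMP dump:" in line:
            if current is not None:
                yield current
            current = {}
            continue
        if current is None:
            continue  # not inside a dump, e.g. the ERR lines
        path = PATH_RE.match(line)
        if path:
            current["initial_path"] = [int(n) for n in re.findall(r"\d+", path.group(1))]
            continue
        field = FIELD_RE.match(line)
        if field:
            current[field.group(1).strip()] = int(field.group(2), 16)
    if current is not None:
        yield current


def split_pkey_attr_mod(attr_mod):
    """Return (port, block), ASSUMING the upper/lower 16-bit P_KeyTable split."""
    return (attr_mod >> 16) & 0xFFFF, attr_mod & 0xFFFF


def dumps_via_path(lines, path):
    """Keep dumps whose initial path starts with `path` (prefix match)."""
    return [d for d in parse_smp_dumps(lines)
            if d.get("initial_path", [])[:len(path)] == path]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python smp_dumps.py osm-ehca.log
    with open(sys.argv[1]) as log:
        for d in dumps_via_path(log, [0, 1, 16]):
            port, block = split_pkey_attr_mod(d.get("attr_mod", 0))
            print(f"trans_id={d.get('trans_id', 0):#x} "
                  f"path={d['initial_path']} port={port} block={block}")
```

Under that assumption, the attr_mod values 0x3E0000, 0x3F0000 and 0x10000
in the excerpt would correspond to ports 62, 63 and 1, each with P_Key
block 0. The prefix match is deliberate, so that requests routed through
the adapter (such as the [0][1][16][2] path) are listed as well.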

-------------- next part --------------
				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Oct 13 10:42:05 978875 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping.
Oct 13 10:42:05 978883 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0
Oct 13 10:42:05 978892 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Oct 13 10:42:05 978925 [42FFF970] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x2
				trans_id................0x1810
				attr_id.................0x16 (P_KeyTable)
				resv....................0x0
				attr_mod................0x3E0000
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][1][16]
				Return path:  [0][0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Oct 13 10:42:06 378879 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping.
Oct 13 10:42:06 378891 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 2 DR SLID 0x0 DR DLID 0x0
Oct 13 10:42:06 378900 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Oct 13 10:42:06 378934 [42FFF970] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x2
				trans_id................0x1811
				attr_id.................0x16 (P_KeyTable)
				resv....................0x0
				attr_mod................0x3F0000
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][1][16]
				Return path:  [0][0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Oct 13 10:42:06 806879 [42FFF970] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=16) -- dropping.
Oct 13 10:42:06 806887 [42FFF970] -> umad_receiver: ERR 5411: DR SMP hop ptr 0 hop count 3 DR SLID 0x0 DR DLID 0x0
Oct 13 10:42:06 806896 [42FFF970] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT).
Oct 13 10:42:06 806930 [42FFF970] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x3
				trans_id................0x1835
				attr_id.................0x16 (P_KeyTable)
				resv....................0x0
				attr_mod................0x10000
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][1][16][2]
				Return path:  [0][0][0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00


