[ofa-general] I got a kernel oops in ib_umad when executing ULP tests
Dotan Barak
dotanb at dev.mellanox.co.il
Tue Nov 27 01:24:07 PST 2007
Hi,
While running the SDP tests (stress_connect), I got a kernel oops in
ib_umad on my machine.
Here are the machine properties:
*************************************************************
Host Name : sw112/3
Host Architecture : x86_64
Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10
Kernel Version : 2.6.16.21-0.8-smp
GCC Version : gcc (GCC) 4.1.0 (SUSE Linux)
Memory size : 4049452 kB
Number of CPUs : 4
cpu MHz : 3192.308
MST Version : 4.4.3
Driver Version : ofa_1_3_dev-20071126-0855
HCA ID(s) : mlx4_0
HCA model(s) : 25418
Board(s) : MT_04A0110002
*************************************************************
Here is the dump from /var/log/messages:
Nov 27 09:26:32 sw112 OpenSM[24713]: Exiting SM
Nov 27 09:26:32 sw112 kernel: general protection fault: 0000 [1] SMP
Nov 27 09:26:32 sw112 kernel: last sysfs file: /class/net/ib0/address
Nov 27 09:26:32 sw112 kernel: CPU 2
Nov 27 09:26:32 sw112 kernel: Modules linked in: mst_pciconf mst_pci
rdma_ucm rds ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm
ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core
memtrack autofs4 ipv6 nfs lockd nfs_acl sunrpc af_packet
button battery ac apparmor aamatch_pcre loop dm_mod ide_cd uhci_hcd
ehci_hcd cdrom shpchp pci_hotplug hw_random i8xx_tco usbcore
e1000 ext3 jbd edd fan thermal processor sg mptspi mptscsih
mptbase scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
Nov 27 09:26:32 sw112 kernel: Pid: 24713, comm: opensm Tainted: PF U
2.6.16.21-0.8-smp #1
Nov 27 09:26:32 sw112 kernel: RIP: 0010:[<ffffffff8837d39f>]
<ffffffff8837d39f>{:ib_umad:dequeue_send+26}
Nov 27 09:26:32 sw112 kernel: RSP: 0018:ffff8100c0d9fde8 EFLAGS: 00010046
Nov 27 09:26:32 sw112 kernel: RAX: ffff8100c1a95658 RBX:
3f40a6f32b5a2004 RCX: 3f40a6f32b5a2014
Nov 27 09:26:32 sw112 kernel: RDX: ffff8100c0d9fe58 RSI:
3f40a6f32b5a2004 RDI: ffff81007401ac3c
Nov 27 09:26:32 sw112 kernel: RBP: 3f40a6f32b5a2004 R08:
0000000000000206 R09: 00000000000007d7
Nov 27 09:26:32 sw112 kernel: R10: 0000000000000000 R11:
0000000000000246 R12: ffff81007401ac00
Nov 27 09:26:32 sw112 kernel: R13: ffff81007401a210 R14:
0000000000000005 R15: 0000000000000000
Nov 27 09:26:32 sw112 kernel: FS: 00002b13822edef0(0000)
GS:ffff81012bd6b340(0000) knlGS:0000000000000000
Nov 27 09:26:32 sw112 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Nov 27 09:26:32 sw112 kernel: CR2: 00000000005d99c0 CR3:
0000000037079000 CR4: 00000000000006e0
Nov 27 09:26:32 sw112 kernel: Process opensm (pid: 24713, threadinfo
ffff8100c0d9e000, task ffff8100cd8047d0)
Nov 27 09:26:32 sw112 kernel: Stack: ffff81012d706b10 ffff8100c0d9fe68
ffff81007401ac00 ffffffff8837d4b1
Nov 27 09:26:32 sw112 kernel: 0000000000000296 ffff8100c0d9fe40
ffff81007401a210 ffff81007401a200
Nov 27 09:26:32 sw112 kernel: 0000000000000005 ffffffff8827261e
Nov 27 09:26:32 sw112 kernel: Call Trace:
<ffffffff8837d4b1>{:ib_umad:send_handler+38}
Nov 27 09:26:32 sw112 kernel:
<ffffffff8827261e>{:ib_mad:ib_unregister_mad_agent+359}
Nov 27 09:26:32 sw112 kernel:
<ffffffff8837d26b>{:ib_umad:ib_umad_unreg_agent+121}
Nov 27 09:26:32 sw112 kernel:
<ffffffff8837db37>{:ib_umad:ib_umad_ioctl+74}
<ffffffff8018b6b9>{do_ioctl+33}
Nov 27 09:26:32 sw112 kernel: <ffffffff8018b94b>{vfs_ioctl+584}
<ffffffff801e7e6b>{__up_write+33}
Nov 27 09:26:32 sw112 kernel: <ffffffff8018b9c6>{sys_ioctl+98}
<ffffffff8010a7be>{system_call+126}
Nov 27 09:26:32 sw112 kernel:
Nov 27 09:26:32 sw112 kernel: Code: 48 8b 53 10 48 8b 41 08 48 89 42 08
48 89 10 48 c7 41 08 00
Nov 27 09:26:32 sw112 kernel: RIP
<ffffffff8837d39f>{:ib_umad:dequeue_send+26} RSP <ffff8100c0d9fde8>
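A quick way to sanity-check the pointers in the register dump above:
the first Code bytes, 48 8b 53 10, decode to mov 0x10(%rbx),%rdx, and
RBX holds 3f40a6f32b5a2004. On x86_64 with 48-bit virtual addresses, a
pointer is canonical only if bits 63..47 are all equal, and a
dereference of a non-canonical pointer raises a general protection
fault instead of a page fault. The sketch below tests that property for
a few of the logged registers. It assumes that the Code: line starts at
the faulting instruction and that this machine uses 48-bit addressing;
the helper name is_canonical is made up for illustration and is not a
kernel or OFED function.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit address is canonical for the given VA width.

    Assumption: 48-bit virtual addressing, so bits 63..47 must all be
    zeros or all be ones.
    """
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return top == 0 or top == all_ones


# Register values copied from the oops above.
regs = {
    "RAX": 0xFFFF8100C1A95658,
    "RBX": 0x3F40A6F32B5A2004,
    "RCX": 0x3F40A6F32B5A2014,
    "RDI": 0xFFFF81007401AC3C,
}

for name, value in regs.items():
    print(f"{name} {value:#018x} canonical={is_canonical(value)}")
```

Under those assumptions RBX and RCX come out as non-canonical while RAX
and RDI look like ordinary kernel addresses, which would be consistent
with dequeue_send following a bad list pointer. This only narrows down
the faulting access; it does not show how the pointer got corrupted.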
Here is the dump of /var/log/opensm.log:
Nov 27 09:26:44 546327 [D6AC7EF0] 0x03 -> OpenSM 3.1.7
Nov 27 09:26:44 546407 [D6AC7EF0] 0x80 -> OpenSM 3.1.7
Nov 27 09:26:44 547422 [D6AC7EF0] 0x02 -> osm_vendor_bind: Binding to
port 0x4025
Nov 27 09:26:44 673957 [D6AC7EF0] 0x01 -> osm_vendor_bind: ERR 5426:
Unable to register class 129 version 1
Nov 27 09:26:44 674032 [D6AC7EF0] 0x01 -> osm_sm_mad_ctrl_bind: ERR
3118: Vendor specific bind failed
Nov 27 09:26:44 674057 [D6AC7EF0] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD
Controller bind failed (IB_ERROR)
Nov 27 09:26:44 674089 [D6AC7EF0] 0x01 -> osm_sa_mad_ctrl_unbind: ERR
1A11: No previous bind
Nov 27 09:26:44 675165 [D6AC7EF0] 0x80 -> Exiting SM
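For longer logs it can help to pull out only the error entries. The
sketch below rejoins the wrapped lines and extracts the reporting
function, the error code, and the message from each ERR entry. The
regular expressions are inferred from the excerpt above and are only an
assumption about the opensm.log layout; the helper names unwrap and
errors are made up for illustration.

```python
import re

# Excerpt copied from the opensm.log dump above, including the line wraps.
LOG = """\
Nov 27 09:26:44 673957 [D6AC7EF0] 0x01 -> osm_vendor_bind: ERR 5426:
Unable to register class 129 version 1
Nov 27 09:26:44 674032 [D6AC7EF0] 0x01 -> osm_sm_mad_ctrl_bind: ERR
3118: Vendor specific bind failed
Nov 27 09:26:44 674057 [D6AC7EF0] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD
Controller bind failed (IB_ERROR)
Nov 27 09:26:44 674089 [D6AC7EF0] 0x01 -> osm_sa_mad_ctrl_unbind: ERR
1A11: No previous bind
Nov 27 09:26:44 675165 [D6AC7EF0] 0x80 -> Exiting SM
"""

# Assumed entry prefix: "Mon DD HH:MM:SS usec [thread-id]".
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d+ \[")
# Assumed error layout: "-> function: ERR CODE: message".
ERR = re.compile(r"-> (\w+): ERR ([0-9A-F]+): (.*)$")


def unwrap(text):
    """Join continuation lines onto the log entry they belong to."""
    entries = []
    for line in text.splitlines():
        if ENTRY_START.match(line) or not entries:
            entries.append(line.strip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def errors(text):
    """Return (function, code, message) tuples for every ERR entry."""
    found = []
    for entry in unwrap(text):
        match = ERR.search(entry)
        if match:
            found.append(match.groups())
    return found


for func, code, message in errors(LOG):
    print(f"{code} {func}: {message}")
```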
Can you check this issue?
Thanks,
Dotan