[ofa-general] I got a kernel oops in ib_umad when executing ULP tests

Dotan Barak dotanb at dev.mellanox.co.il
Tue Nov 27 01:24:07 PST 2007


Hi.

While running the SDP tests (stress_connect), I got a kernel oops in 
ib_umad on my machine:

Here are the machine properties:
*************************************************************
Host Name         : sw112/3
Host Architecture : x86_64
Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10
Kernel Version    : 2.6.16.21-0.8-smp
GCC Version       : gcc (GCC) 4.1.0 (SUSE Linux)
Memory size       : 4049452 kB
Number of CPUs    : 4
cpu MHz           : 3192.308
MST Version       : 4.4.3
Driver Version    : ofa_1_3_dev-20071126-0855
HCA ID(s)         : mlx4_0
HCA model(s)      : 25418
Board(s)          : MT_04A0110002
*************************************************************

Here is the dump of the /var/log/messages:
Nov 27 09:26:32 sw112 OpenSM[24713]: Exiting SM
Nov 27 09:26:32 sw112 kernel: general protection fault: 0000 [1] SMP
Nov 27 09:26:32 sw112 kernel: last sysfs file: /class/net/ib0/address
Nov 27 09:26:32 sw112 kernel: CPU 2
Nov 27 09:26:32 sw112 kernel: Modules linked in: mst_pciconf mst_pci rdma_ucm rds ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core memtrack autofs4 ipv6 nfs lockd nfs_acl sunrpc af_packet button battery ac apparmor aamatch_pcre loop dm_mod ide_cd uhci_hcd ehci_hcd cdrom shpchp pci_hotplug hw_random i8xx_tco usbcore e1000 ext3 jbd edd fan thermal processor sg mptspi mptscsih mptbase scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
Nov 27 09:26:32 sw112 kernel: Pid: 24713, comm: opensm Tainted: PF    U 
2.6.16.21-0.8-smp #1
Nov 27 09:26:32 sw112 kernel: RIP: 0010:[<ffffffff8837d39f>] 
<ffffffff8837d39f>{:ib_umad:dequeue_send+26}
Nov 27 09:26:32 sw112 kernel: RSP: 0018:ffff8100c0d9fde8  EFLAGS: 00010046
Nov 27 09:26:32 sw112 kernel: RAX: ffff8100c1a95658 RBX: 
3f40a6f32b5a2004 RCX: 3f40a6f32b5a2014
Nov 27 09:26:32 sw112 kernel: RDX: ffff8100c0d9fe58 RSI: 
3f40a6f32b5a2004 RDI: ffff81007401ac3c
Nov 27 09:26:32 sw112 kernel: RBP: 3f40a6f32b5a2004 R08: 
0000000000000206 R09: 00000000000007d7
Nov 27 09:26:32 sw112 kernel: R10: 0000000000000000 R11: 
0000000000000246 R12: ffff81007401ac00
Nov 27 09:26:32 sw112 kernel: R13: ffff81007401a210 R14: 
0000000000000005 R15: 0000000000000000
Nov 27 09:26:32 sw112 kernel: FS:  00002b13822edef0(0000) 
GS:ffff81012bd6b340(0000) knlGS:0000000000000000
Nov 27 09:26:32 sw112 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Nov 27 09:26:32 sw112 kernel: CR2: 00000000005d99c0 CR3: 
0000000037079000 CR4: 00000000000006e0
Nov 27 09:26:32 sw112 kernel: Process opensm (pid: 24713, threadinfo 
ffff8100c0d9e000, task ffff8100cd8047d0)
Nov 27 09:26:32 sw112 kernel: Stack: ffff81012d706b10 ffff8100c0d9fe68 
ffff81007401ac00 ffffffff8837d4b1
Nov 27 09:26:32 sw112 kernel:        0000000000000296 ffff8100c0d9fe40 
ffff81007401a210 ffff81007401a200
Nov 27 09:26:32 sw112 kernel:        0000000000000005 ffffffff8827261e
Nov 27 09:26:32 sw112 kernel: Call Trace: 
<ffffffff8837d4b1>{:ib_umad:send_handler+38}
Nov 27 09:26:32 sw112 kernel:        
<ffffffff8827261e>{:ib_mad:ib_unregister_mad_agent+359}
Nov 27 09:26:32 sw112 kernel:        
<ffffffff8837d26b>{:ib_umad:ib_umad_unreg_agent+121}
Nov 27 09:26:32 sw112 kernel:        
<ffffffff8837db37>{:ib_umad:ib_umad_ioctl+74} 
<ffffffff8018b6b9>{do_ioctl+33}
Nov 27 09:26:32 sw112 kernel:        <ffffffff8018b94b>{vfs_ioctl+584} 
<ffffffff801e7e6b>{__up_write+33}
Nov 27 09:26:32 sw112 kernel:        <ffffffff8018b9c6>{sys_ioctl+98} 
<ffffffff8010a7be>{system_call+126}
Nov 27 09:26:32 sw112 kernel:
Nov 27 09:26:32 sw112 kernel: Code: 48 8b 53 10 48 8b 41 08 48 89 42 08 
48 89 10 48 c7 41 08 00
Nov 27 09:26:32 sw112 kernel: RIP 
<ffffffff8837d39f>{:ib_umad:dequeue_send+26} RSP <ffff8100c0d9fde8>



Here is the dump of /var/log/opensm.log:

Nov 27 09:26:44 546327 [D6AC7EF0] 0x03 -> OpenSM 3.1.7
Nov 27 09:26:44 546407 [D6AC7EF0] 0x80 -> OpenSM 3.1.7
Nov 27 09:26:44 547422 [D6AC7EF0] 0x02 -> osm_vendor_bind: Binding to 
port 0x4025
Nov 27 09:26:44 673957 [D6AC7EF0] 0x01 -> osm_vendor_bind: ERR 5426: 
Unable to register class 129 version 1
Nov 27 09:26:44 674032 [D6AC7EF0] 0x01 -> osm_sm_mad_ctrl_bind: ERR 
3118: Vendor specific bind failed
Nov 27 09:26:44 674057 [D6AC7EF0] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD 
Controller bind failed (IB_ERROR)
Nov 27 09:26:44 674089 [D6AC7EF0] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 
1A11: No previous bind
Nov 27 09:26:44 675165 [D6AC7EF0] 0x80 -> Exiting SM


Can you check this issue?

Thanks,
Dotan


