[ofa-general] Re: i got kernel oops in ib_umad when executing ULPs tests

Sasha Khapyorsky sashak at voltaire.com
Wed Nov 28 07:17:45 PST 2007


Hi Dotan,

On 11:24 Tue 27 Nov     , Dotan Barak wrote:
>  Hi.
> 
>  When executing SDP tests (stress_connect), I got a kernel oops in ib_umad 
>  on my machine:

Is it reproducible somehow?

> 
>  Here are the machine props:
>  *************************************************************
>  Host Name         : sw112/3
>  Host Architecture : x86_64
>  Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10
>  Kernel Version    : 2.6.16.21-0.8-smp
>  GCC Version       : gcc (GCC) 4.1.0 (SUSE Linux)
>  Memory size       : 4049452 kB
>  Number of CPUs    : 4
>  cpu MHz           : 3192.308
>  MST Version       : 4.4.3
>  Driver Version    : ofa_1_3_dev-20071126-0855
>  HCA ID(s)         : mlx4_0
>  HCA model(s)      : 25418
>  Board(s)          : MT_04A0110002
>  *************************************************************
> 
>  Here is the dump of the /var/log/messages:
>  Nov 27 09:26:32 sw112 OpenSM[24713]: Exiting SM
>  Nov 27 09:26:32 sw112 kernel: general protection fault: 0000 [1] SMP
>  Nov 27 09:26:32 sw112 kernel: last sysfs file: /class/net/ib0/address
>  Nov 27 09:26:32 sw112 kernel: CPU 2
>  Nov 27 09:26:32 sw112 kernel: Modules linked in: mst_pciconf mst_pci 
>  rdma_ucm rds ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm
>  ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core memtrack 
>  autofs4 ipv6 nfs lockd nfs_acl sunrpc af_packet
>  button battery ac apparmor aamatch_pcre loop dm_mod ide_cd uhci_hcd ehci_hcd 
>  cdrom shpchp pci_hotplug hw_random i8xx_tco usbcore
>  e1000 ext3 jbd edd fan thermal processor sg mptspi mptscsih mptbase 
>  scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
>  Nov 27 09:26:32 sw112 kernel: Pid: 24713, comm: opensm Tainted: PF    U 
>  2.6.16.21-0.8-smp #1
>  Nov 27 09:26:32 sw112 kernel: RIP: 0010:[<ffffffff8837d39f>] 
>  <ffffffff8837d39f>{:ib_umad:dequeue_send+26}
>  Nov 27 09:26:32 sw112 kernel: RSP: 0018:ffff8100c0d9fde8  EFLAGS: 00010046
>  Nov 27 09:26:32 sw112 kernel: RAX: ffff8100c1a95658 RBX: 3f40a6f32b5a2004 
>  RCX: 3f40a6f32b5a2014
>  Nov 27 09:26:32 sw112 kernel: RDX: ffff8100c0d9fe58 RSI: 3f40a6f32b5a2004 
>  RDI: ffff81007401ac3c
>  Nov 27 09:26:32 sw112 kernel: RBP: 3f40a6f32b5a2004 R08: 0000000000000206 
>  R09: 00000000000007d7
>  Nov 27 09:26:32 sw112 kernel: R10: 0000000000000000 R11: 0000000000000246 
>  R12: ffff81007401ac00
>  Nov 27 09:26:32 sw112 kernel: R13: ffff81007401a210 R14: 0000000000000005 
>  R15: 0000000000000000
>  Nov 27 09:26:32 sw112 kernel: FS:  00002b13822edef0(0000) 
>  GS:ffff81012bd6b340(0000) knlGS:0000000000000000
>  Nov 27 09:26:32 sw112 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
>  000000008005003b
>  Nov 27 09:26:32 sw112 kernel: CR2: 00000000005d99c0 CR3: 0000000037079000 
>  CR4: 00000000000006e0
>  Nov 27 09:26:32 sw112 kernel: Process opensm (pid: 24713, threadinfo 
>  ffff8100c0d9e000, task ffff8100cd8047d0)
>  Nov 27 09:26:32 sw112 kernel: Stack: ffff81012d706b10 ffff8100c0d9fe68 
>  ffff81007401ac00 ffffffff8837d4b1
>  Nov 27 09:26:32 sw112 kernel:        0000000000000296 ffff8100c0d9fe40 
>  ffff81007401a210 ffff81007401a200
>  Nov 27 09:26:32 sw112 kernel:        0000000000000005 ffffffff8827261e
>  Nov 27 09:26:32 sw112 kernel: Call Trace: 
>  <ffffffff8837d4b1>{:ib_umad:send_handler+38}
>  Nov 27 09:26:32 sw112 kernel:        
>  <ffffffff8827261e>{:ib_mad:ib_unregister_mad_agent+359}
>  Nov 27 09:26:32 sw112 kernel:        
>  <ffffffff8837d26b>{:ib_umad:ib_umad_unreg_agent+121}
>  Nov 27 09:26:32 sw112 kernel:        
>  <ffffffff8837db37>{:ib_umad:ib_umad_ioctl+74} 
>  <ffffffff8018b6b9>{do_ioctl+33}
>  Nov 27 09:26:32 sw112 kernel:        <ffffffff8018b94b>{vfs_ioctl+584} 
>  <ffffffff801e7e6b>{__up_write+33}
>  Nov 27 09:26:32 sw112 kernel:        <ffffffff8018b9c6>{sys_ioctl+98} 
>  <ffffffff8010a7be>{system_call+126}
>  Nov 27 09:26:32 sw112 kernel:
>  Nov 27 09:26:32 sw112 kernel: Code: 48 8b 53 10 48 8b 41 08 48 89 42 08 48 
>  89 10 48 c7 41 08 00
>  Nov 27 09:26:32 sw112 kernel: RIP 
>  <ffffffff8837d39f>{:ib_umad:dequeue_send+26} RSP <ffff8100c0d9fde8>
> 
> 
> 
>  Here is the dump of /var/log/opensm.log:
> 
>  Nov 27 09:26:44 546327 [D6AC7EF0] 0x03 -> OpenSM 3.1.7
>  Nov 27 09:26:44 546407 [D6AC7EF0] 0x80 -> OpenSM 3.1.7
>  Nov 27 09:26:44 547422 [D6AC7EF0] 0x02 -> osm_vendor_bind: Binding to port 
>  0x4025
   ^^^^^^
Is this a valid GUID?

>  Nov 27 09:26:44 673957 [D6AC7EF0] 0x01 -> osm_vendor_bind: ERR 5426: Unable 
>  to register class 129 version 1
>  Nov 27 09:26:44 674032 [D6AC7EF0] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: 
>  Vendor specific bind failed
>  Nov 27 09:26:44 674057 [D6AC7EF0] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD 
>  Controller bind failed (IB_ERROR)
>  Nov 27 09:26:44 674089 [D6AC7EF0] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: 
>  No previous bind
>  Nov 27 09:26:44 675165 [D6AC7EF0] 0x80 -> Exiting SM
> 
> 
>  Can you check this issue?

Could you send the OpenSM log file too?

Sasha


