[ofa-general] Re: i got kernel oops in ib_umad when executing ULPs tests
Sasha Khapyorsky
sashak at voltaire.com
Wed Nov 28 07:17:45 PST 2007
Hi Dotan,
On 11:24 Tue 27 Nov , Dotan Barak wrote:
> Hi.
>
> When executing SDP tests (stress_connect), I got a kernel oops on my machine
> in ib_umad:
Is it reproducible somehow?
>
> Here are the machine props:
> *************************************************************
> Host Name : sw112/3
> Host Architecture : x86_64
> Linux Distribution: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10
> Kernel Version : 2.6.16.21-0.8-smp
> GCC Version : gcc (GCC) 4.1.0 (SUSE Linux)
> Memory size : 4049452 kB
> Number of CPUs : 4
> cpu MHz : 3192.308
> MST Version : 4.4.3
> Driver Version : ofa_1_3_dev-20071126-0855
> HCA ID(s) : mlx4_0
> HCA model(s) : 25418
> Board(s) : MT_04A0110002
> *************************************************************
>
> Here is the dump of the /var/log/messages:
> Nov 27 09:26:32 sw112 OpenSM[24713]: Exiting SM
> Nov 27 09:26:32 sw112 kernel: general protection fault: 0000 [1] SMP
> Nov 27 09:26:32 sw112 kernel: last sysfs file: /class/net/ib0/address
> Nov 27 09:26:32 sw112 kernel: CPU 2
> Nov 27 09:26:32 sw112 kernel: Modules linked in: mst_pciconf mst_pci
> rdma_ucm rds ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm
> ib_sa ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core memtrack
> autofs4 ipv6 nfs lockd nfs_acl sunrpc af_packet
> button battery ac apparmor aamatch_pcre loop dm_mod ide_cd uhci_hcd ehci_hcd
> cdrom shpchp pci_hotplug hw_random i8xx_tco usbcore
> e1000 ext3 jbd edd fan thermal processor sg mptspi mptscsih mptbase
> scsi_transport_spi piix sd_mod scsi_mod ide_disk ide_core
> Nov 27 09:26:32 sw112 kernel: Pid: 24713, comm: opensm Tainted: PF U
> 2.6.16.21-0.8-smp #1
> Nov 27 09:26:32 sw112 kernel: RIP: 0010:[<ffffffff8837d39f>]
> <ffffffff8837d39f>{:ib_umad:dequeue_send+26}
> Nov 27 09:26:32 sw112 kernel: RSP: 0018:ffff8100c0d9fde8 EFLAGS: 00010046
> Nov 27 09:26:32 sw112 kernel: RAX: ffff8100c1a95658 RBX: 3f40a6f32b5a2004
> RCX: 3f40a6f32b5a2014
> Nov 27 09:26:32 sw112 kernel: RDX: ffff8100c0d9fe58 RSI: 3f40a6f32b5a2004
> RDI: ffff81007401ac3c
> Nov 27 09:26:32 sw112 kernel: RBP: 3f40a6f32b5a2004 R08: 0000000000000206
> R09: 00000000000007d7
> Nov 27 09:26:32 sw112 kernel: R10: 0000000000000000 R11: 0000000000000246
> R12: ffff81007401ac00
> Nov 27 09:26:32 sw112 kernel: R13: ffff81007401a210 R14: 0000000000000005
> R15: 0000000000000000
> Nov 27 09:26:32 sw112 kernel: FS: 00002b13822edef0(0000)
> GS:ffff81012bd6b340(0000) knlGS:0000000000000000
> Nov 27 09:26:32 sw112 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
> Nov 27 09:26:32 sw112 kernel: CR2: 00000000005d99c0 CR3: 0000000037079000
> CR4: 00000000000006e0
> Nov 27 09:26:32 sw112 kernel: Process opensm (pid: 24713, threadinfo
> ffff8100c0d9e000, task ffff8100cd8047d0)
> Nov 27 09:26:32 sw112 kernel: Stack: ffff81012d706b10 ffff8100c0d9fe68
> ffff81007401ac00 ffffffff8837d4b1
> Nov 27 09:26:32 sw112 kernel: 0000000000000296 ffff8100c0d9fe40
> ffff81007401a210 ffff81007401a200
> Nov 27 09:26:32 sw112 kernel: 0000000000000005 ffffffff8827261e
> Nov 27 09:26:32 sw112 kernel: Call Trace:
> <ffffffff8837d4b1>{:ib_umad:send_handler+38}
> Nov 27 09:26:32 sw112 kernel:
> <ffffffff8827261e>{:ib_mad:ib_unregister_mad_agent+359}
> Nov 27 09:26:32 sw112 kernel:
> <ffffffff8837d26b>{:ib_umad:ib_umad_unreg_agent+121}
> Nov 27 09:26:32 sw112 kernel:
> <ffffffff8837db37>{:ib_umad:ib_umad_ioctl+74}
> <ffffffff8018b6b9>{do_ioctl+33}
> Nov 27 09:26:32 sw112 kernel: <ffffffff8018b94b>{vfs_ioctl+584}
> <ffffffff801e7e6b>{__up_write+33}
> Nov 27 09:26:32 sw112 kernel: <ffffffff8018b9c6>{sys_ioctl+98}
> <ffffffff8010a7be>{system_call+126}
> Nov 27 09:26:32 sw112 kernel:
> Nov 27 09:26:32 sw112 kernel: Code: 48 8b 53 10 48 8b 41 08 48 89 42 08 48
> 89 10 48 c7 41 08 00
> Nov 27 09:26:32 sw112 kernel: RIP
> <ffffffff8837d39f>{:ib_umad:dequeue_send+26} RSP <ffff8100c0d9fde8>
>
>
>
> Here is the dump of /var/log/opensm.log:
>
> Nov 27 09:26:44 546327 [D6AC7EF0] 0x03 -> OpenSM 3.1.7
> Nov 27 09:26:44 546407 [D6AC7EF0] 0x80 -> OpenSM 3.1.7
> Nov 27 09:26:44 547422 [D6AC7EF0] 0x02 -> osm_vendor_bind: Binding to port
> 0x4025
^^^^^^
Is this a valid GUID?
> Nov 27 09:26:44 673957 [D6AC7EF0] 0x01 -> osm_vendor_bind: ERR 5426: Unable
> to register class 129 version 1
> Nov 27 09:26:44 674032 [D6AC7EF0] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118:
> Vendor specific bind failed
> Nov 27 09:26:44 674057 [D6AC7EF0] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD
> Controller bind failed (IB_ERROR)
> Nov 27 09:26:44 674089 [D6AC7EF0] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11:
> No previous bind
> Nov 27 09:26:44 675165 [D6AC7EF0] 0x80 -> Exiting SM
>
>
> Can you check this issue?
Could you send the OpenSM log file too?
Sasha