[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

Don.Albert at Bull.com Don.Albert at Bull.com
Fri May 26 14:32:16 PDT 2006


Hal,

I rebuilt the opensm executable with the patch you provided.  The patch
fixes (or avoids) the segmentation fault, and opensm comes up and runs.
However, the link is still not becoming operational: on the local side
the port goes to ARMED, and on the remote side it goes to INIT.  The
osm.log seems to show that the MAD packets are timing out.  Here is the
first part of the file; the same sequence just repeats after this at one
minute intervals.

[koa] (ib) root> cat /var/log/osm.log
May 26 14:05:43 369104 [8EFC3D00] -> OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision
May 26 14:05:43 369260 [0000] -> OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision

May 26 14:05:43 370571 [8EFC3D00] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
May 26 14:05:43 370631 [8EFC3D00] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
May 26 14:05:43 373005 [8EFC3D00] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
May 26 14:05:43 374685 [8EFC3D00] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
May 26 14:05:44 172028 [44007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1239) -- dropping
May 26 14:05:44 172070 [44007960] -> umad_receiver: ERR 5411: DR SMP
May 26 14:05:44 172083 [44007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 26 14:05:44 172148 [44007960] -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x1 (SubnGet)
                                D bit...................0x0
                                status..................0x0
                                hop_ptr.................0x0
                                hop_count...............0x1
                                trans_id................0x1239
                                attr_id.................0x11 (NodeInfo)
                                resv....................0x0
                                attr_mod................0x0
                                m_key...................0x0000000000000000
                                dr_slid.................0xFFFF
                                dr_dlid.................0xFFFF

                                Initial path: [0][1]
                                Return path:  [0][0]
                                Reserved:     [0][0][0][0][0][0][0]

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

May 26 14:05:44 172199 [42003960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c90200216dc4 port 1. Adding to light sweep sampling list
May 26 14:05:44 172240 [42003960] -> Directed Path Dump of 0 hop path:
                                Path = [0]
May 26 14:05:44 172256 [0000] -> Entering MASTER state

May 26 14:05:44 179081 [0000] -> SUBNET UP

May 26 14:05:54 180461 [44007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1240) -- dropping
May 26 14:05:54 180515 [44007960] -> umad_receiver: ERR 5411: DR SMP
May 26 14:05:54 180528 [44007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 26 14:05:54 180569 [44007960] -> SMP dump:
                                base_ver................0x1
                                mgmt_class..............0x81
                                class_ver...............0x1
                                method..................0x1 (SubnGet)
                                D bit...................0x0
                                status..................0x0
                                hop_ptr.................0x0
                                hop_count...............0x1
                                trans_id................0x1240
                                attr_id.................0x11 (NodeInfo)
                                resv....................0x0
                                attr_mod................0x0
                                m_key...................0x0000000000000000
                                dr_slid.................0xFFFF
                                dr_dlid.................0xFFFF

                                Initial path: [0][1]
                                Return path:  [0][0]
                                Reserved:     [0][0][0][0][0][0][0]

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

May 26 14:05:54 180624 [42003960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c90200216dc4 port 1. Adding to light sweep sampling list
May 26 14:05:54 180649 [42003960] -> Directed Path Dump of 0 hop path:
                                Path = [0]
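
To see at a glance which errors recur in a log like this one, a small
script along the following lines might help.  It is only a sketch: it
assumes nothing about OpenSM beyond the "ERR <code>:" marker visible in
the log lines, and the names count_errors and ERR_RE are made up for
illustration.

```python
import re
from collections import Counter

# Matches the "ERR <code>:" marker seen in the osm.log lines.
# \s+ also spans a line break, so a marker that a mail client wrapped
# between "ERR" and the code is still found.
ERR_RE = re.compile(r"ERR\s+([0-9A-Fa-f]{4}):")


def count_errors(log_text):
    """Return a Counter mapping each ERR code to how often it appears."""
    return Counter(ERR_RE.findall(log_text))


# A few lines copied from the log, including one wrapped after "ERR".
sample = (
    "May 26 14:05:44 172028 [44007960] -> umad_receiver: ERR 5409: send\n"
    "May 26 14:05:44 172070 [44007960] -> umad_receiver: ERR 5411: DR SMP\n"
    "May 26 14:05:44 172083 [44007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR \n"
    "3113: MAD completed in error (IB_TIMEOUT)\n"
    "May 26 14:05:54 180461 [44007960] -> umad_receiver: ERR 5409: send\n"
)

if __name__ == "__main__":
    for code, count in count_errors(sample).most_common():
        print(code, count)
```

Running it against the whole of /var/log/osm.log instead of the sample
would show whether anything other than 5409, 5411, 3113 and 0108 turns
up.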


The physical link appears to be up.  Here are the ibstat and ibstatus
results for both sides:

Local system

[koa] (ib) root> ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200216dc4
        System image GUID: 0x0002c90200216dc7
        Port 1:
                State: Armed
                Physical state: LinkUp
                Rate: 20
                Base lid: 2
                LMC: 0
                SM lid: 2
                Capability mask: 0x02510a6a
                Port GUID: 0x0002c90200216dc5
[koa] (ib) root> ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0021:6dc5
        base lid:        0x2
        sm lid:          0x2
        state:           3: ARMED
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Remote system

[jatoba] (ib) ib> ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.0.800
        Hardware version: a0
        Node GUID: 0x0002c90200216e40
        System image GUID: 0x0002c90200216e43
        Port 1:
                State: Initializing
                Physical state: LinkUp
                Rate: 20
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510a68
                Port GUID: 0x0002c90200216e41
[jatoba] (ib) ib> ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0021:6e41
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
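
For comparing the two sides without reading the output by eye, here is
a rough sketch that pulls the state numbers out of ibstatus text.  It
assumes the "state: N: NAME" layout shown above; the function names
parse_ibstatus and link_ready are hypothetical, and the check simply
treats logical state 4 (ACTIVE) with physical state 5 (LinkUp) as a
usable link.

```python
import re

# Logical port state numbers as printed by ibstatus (e.g. "3: ARMED").
PORT_STATES = {1: "DOWN", 2: "INIT", 3: "ARMED", 4: "ACTIVE"}


def parse_ibstatus(text):
    """Return (logical_state, physical_state) numbers from ibstatus output."""
    # Anchoring at line start keeps "state:" from matching "phys state:".
    state = re.search(r"^\s*state:\s*(\d+):", text, re.MULTILINE)
    phys = re.search(r"^\s*phys state:\s*(\d+):", text, re.MULTILINE)
    if state is None or phys is None:
        raise ValueError("no state lines found in ibstatus output")
    return int(state.group(1)), int(phys.group(1))


def link_ready(text):
    """True only when the port is physically LinkUp (5) and ACTIVE (4)."""
    state, phys = parse_ibstatus(text)
    return phys == 5 and state == 4


# Output copied from the local system above.
local = """Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c902:0021:6dc5
        base lid:        0x2
        sm lid:          0x2
        state:           3: ARMED
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
"""

if __name__ == "__main__":
    state, phys = parse_ibstatus(local)
    print(PORT_STATES.get(state, "UNKNOWN"), phys, link_ready(local))
```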

An "ibnetdiscover" on the local system gives the following:

[koa] (ib) root> ibnetdiscover
ibwarn: [20638] handle_port: NodeInfo on DR path [0][1] port 1 failed, skipping port
#
# Topology file: generated on Fri May 26 14:24:20 2006
#
# Max of 1 hops discovered
# Initiated from node 0002c90200216dc4 port 0002c90200216dc5

vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200216dc7
caguid=0x2c90200216dc4
Ca      1 "H-0002c90200216dc4"          # koa HCA-1

What next, coach?

  -Don Albert-