[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

Hal Rosenstock halr at voltaire.com
Fri May 26 17:59:46 PDT 2006


Don,

On Fri, 2006-05-26 at 17:32, Don.Albert at Bull.com wrote:
> Hal,
> 
> I rebuilt the opensm executable with the patch you provided.   The
> patch fixes (or avoids) the segmentation fault and opensm comes up and
> runs.

Thanks for trying this out.

>   However, the link is still not becoming operational.   On the local
> side it goes to ARMED,  and on the remote side it goes to INIT.   The
> osm.log seems to show that the MAD packets are timing out.

Yes, as I mentioned, the remote end is not responding to SMA packets
(even though the right modules appear to be loaded to do that). I don't
know why this is, but this is NOT an OpenSM issue.

>   Here is the first part of the file, it just repeats after this at
> one minute intervals.

Right, OpenSM sees the Physical Link Up and tries to bring the port to
active but can't because the remote SMA is not responding. Periodically,
it downs the port and reattempts to bring it back up (but can't).
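
To spot this condition quickly (physical link up, logical state stuck
short of Active), the ibstat output can be checked with a short script.
This is only a minimal sketch in Python: port_states and stuck_ports are
hypothetical helper names, not part of OpenSM or the diag tools, and the
parsing assumes the ibstat layout shown in this thread.

```python
import re

def port_states(ibstat_text):
    """Map port number -> (logical state, physical state) from ibstat text."""
    ports = {}
    current = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"Port (\d+):$", line)  # "Port 1:" but not "Port GUID: ..."
        if m:
            current = int(m.group(1))
            ports[current] = [None, None]
        elif current is not None and line.startswith("State:"):
            ports[current][0] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Physical state:"):
            ports[current][1] = line.split(":", 1)[1].strip()
    return {p: tuple(v) for p, v in ports.items()}

def stuck_ports(ibstat_text):
    """Ports whose physical link is up but whose logical state is not Active."""
    return [p for p, (state, phys) in sorted(port_states(ibstat_text).items())
            if phys == "LinkUp" and state != "Active"]

# Trimmed copy of the local (koa) ibstat output quoted in this thread.
sample = """CA 'mthca0'
        Number of ports: 1
        Port 1:
                State: Armed
                Physical state: LinkUp
                Rate: 20
                Port GUID: 0x0002c90200216dc5
"""

print(port_states(sample))
print(stuck_ports(sample))
```

Running it against the output from both nodes should list port 1 on each
side for as long as the remote SMA stays silent.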

> [koa] (ib) root> cat /var/log/osm.log
> May 26 14:05:43 369104 [8EFC3D00] -> OpenSM Rev:openib-1.2.0 OpenIB
> svn Exported revision
> May 26 14:05:43 369260 [0000] -> OpenSM Rev:openib-1.2.0 OpenIB svn
> Exported revision
> 
> May 26 14:05:43 370571 [8EFC3D00] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> May 26 14:05:43 370631 [8EFC3D00] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> May 26 14:05:43 373005 [8EFC3D00] -> osm_vendor_bind: Binding to port
> 0x2c90200216dc5
> May 26 14:05:43 374685 [8EFC3D00] -> osm_vendor_bind: Binding to port
> 0x2c90200216dc5
> May 26 14:05:44 172028 [44007960] -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x11 trans_id=0x1239) -- dropping
> May 26 14:05:44 172070 [44007960] -> umad_receiver: ERR 5411: DR SMP
> May 26 14:05:44 172083 [44007960] -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 26 14:05:44 172148 [44007960] -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x1 (SubnGet)
>                                 D bit...................0x0
>                                 status..................0x0
>                                 hop_ptr.................0x0
>                                 hop_count...............0x1
>                                 trans_id................0x1239
>                                 attr_id.................0x11 (NodeInfo)
>                                 resv....................0x0
>                                 attr_mod................0x0
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................0xFFFF
>                                 dr_dlid.................0xFFFF
> 
>                                 Initial path: [0][1]
>                                 Return path:  [0][0]
>                                 Reserved:     [0][0][0][0][0][0][0]
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
> May 26 14:05:44 172199 [42003960] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c90200216dc4 port 1. Adding
> to light sweep sampling list
> May 26 14:05:44 172240 [42003960] -> Directed Path Dump of 0 hop path:
>                                 Path = [0]
> May 26 14:05:44 172256 [0000] -> Entering MASTER state
> 
> May 26 14:05:44 179081 [0000] -> SUBNET UP
> 
> May 26 14:05:54 180461 [44007960] -> umad_receiver: ERR 5409: send
> completed with error (method=0x1 attr=0x11 trans_id=0x1240) -- dropping
> May 26 14:05:54 180515 [44007960] -> umad_receiver: ERR 5411: DR SMP
> May 26 14:05:54 180528 [44007960] -> __osm_sm_mad_ctrl_send_err_cb:
> ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 26 14:05:54 180569 [44007960] -> SMP dump:
>                                 base_ver................0x1
>                                 mgmt_class..............0x81
>                                 class_ver...............0x1
>                                 method..................0x1 (SubnGet)
>                                 D bit...................0x0
>                                 status..................0x0
>                                 hop_ptr.................0x0
>                                 hop_count...............0x1
>                                 trans_id................0x1240
>                                 attr_id.................0x11 (NodeInfo)
>                                 resv....................0x0
>                                 attr_mod................0x0
>                                 m_key...................0x0000000000000000
>                                 dr_slid.................0xFFFF
>                                 dr_dlid.................0xFFFF
> 
>                                 Initial path: [0][1]
>                                 Return path:  [0][0]
>                                 Reserved:     [0][0][0][0][0][0][0]
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> 
> May 26 14:05:54 180624 [42003960] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c90200216dc4 port 1. Adding
> to light sweep sampling list
> May 26 14:05:54 180649 [42003960] -> Directed Path Dump of 0 hop path:
>                                 Path = [0]
> 
> 
> The physical link appears to be up: here are the ibstat, ibstatus results
> for both sides:
> 
> Local system
> 
> [koa] (ib) root> ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.0.800
>         Hardware version: a0
>         Node GUID: 0x0002c90200216dc4
>         System image GUID: 0x0002c90200216dc7
>         Port 1:
>                 State: Armed
>                 Physical state: LinkUp
>                 Rate: 20
>                 Base lid: 2
>                 LMC: 0
>                 SM lid: 2
>                 Capability mask: 0x02510a6a
>                 Port GUID: 0x0002c90200216dc5
> [koa] (ib) root> ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c902:0021:6dc5
>         base lid:        0x2
>         sm lid:          0x2
>         state:           3: ARMED
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> 
> Remote system
> 
> [jatoba] (ib) ib> ibstat
> CA 'mthca0'
>         CA type: MT25204
>         Number of ports: 1
>         Firmware version: 1.0.800
>         Hardware version: a0
>         Node GUID: 0x0002c90200216e40
>         System image GUID: 0x0002c90200216e43
>         Port 1:
>                 State: Initializing
>                 Physical state: LinkUp
>                 Rate: 20
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x02510a68
>                 Port GUID: 0x0002c90200216e41
> [jatoba] (ib) ib> ibstatus
> Infiniband device 'mthca0' port 1 status:
>         default gid:     fe80:0000:0000:0000:0002:c902:0021:6e41
>         base lid:        0x0
>         sm lid:          0x0
>         state:           2: INIT
>         phys state:      5: LinkUp
>         rate:            20 Gb/sec (4X DDR)
> 
> An "ibnetdiscover" on the local system gives the following:
> 
> [koa] (ib) root> ibnetdiscover
> ibwarn: [20638] handle_port: NodeInfo on DR path [0][1] port 1 failed,
>  skipping port

Right; that's the same thing the SM sees. The remote SMA is not
responding to requests (it is the same request the SM sends: SubnGet of
NodeInfo on DR path [0][1]).
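
If it helps to confirm that every failure in osm.log is this same
request, the send errors can be pulled out with a few lines of Python.
This is a rough sketch only: failed_sends and ATTR_NAMES are names made
up for illustration, and the regular expression assumes the ERR 5409
message format exactly as it appears in the log quoted above.

```python
import re

# Matches the parenthesised fields of a umad_receiver ERR 5409 entry.
SEND_ERR = re.compile(
    r"method=(0x[0-9a-fA-F]+) attr=(0x[0-9a-fA-F]+) trans_id=(0x[0-9a-fA-F]+)"
)

# Only the attribute ID seen in this thread; extend as needed.
ATTR_NAMES = {0x11: "NodeInfo"}

def failed_sends(log_text):
    """Return (method, attribute name, trans_id) for each failed send."""
    results = []
    for method, attr, tid in SEND_ERR.findall(log_text):
        attr_id = int(attr, 16)
        name = ATTR_NAMES.get(attr_id, hex(attr_id))
        results.append((int(method, 16), name, int(tid, 16)))
    return results

# Two entries copied from the quoted osm.log, with the mail wrapping undone.
sample = (
    "May 26 14:05:44 172028 [44007960] -> umad_receiver: ERR 5409: send "
    "completed with error (method=0x1 attr=0x11 trans_id=0x1239) -- dropping\n"
    "May 26 14:05:54 180461 [44007960] -> umad_receiver: ERR 5409: send "
    "completed with error (method=0x1 attr=0x11 trans_id=0x1240) -- dropping\n"
)

for method, name, tid in failed_sends(sample):
    print(f"method={method:#x} attr={name} trans_id={tid:#x}")
```

On the real file, pass the contents of /var/log/osm.log instead of the
sample string. If every entry comes back as method 0x1 with NodeInfo,
the SM never gets past the first query to the remote port.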

> #
> # Topology file: generated on Fri May 26 14:24:20 2006
> #
> # Max of 1 hops discovered
> # Initiated from node 0002c90200216dc4 port 0002c90200216dc5
> 
> vendid=0x2c9
> devid=0x6274
> sysimgguid=0x2c90200216dc7
> caguid=0x2c90200216dc4
> Ca      1 "H-0002c90200216dc4"          # koa HCA-1
> 
> What next, coach?

Can you turn on madeye on the remote node and see what packets are
received and sent? Let me know if you need help with that. I think you
said you were running OFED, right?

-- Hal

>   -Don Albert-



