[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5
Don.Albert at Bull.com
Fri May 26 14:32:16 PDT 2006
Hal,
I rebuilt the opensm executable with the patch you provided. The patch
fixes (or avoids) the segmentation fault and opensm comes up and runs.
However, the link is still not becoming operational. On the local side
it goes to ARMED, and on the remote side it goes to INIT. The osm.log
seems to show that the MAD packets are timing out. Here is the first part
of the file; it just repeats after this at one-minute intervals.
[koa] (ib) root> cat /var/log/osm.log
May 26 14:05:43 369104 [8EFC3D00] -> OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision
May 26 14:05:43 369260 [0000] -> OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision
May 26 14:05:43 370571 [8EFC3D00] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
May 26 14:05:43 370631 [8EFC3D00] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
May 26 14:05:43 373005 [8EFC3D00] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
May 26 14:05:43 374685 [8EFC3D00] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
May 26 14:05:44 172028 [44007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1239) -- dropping
May 26 14:05:44 172070 [44007960] -> umad_receiver: ERR 5411: DR SMP
May 26 14:05:44 172083 [44007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 26 14:05:44 172148 [44007960] -> SMP dump:
base_ver................0x1
mgmt_class..............0x81
class_ver...............0x1
method..................0x1 (SubnGet)
D bit...................0x0
status..................0x0
hop_ptr.................0x0
hop_count...............0x1
trans_id................0x1239
attr_id.................0x11 (NodeInfo)
resv....................0x0
attr_mod................0x0
m_key...................0x0000000000000000
dr_slid.................0xFFFF
dr_dlid.................0xFFFF
Initial path: [0][1]
Return path: [0][0]
Reserved: [0][0][0][0][0][0][0]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
May 26 14:05:44 172199 [42003960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c90200216dc4 port 1. Adding to light sweep sampling list
May 26 14:05:44 172240 [42003960] -> Directed Path Dump of 0 hop path:
Path = [0]
May 26 14:05:44 172256 [0000] -> Entering MASTER state
May 26 14:05:44 179081 [0000] -> SUBNET UP
May 26 14:05:54 180461 [44007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1240) -- dropping
May 26 14:05:54 180515 [44007960] -> umad_receiver: ERR 5411: DR SMP
May 26 14:05:54 180528 [44007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
May 26 14:05:54 180569 [44007960] -> SMP dump:
base_ver................0x1
mgmt_class..............0x81
class_ver...............0x1
method..................0x1 (SubnGet)
D bit...................0x0
status..................0x0
hop_ptr.................0x0
hop_count...............0x1
trans_id................0x1240
attr_id.................0x11 (NodeInfo)
resv....................0x0
attr_mod................0x0
m_key...................0x0000000000000000
dr_slid.................0xFFFF
dr_dlid.................0xFFFF
Initial path: [0][1]
Return path: [0][0]
Reserved: [0][0][0][0][0][0][0]
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
May 26 14:05:54 180624 [42003960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c90200216dc4 port 1. Adding to light sweep sampling list
May 26 14:05:54 180649 [42003960] -> Directed Path Dump of 0 hop path:
Path = [0]
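For anyone who would rather tally the errors in a log like this than read it by eye, here is a minimal sketch in Python. It assumes the error entries follow the "ERR xxxx:" pattern seen above, with a four-digit code that may be hexadecimal, and it tolerates a line break between "ERR" and the code. The function name count_osm_errors is made up for this illustration and is not part of OpenSM.

```python
import re
from collections import Counter

# Assumption: error entries look like "ERR 5409:"; \s+ also matches a newline,
# so a wrapped "ERR\n3113:" is still recognised.
ERR_RE = re.compile(r"ERR\s+([0-9A-Fa-f]{4}):")

def count_osm_errors(log_text):
    """Count how often each 'ERR xxxx' code appears in OpenSM log text."""
    return Counter(code.upper() for code in ERR_RE.findall(log_text))

# A few entries copied from the log above.
sample = """\
May 26 14:05:44 172028 [44007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x1239) -- dr
opping
May 26 14:05:44 172070 [44007960] -> umad_receiver: ERR 5411: DR SMP
May 26 14:05:44 172083 [44007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
"""

for code, n in sorted(count_osm_errors(sample).items()):
    print(code, n)
```

To run it on the real file, read /var/log/osm.log into a string and pass that to count_osm_errors instead of the sample.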
The physical link appears to be up. Here are the ibstat and ibstatus results
for both sides:
Local system
[koa] (ib) root> ibstat
CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.0.800
Hardware version: a0
Node GUID: 0x0002c90200216dc4
System image GUID: 0x0002c90200216dc7
Port 1:
State: Armed
Physical state: LinkUp
Rate: 20
Base lid: 2
LMC: 0
SM lid: 2
Capability mask: 0x02510a6a
Port GUID: 0x0002c90200216dc5
[koa] (ib) root> ibstatus
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c902:0021:6dc5
base lid: 0x2
sm lid: 0x2
state: 3: ARMED
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
Remote system
[jatoba] (ib) ib> ibstat
CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.0.800
Hardware version: a0
Node GUID: 0x0002c90200216e40
System image GUID: 0x0002c90200216e43
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 20
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0002c90200216e41
[jatoba] (ib) ib> ibstatus
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0002:c902:0021:6e41
base lid: 0x0
sm lid: 0x0
state: 2: INIT
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
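The comparison between the two sides can also be scripted. The following is a rough sketch in Python, assuming single-port ibstat output with plain "key: value" lines as pasted above; parse_ibstat and describe_port are hypothetical helper names, not part of any InfiniBand tool. It separates the physical state from the logical port state, which is the distinction that matters here: both links are LinkUp, but neither port has reached Active.

```python
def parse_ibstat(text):
    """Collect 'key: value' lines from ibstat output into a dict.

    Assumption: one CA with one port, so keys do not repeat.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():  # skips "CA 'mthca0'" and the bare "Port 1:"
            fields[key.strip()] = value.strip()
    return fields

def describe_port(fields):
    """Summarise physical versus logical state for one port."""
    state = fields.get("State", "unknown")
    phys = fields.get("Physical state", "unknown")
    if phys != "LinkUp":
        return f"physical link not up ({phys})"
    if state == "Active":
        return "port is Active"
    return f"physical link is up but logical state is {state}"

# Excerpts of the ibstat output shown above.
local = """\
Port 1:
State: Armed
Physical state: LinkUp
Base lid: 2
SM lid: 2
"""
remote = """\
Port 1:
State: Initializing
Physical state: LinkUp
Base lid: 0
SM lid: 0
"""

for name, text in (("koa", local), ("jatoba", remote)):
    print(name, "->", describe_port(parse_ibstat(text)))
```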
An "ibnetdiscover" on the local system gives the following:
[koa] (ib) root> ibnetdiscover
ibwarn: [20638] handle_port: NodeInfo on DR path [0][1] port 1 failed, skipping port
#
# Topology file: generated on Fri May 26 14:24:20 2006
#
# Max of 1 hops discovered
# Initiated from node 0002c90200216dc4 port 0002c90200216dc5
vendid=0x2c9
devid=0x6274
sysimgguid=0x2c90200216dc7
caguid=0x2c90200216dc4
Ca 1 "H-0002c90200216dc4" # koa HCA-1
What next, coach?
-Don Albert-