[openib-general] Unreliable OpenSM failover
Venkatesh Babu
venkatesh.babu at 3leafnetworks.com
Mon Nov 27 16:03:20 PST 2006
Hi,
I have a topology of two switches and a bunch of nodes, with each node
having two-port HCAs. Port1 of every node connects to switch1 and Port2
of every node connects to switch2, so Port1 and Port2 are in different
subnets. I am therefore running one OpenSM (from OFED 1.0) per port on
one node designated as a server. To guard against that server going
down, I have a second server node that runs an OpenSM in "standby" mode
for each port. I adjust the priorities so that the first server always
has the "master" OpenSM and the second server has the "standby" OpenSM.
When the first server is rebooted, the "standby" OpenSM should take over
mastership. It usually works fine, but sometimes it fails to take over.
In the following example the OpenSM for Port1 failed to take over, but
the OpenSM for Port2 took over and became "master". The OpenSM for Port1
seems to be stuck in some weird state (strace shows that it is
sleeping). It is no longer assigning LIDs to the rest of the nodes in
the subnet and is not responding to the broadcast joins. Its log file
has shown nothing for the past 4 days. I have the complete log files if
needed.
Is this a known problem, and is it fixed in OFED 1.1?
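A symptom like "the log file shows nothing for days" can be checked
automatically. The following is only a minimal sketch, assuming the
timestamp prefix seen in the opensm1.log excerpt in this message
("Nov 22 19:09:28 442435 [0000] -> ..."). The function names, the
regular expression and the idea of passing the year in explicitly (the
log lines carry no year) are illustrative assumptions, not part of
OpenSM.

```python
import re
from datetime import datetime, timedelta

# Month abbreviations as they appear in the log excerpts (assumed English).
MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

# Timestamp prefix, e.g. "Nov 22 19:09:28 442435 [0000] -> Entering MASTER state".
# Wrapped continuation lines do not match and are skipped.
STAMP = re.compile(
    r"^([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) \d+ \[")


def last_log_time(lines, year):
    """Return the datetime of the last timestamped line, or None."""
    latest = None
    for line in lines:
        m = STAMP.match(line)
        if m and m.group(1) in MONTHS:
            latest = datetime(year, MONTHS[m.group(1)], int(m.group(2)),
                              int(m.group(3)), int(m.group(4)),
                              int(m.group(5)))
    return latest


def is_stale(lines, now, max_age):
    """True if the log has no timestamped entry newer than max_age."""
    last = last_log_time(lines, now.year)
    return last is None or now - last > max_age
```

A caller could read the tail of a log file, pass the lines together
with datetime.now() and a threshold such as timedelta(hours=1), and
raise an alert when is_stale() returns True. A log spanning a year
boundary would need extra handling that this sketch does not attempt.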
[root@vortex3l-72 158]# ibv_devinfo
hca_id: mthca0
    fw_ver:             5.1.400
    node_guid:          0050:4501:4b1a:0000
    sys_image_guid:     0050:4501:4b1a:0003
    vendor_id:          0x02c9
    vendor_part_id:     25218
    hw_ver:             0xA0
    board_id:           ARM0020000001
    phys_port_cnt:      2
        port: 1
            state:      PORT_ACTIVE (4)
            max_mtu:    2048 (4)
            active_mtu: 2048 (4)
            sm_lid:     7
            port_lid:   1
            port_lmc:   0x00
        port: 2
            state:      PORT_ACTIVE (4)
            max_mtu:    2048 (4)
            active_mtu: 2048 (4)
            sm_lid:     1
            port_lid:   1
            port_lmc:   0x00
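In the output above, port 2 reports sm_lid equal to its own port_lid,
while port 1 reports sm_lid 7 against port_lid 1, i.e. port 1 still
points at an SM with a different LID. A rough sketch of extracting
this from ibv_devinfo text follows. It assumes the "key: value" layout
shown above, and the function names are hypothetical helpers, not part
of any OFED tool.

```python
def sm_lids_by_port(devinfo_text):
    """Map port number -> {'sm_lid': int, 'port_lid': int}."""
    ports = {}
    current = None
    for raw in devinfo_text.splitlines():
        # Split at the first colon only, so GUID values keep their colons.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "port":
            current = int(value)
            ports[current] = {}
        elif current is not None and key in ("sm_lid", "port_lid"):
            ports[current][key] = int(value)
    return ports


def ports_with_remote_sm(ports):
    """Ports whose recorded SM LID differs from the port's own LID."""
    return sorted(p for p, f in ports.items()
                  if f.get("sm_lid") != f.get("port_lid"))
```

On the node that is expected to be master for both subnets, a
non-empty result from ports_with_remote_sm() would point at the port
whose OpenSM did not take over.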
[root@vortex3l-72 158]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See
/usr/share/doc/procps-3.2.3/FAQ
root 7988 0.0 0.0 92784 1672 ? Sl Nov22 0:06
/usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f /var/log/opensm2.log
root 7975 0.0 0.0 92784 1572 ? Sl Nov22 0:06
/usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f /var/log/opensm1.log
root 7803 0.0 0.0 51096 668 pts/0 S+ 12:11 0:00 grep open
[root@vortex3l-72 158]# strace -p7975
Process 7975 attached - interrupt to quit
restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, NULL) = 0
nanosleep({10, 0}, <unfinished ...>
Process 7975 detached
[root@vortex3l-72 158]# uptime
12:13:02 up 4 days, 17:05, 5 users, load average: 0.00, 0.00, 0.00
[root@vortex3l-72 158]# date
Mon Nov 27 12:13:05 PST 2006
[root@vortex3l-72 158]# tail /var/log/opensm1.log
Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3673M
Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting
Generic Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting
Generic Notice type:3 num:66 from LID:0x0000
GID:0xfe80000000000000,0x0000000000000000
Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port
0x5045014b1a0001
Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port
0x5045014b1a0001
Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
[root@vortex3l-72 158]# tail /var/log/opensm2.log
00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00
Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting
Generic Notice type:3 num:65 from LID:0x0001
GID:0xfe80000000000000,0x005045014b1a0002
Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: Cannot
find destination port with LID:0x0002
Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: Cannot
find destination port with LID:0x0003
Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: Cannot
find destination port with LID:0x0004
Nov 27 12:10:32 146382 [41401960] -> Removed port with
GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to light
sweep sampling list
Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
Path = [0][2]