[openib-general] Unreliable OpenSM failover
Hal Rosenstock
halr at voltaire.com
Mon Nov 27 16:48:07 PST 2006
Hi,
On Mon, 2006-11-27 at 19:03, Venkatesh Babu wrote:
> Hi
>
> I have a topology of two switches and a bunch of nodes, with each node
> having 2-port HCAs. Port1 of every node connects to switch1 and Port2 of
> every node connects to switch2. So Port1 and Port2 are in different
> subnets.
Are the two switches not connected to each other?
> So I am running one OpenSM (from OFED 1.0) for each port on one
> node designated as a server. To guard against that server going down I
> have another server node to run the OpenSM in "standby" mode for each
> port. I will adjust the priorities such that first server always has
> "master" OpenSM and second server has "standby" OpenSM.
Are the subnet prefixes configured?
> When the first server rebooted, the "standby" OpenSM should take over the
> mastership. It usually works fine but sometimes it fails to take
> over. In the following example the OpenSM for Port1 failed to take over,
> but the OpenSM for Port2 took over and became "master". The OpenSM for Port1
> seems to be stuck in some weird state (strace shows that it is sleeping).
> It is no longer assigning LIDs to the rest of the nodes in the subnet
> and is not responding to the broadcast joins. The log file shows nothing
> for the past 4 days. I have the complete log files if needed.
>
> Is this a known problem and fixed in OFED 1.1 ?
I don't see any explicit changes to the SM state machine which would
affect this, but as I have mentioned before, there are many bug fixes in
OFED 1.1. I can't say conclusively whether upgrading would fix the issue
you see, but with OFED 1.1 we would be in a much better position to try
to figure this out.
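
As an aside, a quick way to see which local port currently hosts the
master SM is to compare sm_lid with port_lid in the ibv_devinfo output:
when the two are equal, the master SM for that subnet is running on that
port. A minimal Python sketch, assuming the field layout shown in the
quoted output below (the function name is made up for illustration and
is not part of any OFED tool):

```python
import re

def ports_hosting_master_sm(devinfo_text):
    """Map each port number to True when its sm_lid equals its port_lid.

    Accepts raw ibv_devinfo text; leading "> " quote markers are tolerated.
    """
    fields = {}
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.lstrip("> ").strip()
        m = re.match(r"(\w+):\s*(\S+)", line)
        if not m:
            continue
        key, value = m.group(1), m.group(2)
        if key == "port":
            # Start of a new per-port section.
            port = int(value)
            fields[port] = {}
        elif port is not None and key in ("sm_lid", "port_lid"):
            fields[port][key] = int(value, 0)
    return {
        p: f["sm_lid"] == f["port_lid"]
        for p, f in fields.items()
        if "sm_lid" in f and "port_lid" in f
    }
```

On the output quoted below this gives False for port 1 (sm_lid 7,
port_lid 1) and True for port 2 (sm_lid 1, port_lid 1), which matches
the report that only the Port2 OpenSM took over.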
-- Hal
> [root@vortex3l-72 158]# ibv_devinfo
> hca_id: mthca0
>     fw_ver: 5.1.400
>     node_guid: 0050:4501:4b1a:0000
>     sys_image_guid: 0050:4501:4b1a:0003
>     vendor_id: 0x02c9
>     vendor_part_id: 25218
>     hw_ver: 0xA0
>     board_id: ARM0020000001
>     phys_port_cnt: 2
>         port: 1
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 7
>             port_lid: 1
>             port_lmc: 0x00
>
>         port: 2
>             state: PORT_ACTIVE (4)
>             max_mtu: 2048 (4)
>             active_mtu: 2048 (4)
>             sm_lid: 1
>             port_lid: 1
>             port_lmc: 0x00
>
> [root@vortex3l-72 158]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.3/FAQ
> root 7988 0.0 0.0 92784 1672 ? Sl Nov22 0:06
> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f /var/log/opensm2.log
> root 7975 0.0 0.0 92784 1572 ? Sl Nov22 0:06
> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f /var/log/opensm1.log
> root 7803 0.0 0.0 51096 668 pts/0 S+ 12:11 0:00 grep open
> [root@vortex3l-72 158]# strace -p7975
> Process 7975 attached - interrupt to quit
> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, NULL) = 0
> nanosleep({10, 0}, <unfinished ...>
> Process 7975 detached
> [root@vortex3l-72 158]# uptime
> 12:13:02 up 4 days, 17:05, 5 users, load average: 0.00, 0.00, 0.00
> [root@vortex3l-72 158]# date
> Mon Nov 27 12:13:05 PST 2006
> [root@vortex3l-72 158]# tail /var/log/opensm1.log
> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3673M
>
> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting
> Generic Notice type:3 num:66 from LID:0x0000
> GID:0xfe80000000000000,0x0000000000000000
> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port
> 0x5045014b1a0001
> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port
> 0x5045014b1a0001
> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
>
> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
>
> [root@vortex3l-72 158]# tail /var/log/opensm2.log
> 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting
> Generic Notice type:3 num:65 from LID:0x0001
> GID:0xfe80000000000000,0x005045014b1a0002
> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: Cannot
> find destination port with LID:0x0002
> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: Cannot
> find destination port with LID:0x0003
> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: Cannot
> find destination port with LID:0x0004
> Nov 27 12:10:32 146382 [41401960] -> Removed port with
> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to light
> sweep sampling list
> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
> Path = [0][2]
>
>