[openib-general] Unreliable OpenSM failover
Hal Rosenstock
halr at voltaire.com
Fri Dec 8 14:44:40 PST 2006
On Fri, 2006-12-08 at 17:12, Venkatesh Babu wrote:
> I see the same problem with the OFED 1.1 stack, but less frequently.
> It took 120 failovers (rebooting the server running the highest-priority
> OpenSM) before I ran into it.
If I understand you correctly, you reboot the master SM and the standby
does not take over (become master). Is that correct?
Is this with 2 SMs or more?
> In this state OpenSM doesn't write anything to the log files, doesn't
> assign LIDs to the other nodes, and doesn't respond to multicast join
> operations. Even if another OpenSM is started on another node with
> higher priority, it cannot become the master. The only way to recover
> is to kill the stuck OpenSM.
What SMLID do the nodes in the subnet point to?
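One way to collect this on each node is to filter the ibv_devinfo output down to the sm_lid fields. The following is only a sketch: the helper name sm_lids is made up for illustration, and it assumes the output layout matches the ibv_devinfo listing quoted further down (a "port:" line followed by an "sm_lid:" line for each port).

```shell
# Hypothetical helper: read ibv_devinfo output on stdin and print one
# "port <n> sm_lid <lid>" line per port.
sm_lids() {
    awk '
        $1 == "port:"   { port = $2 }                        # remember the current port number
        $1 == "sm_lid:" { print "port", port, "sm_lid", $2 } # report the SM LID for that port
    '
}

# On each node (assumes the libibverbs utilities are installed):
#   ibv_devinfo | sm_lids
```

Comparing the reported values across nodes should show whether they still point at the LID of the rebooted master or at the standby.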
Can you determine where it is stuck? It sounds like it could be in some
tight loop. Can you build with debug symbols and attach gdb when this
occurs to see?
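As a rough sketch of the attach step, something like the following could capture a backtrace of every OpenSM thread without stopping the process for long. The helper name dump_threads and the GDB override variable are made up for illustration, and the backtraces will only show useful source locations if OpenSM was built with debug symbols.

```shell
# Hypothetical helper: print a backtrace of every thread in a running process.
# GDB can be overridden to point at a different debugger binary.
GDB=${GDB:-gdb}

dump_threads() {
    # -p attaches to the given PID; -batch makes gdb exit (and detach) once
    # the -ex command has run, so the process keeps running afterwards.
    "$GDB" -p "$1" -batch -ex "thread apply all bt"
}

# Example (7975 is the PID of the stuck opensm in the quoted ps output):
#   dump_threads 7975 > /tmp/opensm-bt.txt
```

Taking two or three dumps a few seconds apart would help distinguish a thread spinning in a loop from one that is blocked waiting.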
-- Hal
> VBabu
>
> Hal Rosenstock wrote:
>
> >I don't see any explicit changes to the SM state machine which would
> >affect this but, as I have mentioned before, there are many bug fixes in
> >OFED 1.1. I can't conclusively state whether this would fix the issue
> >you see, but we would be in a much better position to try to figure this
> >out.
> >
> >-- Hal
> >
> >
> >
> >> Hi
> >>
> >> I have a topology of two switches and a bunch of nodes, with each
> >> node having a 2-port HCA. Port1 of every node connects to switch1 and
> >> Port2 of every node connects to switch2, so Port1 and Port2 are in
> >> different subnets. I am therefore running one OpenSM (from OFED 1.0)
> >> for each port on one node designated as a server. To guard against
> >> that server going down, I have another server node running OpenSM in
> >> "standby" mode for each port. I adjust the priorities such that the
> >> first server always has the "master" OpenSM and the second server has
> >> the "standby" OpenSM.
> >>
> >> When the first server is rebooted, the "standby" OpenSM should take
> >> over mastership. It usually works fine, but sometimes it fails to
> >> take over. In the following example OpenSM for Port1 failed to
> >> take over, but OpenSM for Port2 took over and became "master". The
> >> OpenSM for Port1 seems to be stuck in some weird state (strace shows
> >> that it is sleeping). It is no longer assigning LIDs to the rest of
> >> the nodes in the subnet and is not responding to the broadcast joins.
> >> The log file shows nothing for the past 4 days. I have the complete
> >> log files if needed.
> >>
> >> Is this a known problem, and is it fixed in OFED 1.1?
> >>
> >> [root at vortex3l-72 158]# ibv_devinfo
> >> hca_id: mthca0
> >> fw_ver: 5.1.400
> >> node_guid: 0050:4501:4b1a:0000
> >> sys_image_guid: 0050:4501:4b1a:0003
> >> vendor_id: 0x02c9
> >> vendor_part_id: 25218
> >> hw_ver: 0xA0
> >> board_id: ARM0020000001
> >> phys_port_cnt: 2
> >> port: 1
> >> state: PORT_ACTIVE (4)
> >> max_mtu: 2048 (4)
> >> active_mtu: 2048 (4)
> >> sm_lid: 7
> >> port_lid: 1
> >> port_lmc: 0x00
> >>
> >> port: 2
> >> state: PORT_ACTIVE (4)
> >> max_mtu: 2048 (4)
> >> active_mtu: 2048 (4)
> >> sm_lid: 1
> >> port_lid: 1
> >> port_lmc: 0x00
> >>
> >> [root at vortex3l-72 158]# ps -aux | grep open
> >> Warning: bad syntax, perhaps a bogus '-'? See
> >> /usr/share/doc/procps-3.2.3/FAQ
> >> root 7988 0.0 0.0 92784 1672 ? Sl Nov22 0:06
> >> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f
> >> /var/log/opensm2.log
> >> root 7975 0.0 0.0 92784 1572 ? Sl Nov22 0:06
> >> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f
> >> /var/log/opensm1.log
> >> root 7803 0.0 0.0 51096 668 pts/0 S+ 12:11 0:00 grep open
> >> [root at vortex3l-72 158]# strace -p7975
> >> Process 7975 attached - interrupt to quit
> >> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, NULL) = 0
> >> nanosleep({10, 0}, <unfinished ...>
> >> Process 7975 detached
> >> [root at vortex3l-72 158]# uptime
> >> 12:13:02 up 4 days, 17:05, 5 users, load average: 0.00, 0.00, 0.00
> >> [root at vortex3l-72 158]# date
> >> Mon Nov 27 12:13:05 PST 2006
> >> [root at vortex3l-72 158]# tail /var/log/opensm1.log
> >> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn
> >> 3673M
> >>
> >> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting
> >> Generic Notice type:3 num:66 from LID:0x0000
> >> GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting
> >> Generic Notice type:3 num:66 from LID:0x0000
> >> GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port
> >> 0x5045014b1a0001
> >> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port
> >> 0x5045014b1a0001
> >> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
> >>
> >> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
> >>
> >> [root at vortex3l-72 158]# tail /var/log/opensm2.log
> >> 00 00 00 00 00 00 00 00 00 00 00 00
> >> 00 00 00 00
> >>
> >> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting
> >> Generic Notice type:3 num:65 from LID:0x0001
> >> GID:0xfe80000000000000,0x005045014b1a0002
> >> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec:
> >> Cannot find destination port with LID:0x0002
> >> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec:
> >> Cannot find destination port with LID:0x0003
> >> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec:
> >> Cannot find destination port with LID:0x0004
> >> Nov 27 12:10:32 146382 [41401960] -> Removed port with
> >> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> >> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108:
> >> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to
> >> light sweep sampling list
> >> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
> >> Path = [0][2]
> >>