[openib-general] Unreliable OpenSM failover

Hal Rosenstock halr at voltaire.com
Fri Dec 8 14:44:40 PST 2006


On Fri, 2006-12-08 at 17:12, Venkatesh Babu wrote:
> I see the same problem with the OFED 1.1 stack as well, but less 
> frequently. I had to try 120 failovers (by rebooting the highest priority 
> OpenSM server) before hitting this problem.

If I understand you correctly, you reboot the master SM and the standby
does not take over (become master). Is that correct?

Is this with 2 SMs or more ?

> In this state OpenSM doesn't write anything to the log files, 
> doesn't assign LIDs to the other nodes, and doesn't respond 
> to multicast join operations. Even if another OpenSM is 
> started on another node with a higher priority, it 
> cannot become the master. The only way to recover from this is to kill 
> the stuck OpenSM.

What SMLID do the nodes in the subnet point to ?
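
One way to collect that answer from each node is to filter the ibv_devinfo output. The following is a minimal sketch, assuming the output layout matches the ibv_devinfo listing quoted further down; the function name sm_lids is made up for illustration.

```sh
#!/bin/sh
# sm_lids (hypothetical helper): read ibv_devinfo output on stdin and print
# one line per port with the SM LID that port currently points to.
# Assumes sm_lid is printed before port_lid for each port, as in the
# listing quoted in this thread.
sm_lids() {
    awk '
        $1 == "hca_id:"   { hca = $2 }
        $1 == "port:"     { port = $2 }
        $1 == "sm_lid:"   { sm = $2 }
        $1 == "port_lid:" { printf "%s port %s sm_lid=%s port_lid=%s\n", hca, port, sm, $2 }
    '
}

# Usage on each node:
#   ibv_devinfo | sm_lids
```

In the ibv_devinfo output quoted below, port 1 reports sm_lid 7 while its own port_lid is 1, and port 2 reports sm_lid 1 with port_lid 1, so comparing these two fields across nodes may show which SM each subnet still considers master.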

Can you determine where it is stuck? Sounds like it could be in some
tight loop. Can you build with debugging symbols and attach gdb when
this occurs to see?
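
A rough sketch of how that inspection might look, assuming a Linux /proc filesystem and a gdb that accepts -batch and -ex; the function names thread_states and dump_backtraces are made up for illustration. Note that strace -p without -f follows only the thread it attaches to, and the "Sl" state in the ps output quoted below marks opensm as multi-threaded, so the nanosleep loop in the quoted strace most likely describes just the main thread.

```sh
#!/bin/sh
# thread_states (hypothetical helper): list every thread of a process with
# its name and scheduler state, read from /proc/<pid>/task/<tid>/status.
# Only the first word of the thread name is printed.
thread_states() {
    pid=$1
    for dir in /proc/"$pid"/task/*; do
        [ -r "$dir/status" ] || continue
        awk -v tid="${dir##*/}" '
            $1 == "Name:"  { name = $2 }
            $1 == "State:" { state = $2 " " $3 }
            END { printf "tid=%s name=%s state=%s\n", tid, name, state }
        ' "$dir/status"
    done
}

# dump_backtraces (hypothetical helper): attach gdb non-interactively and
# print a backtrace of every thread, then detach. Needs gdb installed and
# permission to ptrace the process; frames are only readable if the binary
# was built with debugging symbols (-g).
dump_backtraces() {
    gdb -p "$1" -batch -ex 'thread apply all bt'
}

# Usage with the stuck opensm PID from the ps listing:
#   thread_states 7975
#   dump_backtraces 7975 > /tmp/opensm-bt.txt
```

A thread shown in state R across repeated runs would fit the tight-loop theory, while all threads in state S would point more toward a missed wakeup or a deadlock.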

-- Hal

>  VBabu
> 
> Hal Rosenstock wrote:
> 
> >I don't see any explicit changes to the SM state machine which would
> >affect this, but as I have mentioned before, there are many bug fixes in
> >OFED 1.1. I can't conclusively state whether this would fix the issue
> >you see, but we would be in a much better position to try to figure this
> >out.
> >
> >-- Hal
> >
> >  
> >
> >> Hi
> >>
> >>   I have a topology of two switches and a bunch of nodes, with each 
> >> node having a 2-port HCA. Port1 of every node connects to switch1 and 
> >> Port2 of every node connects to switch2, so Port1 and Port2 are in 
> >> different subnets. I am therefore running one OpenSM (from OFED 1.0) 
> >> for each port on one node designated as a server. To guard against 
> >> that server going down, I have another server node running OpenSM in 
> >> "standby" mode for each port. I adjust the priorities such that the 
> >> first server always has the "master" OpenSM and the second server has 
> >> the "standby" OpenSM.
> >>
> >>    When the first server is rebooted, the "standby" OpenSM should take 
> >> over mastership. It usually works fine, but sometimes it fails to 
> >> take over. In the following example the OpenSM for Port1 failed to 
> >> take over, but the OpenSM for Port2 took over and became "master". The 
> >> OpenSM for Port1 seems to be stuck in some weird state (strace shows 
> >> that it is sleeping). It is no longer assigning LIDs to the rest of 
> >> the nodes in the subnet and is not responding to the broadcast joins. 
> >> The log file shows nothing from the past 4 days. I have the complete 
> >> log files if needed.
> >>
> >>    Is this a known problem, and is it fixed in OFED 1.1?
> >>
> >> [root@vortex3l-72 158]# ibv_devinfo
> >> hca_id: mthca0
> >>        fw_ver:                         5.1.400
> >>        node_guid:                      0050:4501:4b1a:0000
> >>        sys_image_guid:                 0050:4501:4b1a:0003
> >>        vendor_id:                      0x02c9
> >>        vendor_part_id:                 25218
> >>        hw_ver:                         0xA0
> >>        board_id:                       ARM0020000001
> >>        phys_port_cnt:                  2
> >>                port:   1
> >>                        state:                  PORT_ACTIVE (4)
> >>                        max_mtu:                2048 (4)
> >>                        active_mtu:             2048 (4)
> >>                        sm_lid:                 7
> >>                        port_lid:               1
> >>                        port_lmc:               0x00
> >>
> >>                port:   2
> >>                        state:                  PORT_ACTIVE (4)
> >>                        max_mtu:                2048 (4)
> >>                        active_mtu:             2048 (4)
> >>                        sm_lid:                 1
> >>                        port_lid:               1
> >>                        port_lmc:               0x00
> >>
> >> [root@vortex3l-72 158]# ps -aux | grep open
> >> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> >> root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f /var/log/opensm2.log
> >> root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f /var/log/opensm1.log
> >> root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
> >> [root@vortex3l-72 158]# strace -p7975
> >> Process 7975 attached - interrupt to quit
> >> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0},  <unfinished ...>
> >> Process 7975 detached
> >> [root@vortex3l-72 158]# uptime
> >> 12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
> >> [root@vortex3l-72 158]# date
> >> Mon Nov 27 12:13:05 PST 2006
> >> [root@vortex3l-72 158]#  tail /var/log/opensm1.log
> >> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3673M
> >>
> >> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 0x5045014b1a0001
> >> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 0x5045014b1a0001
> >> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
> >>
> >> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
> >>
> >> [root@vortex3l-72 158]#  tail /var/log/opensm2.log
> >>                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
> >>
> >> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0001 GID:0xfe80000000000000,0x005045014b1a0002
> >> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: Cannot find destination port with LID:0x0002
> >> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: Cannot find destination port with LID:0x0003
> >> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: Cannot find destination port with LID:0x0004
> >> Nov 27 12:10:32 146382 [41401960] -> Removed port with GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> >> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to light sweep sampling list
> >> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
> >>                                Path = [0][2]
> >>