[openib-general] Unreliable OpenSM failover

Venkatesh Babu venkatesh.babu at 3leafnetworks.com
Fri Dec 8 14:12:03 PST 2006


 I see the same problem with the OFED 1.1 stack, but less frequently.
It took 120 failovers (rebooting the server running the highest-priority
OpenSM each time) before I hit it. In this state OpenSM writes nothing
to its log file, does not assign LIDs to the other nodes, and does not
respond to multicast join requests. Even if another OpenSM is started on
another node with a higher priority, it cannot become the master. The
only way to recover is to kill the stuck OpenSM.
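
 Since one symptom is that the log file stops being updated, a watchdog
could flag a possibly stuck OpenSM by checking how long ago its log was
last written. The following is only a rough sketch under assumptions:
the function name log_is_stale and the one-hour threshold are
hypothetical, and the log paths are the ones passed with -f in the ps
output quoted below. A quiet log is only a hint, since a healthy, idle
SM may also log very little.

```python
import os
import time


def log_is_stale(path, max_age_seconds, now=None):
    """Return True if `path` was last modified more than
    `max_age_seconds` ago, or if it cannot be read at all."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log is treated as stale.
        return True
    return (now - mtime) > max_age_seconds


if __name__ == "__main__":
    # Hypothetical threshold: one hour without any log activity.
    for log in ("/var/log/opensm1.log", "/var/log/opensm2.log"):
        state = "STALE" if log_is_stale(log, 3600) else "ok"
        print(log, state)
```

 In the case shown below, opensm1.log would be reported as stale (last
entry Nov 22, checked Nov 27) while opensm2.log would not. What to do on
a stale result (alert, or kill and restart the SM) is left open here.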

 VBabu

Hal Rosenstock wrote:

>I don't see any explicit changes to the SM state machine which would
>affect this but as I have mentioned before there are many bug fixes in
>OFED 1.1. I can't conclusively state whether this would fix the issue
>you see but would be in a much better position to try to figure this
>out.
>
>-- Hal
>
>  
>
>> Hi
>>
>>   I have a topology of two switches and a bunch of nodes, each node
>> having a 2-port HCA. Port1 of every node connects to switch1 and
>> Port2 of every node connects to switch2, so Port1 and Port2 are in
>> different subnets. I am therefore running one OpenSM (from OFED 1.0)
>> for each port on one node designated as a server. To guard against
>> that server going down, I have another server node running OpenSM in
>> "standby" mode for each port. I adjust the priorities so that the
>> first server always has the "master" OpenSM and the second server
>> has the "standby" OpenSM.
>>
>>    When the first server is rebooted, the "standby" OpenSM should
>> take over mastership. It usually works fine, but sometimes it fails
>> to take over. In the following example the OpenSM for Port1 failed
>> to take over, but the OpenSM for Port2 took over and became
>> "master". The OpenSM for Port1 seems to be stuck in some weird state
>> (strace shows that it is sleeping). It is no longer assigning LIDs
>> to the rest of the nodes in the subnet and is not responding to the
>> broadcast joins. The log file shows nothing for the past 4 days. I
>> have the complete log files if needed.
>>
>>    Is this a known problem, and is it fixed in OFED 1.1?
>>
>> [root at vortex3l-72 158]# ibv_devinfo
>> hca_id: mthca0
>>        fw_ver:                         5.1.400
>>        node_guid:                      0050:4501:4b1a:0000
>>        sys_image_guid:                 0050:4501:4b1a:0003
>>        vendor_id:                      0x02c9
>>        vendor_part_id:                 25218
>>        hw_ver:                         0xA0
>>        board_id:                       ARM0020000001
>>        phys_port_cnt:                  2
>>                port:   1
>>                        state:                  PORT_ACTIVE (4)
>>                        max_mtu:                2048 (4)
>>                        active_mtu:             2048 (4)
>>                        sm_lid:                 7
>>                        port_lid:               1
>>                        port_lmc:               0x00
>>
>>                port:   2
>>                        state:                  PORT_ACTIVE (4)
>>                        max_mtu:                2048 (4)
>>                        active_mtu:             2048 (4)
>>                        sm_lid:                 1
>>                        port_lid:               1
>>                        port_lmc:               0x00
>>
>> [root at vortex3l-72 158]# ps -aux | grep open
>> Warning: bad syntax, perhaps a bogus '-'? See 
>> /usr/share/doc/procps-3.2.3/FAQ
>> root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 
>> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f 
>> /var/log/opensm2.log
>> root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 
>> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f 
>> /var/log/opensm1.log
>> root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
>> [root at vortex3l-72 158]# strace -p7975
>> Process 7975 attached - interrupt to quit
>> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0},  <unfinished ...>
>> Process 7975 detached
>> [root at vortex3l-72 158]# uptime
>> 12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
>> [root at vortex3l-72 158]# date
>> Mon Nov 27 12:13:05 PST 2006
>> [root at vortex3l-72 158]#  tail /var/log/opensm1.log
>> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 
>> 3673M
>>
>> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting 
>> Generic Notice type:3 num:66 from LID:0x0000 
>> GID:0xfe80000000000000,0x0000000000000000
>> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting 
>> Generic Notice type:3 num:66 from LID:0x0000 
>> GID:0xfe80000000000000,0x0000000000000000
>> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 
>> 0x5045014b1a0001
>> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 
>> 0x5045014b1a0001
>> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
>>
>> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
>>
>> [root at vortex3l-72 158]#  tail /var/log/opensm2.log
>>                                00 00 00 00 00 00 00 00   00 00 00 00 
>> 00 00 00 00
>>
>> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting 
>> Generic Notice type:3 num:65 from LID:0x0001 
>> GID:0xfe80000000000000,0x005045014b1a0002
>> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: 
>> Cannot find destination port with LID:0x0002
>> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: 
>> Cannot find destination port with LID:0x0003
>> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: 
>> Cannot find destination port with LID:0x0004
>> Nov 27 12:10:32 146382 [41401960] -> Removed port with 
>> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
>> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: 
>> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to 
>> light sweep sampling list
>> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
>>                                Path = [0][2]
>>



