[openib-general] Unreliable OpenSM failover
Venkatesh Babu
venkatesh.babu at 3leafnetworks.com
Fri Dec 8 14:12:03 PST 2006
I have hit the same problem with the OFED 1.1 stack as well, but less
frequently. It took 120 failovers (each triggered by rebooting the
server running the highest-priority OpenSM) before the problem appeared.
In this state OpenSM does not write anything to its log files, does not
assign LIDs to the other nodes, and does not respond to multicast join
operations. Even if another OpenSM is started on another node with a
higher priority, it cannot become the master. The only way to recover
is to kill the stuck OpenSM.
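
As a rough sketch of how this state could be caught automatically (this
is not part of the OFED tools): since the stuck OpenSM stays alive but
stops writing to its log, a watchdog could check the log's modification
time. The helper name log_is_stale and the one-hour threshold are
hypothetical, and a silent log is only a hint, since a healthy OpenSM on
a quiet fabric may also log very little.

```python
import os
import time


def log_is_stale(path, max_age_seconds, now=None):
    """Return True if `path` was last modified more than `max_age_seconds` ago."""
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) > max_age_seconds


if __name__ == "__main__":
    # Log paths taken from the opensm command lines in this thread.
    # The one-hour threshold is an arbitrary, hypothetical choice.
    for log in ("/var/log/opensm1.log", "/var/log/opensm2.log"):
        try:
            if log_is_stale(log, max_age_seconds=3600):
                print(f"{log}: no writes for over an hour, OpenSM may be stuck")
        except FileNotFoundError:
            print(f"{log}: not found")
```

Something like this could run from cron on the standby server; what to
do when it fires (alert, or kill and restart OpenSM) is left open.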
VBabu
Hal Rosenstock wrote:
>I don't see any explicit changes to the SM state machine which would
>affect this, but as I have mentioned before, there are many bug fixes
>in OFED 1.1. I can't conclusively state whether this would fix the
>issue you see, but we would be in a much better position to try to
>figure this out.
>
>-- Hal
>
>
>
>> Hi
>>
>> I have a topology of two switches and a bunch of nodes, with each
>> node having a 2-port HCA. Port1 of every node connects to switch1
>> and Port2 of every node connects to switch2, so Port1 and Port2 are
>> in different subnets. I am therefore running one OpenSM (from OFED
>> 1.0) for each port on one node designated as a server. To guard
>> against that server going down, I have another server node running
>> an OpenSM in "standby" mode for each port. I adjust the priorities
>> such that the first server always has the "master" OpenSM and the
>> second server has the "standby" OpenSM.
>>
>> When the first server is rebooted, the "standby" OpenSM should take
>> over mastership. It usually works fine, but sometimes it fails to
>> take over. In the following example the OpenSM for Port1 failed to
>> take over, but the OpenSM for Port2 took over and became "master".
>> The OpenSM for Port1 seems to be stuck in some weird state (strace
>> shows that it is sleeping). It is no longer assigning LIDs to the
>> rest of the nodes in the subnet and is not responding to the
>> broadcast joins. The log file shows nothing for the past 4 days. I
>> have the complete log files if needed.
>>
>> Is this a known problem, and is it fixed in OFED 1.1?
>>
>> [root at vortex3l-72 158]# ibv_devinfo
>> hca_id: mthca0
>> fw_ver: 5.1.400
>> node_guid: 0050:4501:4b1a:0000
>> sys_image_guid: 0050:4501:4b1a:0003
>> vendor_id: 0x02c9
>> vendor_part_id: 25218
>> hw_ver: 0xA0
>> board_id: ARM0020000001
>> phys_port_cnt: 2
>> port: 1
>> state: PORT_ACTIVE (4)
>> max_mtu: 2048 (4)
>> active_mtu: 2048 (4)
>> sm_lid: 7
>> port_lid: 1
>> port_lmc: 0x00
>>
>> port: 2
>> state: PORT_ACTIVE (4)
>> max_mtu: 2048 (4)
>> active_mtu: 2048 (4)
>> sm_lid: 1
>> port_lid: 1
>> port_lmc: 0x00
>>
>> [root at vortex3l-72 158]# ps -aux | grep open
>> Warning: bad syntax, perhaps a bogus '-'? See
>> /usr/share/doc/procps-3.2.3/FAQ
>> root 7988 0.0 0.0 92784 1672 ? Sl Nov22 0:06
>> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f
>> /var/log/opensm2.log
>> root 7975 0.0 0.0 92784 1572 ? Sl Nov22 0:06
>> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f
>> /var/log/opensm1.log
>> root 7803 0.0 0.0 51096 668 pts/0 S+ 12:11 0:00 grep open
>> [root at vortex3l-72 158]# strace -p7975
>> Process 7975 attached - interrupt to quit
>> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, NULL) = 0
>> nanosleep({10, 0}, <unfinished ...>
>> Process 7975 detached
>> [root at vortex3l-72 158]# uptime
>> 12:13:02 up 4 days, 17:05, 5 users, load average: 0.00, 0.00, 0.00
>> [root at vortex3l-72 158]# date
>> Mon Nov 27 12:13:05 PST 2006
>> [root at vortex3l-72 158]# tail /var/log/opensm1.log
>> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn
>> 3673M
>>
>> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting
>> Generic Notice type:3 num:66 from LID:0x0000
>> GID:0xfe80000000000000,0x0000000000000000
>> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting
>> Generic Notice type:3 num:66 from LID:0x0000
>> GID:0xfe80000000000000,0x0000000000000000
>> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port
>> 0x5045014b1a0001
>> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port
>> 0x5045014b1a0001
>> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
>>
>> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
>>
>> [root at vortex3l-72 158]# tail /var/log/opensm2.log
>> 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00
>>
>> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting
>> Generic Notice type:3 num:65 from LID:0x0001
>> GID:0xfe80000000000000,0x005045014b1a0002
>> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec:
>> Cannot find destination port with LID:0x0002
>> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec:
>> Cannot find destination port with LID:0x0003
>> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec:
>> Cannot find destination port with LID:0x0004
>> Nov 27 12:10:32 146382 [41401960] -> Removed port with
>> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
>> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108:
>> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to
>> light sweep sampling list
>> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
>> Path = [0][2]
>>