[openib-general] Unreliable OpenSM failover

Venkatesh Babu venkatesh.babu at 3leafnetworks.com
Mon Nov 27 16:03:20 PST 2006


Hi

   I have a topology of two switches and a number of nodes, with each
node having a 2-port HCA. Port1 of every node connects to switch1 and
Port2 of every node connects to switch2, so Port1 and Port2 are in
different subnets. I therefore run one OpenSM (from OFED 1.0) per port
on one node designated as a server. To guard against that server going
down, I have a second server node running OpenSM in "standby" mode for
each port. I adjust the priorities so that the first server always has
the "master" OpenSM and the second server has the "standby" OpenSM.
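
The priority scheme relies on the master election rule as I understand
it from the InfiniBand specification: the SM with the highest priority
wins, and a tie is broken by the lowest port GUID. Below is a minimal
Python sketch of that rule only. The SM tuple, the elect_master name and
the priority/GUID values are hypothetical, and the sketch does not model
OpenSM's actual state machine or handover logic.

```python
from typing import NamedTuple


class SM(NamedTuple):
    """One subnet manager candidate (illustrative, not an OpenSM type)."""
    name: str
    priority: int   # SM priority, 0..15; higher wins
    port_guid: int  # port GUID; lower wins on a priority tie


def elect_master(candidates):
    """Pick the candidate with the highest priority, then the lowest GUID."""
    return min(candidates, key=lambda sm: (-sm.priority, sm.port_guid))


if __name__ == "__main__":
    # Hypothetical values: the first server gets the higher priority.
    server1 = SM("server1", 14, 0x0000000000000001)
    server2 = SM("server2", 13, 0x0000000000000002)
    print(elect_master([server2, server1]).name)
```

With the first server gone from the candidate list, the same function
returns the second server, which is the takeover expected here.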

    When the first server reboots, the "standby" OpenSM should take
over mastership. This usually works, but sometimes the takeover fails.
In the following example the OpenSM for Port1 failed to take over,
while the OpenSM for Port2 took over and became "master". The OpenSM
for Port1 seems to be stuck in some weird state (strace shows that it
is sleeping). It is no longer assigning LIDs to the rest of the nodes
in the subnet and is not responding to broadcast joins. Its log file
shows nothing for the past 4 days. I have the complete log files if
needed.
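
A stalled log like this could be spotted automatically. The following
is a rough Python sketch, assuming the timestamp layout seen in the
opensm log excerpts in this mail. summarize_log is a hypothetical
helper, not part of OpenSM, and the year has to be supplied because the
log lines do not carry one.

```python
import re
from datetime import datetime

# Matches the line prefix seen in the logs in this mail, e.g.
# "Nov 22 19:09:28 442435 [0000] -> Entering MASTER state"
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \d+ \[[0-9A-Fa-f]+\] -> (.*)$"
)
STATE_RE = re.compile(r"Entering (\w+) state")


def summarize_log(lines, year):
    """Return (last reported SM state, timestamp of the last log entry).

    Lines without a timestamp prefix (wrapped continuations, blank
    lines, hex dumps) are skipped.
    """
    last_state = None
    last_time = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        last_time = datetime.strptime(
            f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S"
        )
        s = STATE_RE.search(m.group(2))
        if s:
            last_state = s.group(1)
    return last_state, last_time


if __name__ == "__main__":
    with open("/var/log/opensm1.log") as f:
        state, last = summarize_log(f, 2006)
    print(state, last, datetime.now() - last if last else None)
```

Comparing the returned timestamp with the current time gives the idle
period. What counts as "too long" depends on the sweep interval and log
verbosity, so any threshold is left to the caller.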

    Is this a known problem, and is it fixed in OFED 1.1?

[root@vortex3l-72 158]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0050:4501:4b1a:0000
        sys_image_guid:                 0050:4501:4b1a:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 7
                        port_lid:               1
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00

[root@vortex3l-72 158]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f /var/log/opensm2.log
root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f /var/log/opensm1.log
root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
[root@vortex3l-72 158]# strace -p7975
Process 7975 attached - interrupt to quit
restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0},  <unfinished ...>
Process 7975 detached
[root@vortex3l-72 158]# uptime
 12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
[root@vortex3l-72 158]# date
Mon Nov 27 12:13:05 PST 2006
[root@vortex3l-72 158]#  tail /var/log/opensm1.log
Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3673M

Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 0x5045014b1a0001
Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 0x5045014b1a0001
Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state

Nov 22 19:09:28 442435 [0000] -> Entering MASTER state

[root@vortex3l-72 158]#  tail /var/log/opensm2.log
                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting Generic Notice type:3 num:65 from LID:0x0001 GID:0xfe80000000000000,0x005045014b1a0002
Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: Cannot find destination port with LID:0x0002
Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: Cannot find destination port with LID:0x0003
Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: Cannot find destination port with LID:0x0004
Nov 27 12:10:32 146382 [41401960] -> Removed port with GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to light sweep sampling list
Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
                                Path = [0][2]





