[openib-general] Unreliable OpenSM failover

Hal Rosenstock halr at voltaire.com
Mon Nov 27 16:48:07 PST 2006


Hi,

On Mon, 2006-11-27 at 19:03, Venkatesh Babu wrote:
> Hi
> 
>    I have a topology of two switches and a bunch of nodes, with each node 
> having a 2-port HCA. Port1 of every node connects to switch1 and Port2 of 
> every node connects to switch2. So Port1 and Port2 are in different 
> subnets. 

Are the two switches not connected to each other?

> So I am running one OpenSM (from OFED 1.0) for each port on one 
> node designated as a server. To guard against that server going down, I 
> have another server node that runs an OpenSM in "standby" mode for each 
> port. I will adjust the priorities such that the first server always has 
> the "master" OpenSM and the second server has the "standby" OpenSM.

Are the subnet prefixes configured?
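
One rough way to check this from the logs: the GID fields in the quoted
OpenSM output carry the subnet prefix in their upper 64 bits, and
0xfe80000000000000 is the InfiniBand default prefix. The Python sketch
below is only an illustration. The function names are hypothetical, and it
assumes the "GID:0x<prefix>,0x<guid>" layout seen in the quoted logs.

```python
import re

# Default (link-local) InfiniBand subnet prefix.
DEFAULT_PREFIX = 0xFE80000000000000

# Assumed layout, taken from the quoted logs: "GID:0x<prefix>,0x<guid>".
GID_RE = re.compile(r"GID:\s*0x([0-9a-fA-F]{16}),0x([0-9a-fA-F]{16})")


def subnet_prefixes(log_text):
    """Return the set of subnet prefixes (as ints) found in GID fields."""
    return {int(m.group(1), 16) for m in GID_RE.finditer(log_text)}


def uses_only_default_prefix(log_text):
    """True if at least one GID was found and all use the default prefix."""
    prefixes = subnet_prefixes(log_text)
    return bool(prefixes) and prefixes == {DEFAULT_PREFIX}
```

If this returns True for both /var/log/opensm1.log and
/var/log/opensm2.log, the two subnets are most likely still running with
the same default prefix.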

>     When the first server is rebooted, the "standby" OpenSM should take 
> over mastership. It usually works fine, but sometimes it fails to take 
> over. In the following example the OpenSM for Port1 failed to take over, 
> but the OpenSM for Port2 took over and became "master". The OpenSM for 
> Port1 seems to be stuck in some weird state (strace shows that it is 
> sleeping). It is no longer assigning LIDs to the rest of the nodes in the 
> subnet and is not responding to the broadcast joins. The log file shows 
> nothing for the past 4 days. I have the complete log files if needed.
> 
>     Is this a known problem, and is it fixed in OFED 1.1?

I don't see any explicit changes to the SM state machine that would
affect this, but as I have mentioned before, there are many bug fixes in
OFED 1.1. I can't say conclusively whether upgrading would fix the issue
you see, but with OFED 1.1 we would be in a much better position to
figure this out.
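
In the meantime, a log that has gone quiet can at least be detected
automatically. The following Python sketch is an illustration, not part of
OpenSM. It assumes the "Mon DD HH:MM:SS usec [thread] ->" line format seen
in the quoted logs, English month names, and a hypothetical one-hour
threshold. Silence alone does not prove the SM is hung, since an idle SM
may simply have nothing to log at the default verbosity.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, from the quoted logs:
#   "Nov 22 19:09:28 442435 [0000] -> Entering MASTER state"
STAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \d+ \[")


def last_log_time(lines, now):
    """Return the datetime of the last timestamped line, or None."""
    last = None
    for line in lines:
        m = STAMP_RE.match(line)
        if not m:
            continue  # wrapped lines and hex dumps carry no timestamp
        # The log has no year, so borrow the current one.
        # (Feb 29 entries are not handled in this sketch.)
        stamp = datetime.strptime(f"{now.year} {m.group(1)}",
                                  "%Y %b %d %H:%M:%S")
        if stamp > now:  # entry must be from the previous year
            stamp = stamp.replace(year=now.year - 1)
        last = stamp
    return last


def is_stale(lines, now, max_silence=timedelta(hours=1)):
    """True if nothing was logged within max_silence before now."""
    last = last_log_time(lines, now)
    return last is None or now - last > max_silence
```

Run against the tails quoted below with now set to Nov 27 12:13:05 2006,
this would flag opensm1.log (last entry Nov 22 19:09:28) as stale and
opensm2.log (last entry Nov 27 12:10:32) as not stale.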

-- Hal

> [root@vortex3l-72 158]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0050:4501:4b1a:0000
>         sys_image_guid:                 0050:4501:4b1a:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 7
>                         port_lid:               1
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               1
>                         port_lmc:               0x00
> 
> [root@vortex3l-72 158]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See 
> /usr/share/doc/procps-3.2.3/FAQ
> root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 
> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f /var/log/opensm2.log
> root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 
> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f /var/log/opensm1.log
> root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
> [root@vortex3l-72 158]# strace -p7975
> Process 7975 attached - interrupt to quit
> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0},  <unfinished ...>
> Process 7975 detached
> [root@vortex3l-72 158]# uptime
>  12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
> [root@vortex3l-72 158]# date
> Mon Nov 27 12:13:05 PST 2006
> [root@vortex3l-72 158]# tail /var/log/opensm1.log
> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3673M
> 
> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting 
> Generic Notice type:3 num:66 from LID:0x0000 
> GID:0xfe80000000000000,0x0000000000000000
> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting 
> Generic Notice type:3 num:66 from LID:0x0000 
> GID:0xfe80000000000000,0x0000000000000000
> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 
> 0x5045014b1a0001
> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 
> 0x5045014b1a0001
> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
> 
> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
> 
> [root@vortex3l-72 158]# tail /var/log/opensm2.log
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting 
> Generic Notice type:3 num:65 from LID:0x0001 
> GID:0xfe80000000000000,0x005045014b1a0002
> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: Cannot 
> find destination port with LID:0x0002
> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: Cannot 
> find destination port with LID:0x0003
> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: Cannot 
> find destination port with LID:0x0004
> Nov 27 12:10:32 146382 [41401960] -> Removed port with 
> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: 
> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to light 
> sweep sampling list
> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
>                                 Path = [0][2]
> 
> 




