[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master

Hal Rosenstock halr at voltaire.com
Mon May 21 15:16:38 PDT 2007


On Mon, 2007-05-21 at 18:00, Venkatesh Babu wrote:
> Configuration:
>  - 4 nodes in the IB network with two nodes running OpenSMs.
>  - Each node has MT25218 InfiniHostEx Mellanox with two IB ports and 
> with firmware version 5.2.0
>  - All node's IB port 1 is connected to IB Switch1, say subnet1
>  - All node's IB port 2 is connected to IB Switch2, say subnet2

So there is no link between the 2 switches, right ?

>  - vortex3l-83 has two opensm's for each subnet with priority 0
>  - vortex3l-84 has two opensm's for each subnet with priority 13
> 
> Problem:
> 
> The problem is opensm's on both the machines are in Standy state and none of them are
> claiming the mastership, though they have different priorities 0 and 13. Most of the times
> this configuration works fine, but ocassionally it is getting into this problem. It is hard
> to reproduce this problem. 

Is there anything being done ? Cables pulled and reinserted ? Is
anything changing or is this a "stable" configuration in terms of the
topology ?

Is this the only thing going on on the subnet ?

> I tried to set the mastership of the opensm but it didn't worked.

> [root at vortex3l-83 ~]# sminfo -s3
>  sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0
> state 2 SMINFO_STANDBY
> 
> After couple of minutes 
> [root at vortex3l-83 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 state 3 SMINFO_MASTER

So it did finally become master ?

I take it LID 6 is local (vortex31-83).

> Data:
> 
> [root at vortex3l-83 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 state
> 2 SMINFO_STANDBY
> [root at vortex3l-83 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> root      5237  0.0  0.0 92848 1692 ?        Sl   00:39   0:00 /usr/bin/opensm
> -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log
> root      5250  0.0  0.0 92848 1700 ?        Sl   00:39   0:00 /usr/bin/opensm
> -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log
> root      8356  0.0  0.0 51064  708 pts/0    S+   13:40   0:00 grep open
> 
> 
> [root at vortex3l-84 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 state
> 2 SMINFO_STANDBY
> [root at vortex3l-84 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> root      5871  0.0  0.0 92848 1560 ?        Sl   00:40   0:00 /usr/bin/opensm
> -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log
> root      5884  0.0  0.0 92848 1568 ?        Sl   00:40   0:00 /usr/bin/opensm
> -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log
> root      8845  0.0  0.0 51084  668 pts/0    S+   13:40   0:00 grep open
> 
> 
> But ibv_devinfo on vortex3l-83 shows that both ports are active and sm_lid and
> lid are same, indicating it is master. Looks like it is the stale information.
> 
> [root at vortex3l-83 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.2.0
>         node_guid:                      0050:4501:4a3a:0000
>         sys_image_guid:                 0050:4501:4a3a:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 6
>                         port_lid:               6
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               1
>                         port_lmc:               0x00
> 
> [root at vortex3l-84 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.2.0
>         node_guid:                      0050:4501:4a2e:0000
>         sys_image_guid:                 0050:4501:4a2e:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 6
>                         port_lid:               7
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               2
>                         port_lmc:               0x00
> 
> 
> 
> And also in /var/log/opensm[1/2].log I see the following error messages -
> 
> May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M

This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
try OFED 1.2 ?

What kernel is being used ? What distro ? What processor architecture ?

> May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
> May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
> May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping
> May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:29 292818 [45007960] -> SMP dump:

Is this around the time of the error or just an error in the OpenSM log
? 

> I found that for both ports on both vortex boxes I see the port_xmit_discards
> counter was 1.

Did this change from 0 to 1 around the time of the issue with the SM
mastership ?

Also, what are the port counters for the switch ports in use ?

>  Other error counters seems to be zero. Looks like some packets
> has been transmitted and received on both machines.
> 
> [root at vortex3l-83 ~]# cat
> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
> 1
> [root at vortex3l-83 ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
> 
> [root at vortex3l-84 ~]# cat
> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
> 1
> [root at vortex3l-84 ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
> 
> [root at sqasmd ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
> 
> 
> Link speed seems to be set to 10 Gb/sec (4X) on all machines.
> 
> 
> I have the opensm logs and gdb output for all the opensms. If you want I 
> can send it to you.

Perhaps later; not just yet.

> Just attaching one sample gdb output with stack traces of all threads.

Are they all the same ?

-- Hal

> [root at vortex3l-83 ~]# gdb /usr/bin/opensm 5237
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> 
> Attaching to program: /usr/bin/opensm, process 5237
> Reading symbols from /usr/lib/libopensm.so.1...done.
> Loaded symbols for /usr/lib/libopensm.so.1
> Reading symbols from /usr/lib/libosmcomp.so.1...done.
> Loaded symbols for /usr/lib/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182904123744 (LWP 5237)]
> [New Thread 1157658976 (LWP 5267)]
> [New Thread 1147169120 (LWP 5266)]
> [New Thread 1136679264 (LWP 5265)]
> [New Thread 1126189408 (LWP 5264)]
> [New Thread 1115699552 (LWP 5263)]
> [New Thread 1105209696 (LWP 5262)]
> [New Thread 1094719840 (LWP 5261)]
> [New Thread 1084229984 (LWP 5253)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/lib/libosmvendor.so.2...done.
> Loaded symbols for /usr/lib/libosmvendor.so.2
> Reading symbols from /usr/lib/libibcommon.so.1...done.
> Loaded symbols for /usr/lib/libibcommon.so.1
> Reading symbols from /usr/lib/libibumad.so.1...done.
> Loaded symbols for /usr/lib/libibumad.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) info threads
>   9 Thread 1084229984 (LWP 5253)  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   8 Thread 1094719840 (LWP 5261)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   7 Thread 1105209696 (LWP 5262)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   6 Thread 1115699552 (LWP 5263)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   5 Thread 1126189408 (LWP 5264)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   4 Thread 1136679264 (LWP 5265)  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   3 Thread 1147169120 (LWP 5266)  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   2 Thread 1157658976 (LWP 5267)  0x0000002a95d7fd22 in poll ()
>    from /lib64/tls/libc.so.6
>   1 Thread 182904123744 (LWP 5237)  0x0000002a95d51d65 in __nanosleep_nocancel
>     () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182904123744 (LWP 5237))]#0  0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x0000002a95d82368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9578a05e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
> #3  0x0000000000405ba1 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 5267))]#0  0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6
> #1  0x0000002a95bbb94d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2  0x0000002a95bbba6d in umad_recv (portid=Variable "portid" is not available.
> ) at src/umad.c:805
> #3  0x0000002a959ae68b in umad_receiver (p_ptr=0x5c3000)
>     at osm_vendor_ibumad.c:266
> #4  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5c3070) at cl_thread.c:61
> #5  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #6  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #7  0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 5266))]#0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95783b4b in cl_event_wait_on (p_event=0x5887b8,
>     wait_us=10000000, interruptible=1) at cl_event.c:181
> #2  0x000000000043630c in __osm_sm_sweeper ()
> #3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x588898) at cl_thread.c:61
> #4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 5265))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a278,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x000000000044d7a1 in __osm_vl15_poller ()
> #3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58a2e8) at cl_thread.c:61
> #4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 5264))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
>     at cl_threadpool.c:71
> #3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61
> #4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 5263))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
>     at cl_threadpool.c:71
> #3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61
> #4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 5262))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
>     at cl_threadpool.c:71
> #3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61
> #4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 5261))]#0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
>     at cl_threadpool.c:71
> #3  0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61
> #4  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 5253))]#0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9578a9dd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
> #2  0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #3  0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #4  0x0000000000000000 in ?? ()
> (gdb) thread 10
> Thread ID 10 not known.
> (gdb)
> Thread ID 10 not known.
> (gdb) q
> The program is running.  Quit anyway (and detach it)? (y or n) y
> Detaching from program: /usr/bin/opensm, process 5237
> [root at vortex3l-83 ~]#
> 
> 
> 
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general




More information about the general mailing list