[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master
Hal Rosenstock
halr at voltaire.com
Mon May 21 15:16:38 PDT 2007
On Mon, 2007-05-21 at 18:00, Venkatesh Babu wrote:
> Configuration:
> - 4 nodes in the IB network with two nodes running OpenSMs.
> - Each node has MT25218 InfiniHostEx Mellanox with two IB ports and
> with firmware version 5.2.0
> - All node's IB port 1 is connected to IB Switch1, say subnet1
> - All node's IB port 2 is connected to IB Switch2, say subnet2
So there is no link between the 2 switches, right ?
> - vortex3l-83 has two opensm's for each subnet with priority 0
> - vortex3l-84 has two opensm's for each subnet with priority 13
>
> Problem:
>
> The problem is opensm's on both the machines are in Standy state and none of them are
> claiming the mastership, though they have different priorities 0 and 13. Most of the times
> this configuration works fine, but ocassionally it is getting into this problem. It is hard
> to reproduce this problem.
Is there anything being done ? Cables pulled and reinserted ? Is
anything changing or is this a "stable" configuration in terms of the
topology ?
Is this the only thing going on on the subnet ?
> I tried to set the mastership of the opensm but it didn't worked.
> [root at vortex3l-83 ~]# sminfo -s3
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5043 priority 0
> state 2 SMINFO_STANDBY
>
> After couple of minutes
> [root at vortex3l-83 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 5938 priority 0 state 3 SMINFO_MASTER
So it did finally become master ?
I take it LID 6 is local (vortex31-83).
> Data:
>
> [root at vortex3l-83 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4937 priority 0 state
> 2 SMINFO_STANDBY
> [root at vortex3l-83 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> root 5237 0.0 0.0 92848 1692 ? Sl 00:39 0:00 /usr/bin/opensm
> -g 0x005045014a3a0001 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm1.log
> root 5250 0.0 0.0 92848 1700 ? Sl 00:39 0:00 /usr/bin/opensm
> -g 0x005045014a3a0002 -p 0 -s 10 -R updn -L 100 -f /var/log/opensm2.log
> root 8356 0.0 0.0 51064 708 pts/0 S+ 13:40 0:00 grep open
>
>
> [root at vortex3l-84 ~]# sminfo
> sminfo: sm lid 6 sm guid 0x5045014a3a0001, activity count 4939 priority 0 state
> 2 SMINFO_STANDBY
> [root at vortex3l-84 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
> root 5871 0.0 0.0 92848 1560 ? Sl 00:40 0:00 /usr/bin/opensm
> -g 0x005045014a2e0001 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm1.log
> root 5884 0.0 0.0 92848 1568 ? Sl 00:40 0:00 /usr/bin/opensm
> -g 0x005045014a2e0002 -p 13 -s 10 -R updn -L 100 -f /var/log/opensm2.log
> root 8845 0.0 0.0 51084 668 pts/0 S+ 13:40 0:00 grep open
>
>
> But ibv_devinfo on vortex3l-83 shows that both ports are active and sm_lid and
> lid are same, indicating it is master. Looks like it is the stale information.
>
> [root at vortex3l-83 ~]# ibv_devinfo
> hca_id: mthca0
> fw_ver: 5.2.0
> node_guid: 0050:4501:4a3a:0000
> sys_image_guid: 0050:4501:4a3a:0003
> vendor_id: 0x02c9
> vendor_part_id: 25218
> hw_ver: 0xA0
> board_id: ARM0020000001
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 6
> port_lid: 6
> port_lmc: 0x00
>
> port: 2
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 1
> port_lid: 1
> port_lmc: 0x00
>
> [root at vortex3l-84 ~]# ibv_devinfo
> hca_id: mthca0
> fw_ver: 5.2.0
> node_guid: 0050:4501:4a2e:0000
> sys_image_guid: 0050:4501:4a2e:0003
> vendor_id: 0x02c9
> vendor_part_id: 25218
> hw_ver: 0xA0
> board_id: ARM0020000001
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 6
> port_lid: 7
> port_lmc: 0x00
>
> port: 2
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 1
> port_lid: 2
> port_lmc: 0x00
>
>
>
> And also in /var/log/opensm[1/2].log I see the following error messages -
>
> May 21 00:40:28 250119 [95A9A160] -> OpenSM Rev:openib-2.0.5 OpenIB svn 4954M
This looks like a pretty old OpenSM. Is it OFED 1.1 or older ? Can you
try OFED 1.2 ?
What kernel is being used ? What distro ? What processor architecture ?
> May 21 00:40:28 484648 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
> May 21 00:40:28 487418 [95A9A160] -> osm_vendor_bind: Binding to port 0x5045014a2e0001
> May 21 00:40:29 292689 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x100000125f) -- dropping
> May 21 00:40:29 292728 [45007960] -> umad_receiver: ERR 5411: DR SMP
> May 21 00:40:29 292741 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
> May 21 00:40:29 292818 [45007960] -> SMP dump:
Is this around the time of the error or just an error in the OpenSM log
?
> I found that for both ports on both vortex boxes I see the port_xmit_discards
> counter was 1.
Did this change from 0 to 1 around the time of the issue with the SM
mastership ?
Also, what are the port counters for the switch ports in use ?
> Other error counters seems to be zero. Looks like some packets
> has been transmitted and received on both machines.
>
> [root at vortex3l-83 ~]# cat
> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
> 1
> [root at vortex3l-83 ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
>
> [root at vortex3l-84 ~]# cat
> /sys/class/infiniband/mthca0/ports/1/counters/port_xmit_discards
> 1
> [root at vortex3l-84 ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
>
> [root at sqasmd ~]# cat
> /sys/class/infiniband/mthca0/ports/2/counters/port_xmit_discards
> 1
>
>
> Link speed seems to be set to 10 Gb/sec (4X) on all machines.
>
>
> I have the opensm logs and gdb output for all the opensms. If you want I
> can send it to you.
Perhaps later; not just yet.
> Just attaching one sample gdb output with stack traces of all threads.
Are they all the same ?
-- Hal
> [root at vortex3l-83 ~]# gdb /usr/bin/opensm 5237
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>
> Attaching to program: /usr/bin/opensm, process 5237
> Reading symbols from /usr/lib/libopensm.so.1...done.
> Loaded symbols for /usr/lib/libopensm.so.1
> Reading symbols from /usr/lib/libosmcomp.so.1...done.
> Loaded symbols for /usr/lib/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182904123744 (LWP 5237)]
> [New Thread 1157658976 (LWP 5267)]
> [New Thread 1147169120 (LWP 5266)]
> [New Thread 1136679264 (LWP 5265)]
> [New Thread 1126189408 (LWP 5264)]
> [New Thread 1115699552 (LWP 5263)]
> [New Thread 1105209696 (LWP 5262)]
> [New Thread 1094719840 (LWP 5261)]
> [New Thread 1084229984 (LWP 5253)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/lib/libosmvendor.so.2...done.
> Loaded symbols for /usr/lib/libosmvendor.so.2
> Reading symbols from /usr/lib/libibcommon.so.1...done.
> Loaded symbols for /usr/lib/libibcommon.so.1
> Reading symbols from /usr/lib/libibumad.so.1...done.
> Loaded symbols for /usr/lib/libibumad.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) info threads
> 9 Thread 1084229984 (LWP 5253) 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 8 Thread 1094719840 (LWP 5261) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 7 Thread 1105209696 (LWP 5262) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 6 Thread 1115699552 (LWP 5263) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 5 Thread 1126189408 (LWP 5264) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 4 Thread 1136679264 (LWP 5265) 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 3 Thread 1147169120 (LWP 5266) 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 2 Thread 1157658976 (LWP 5267) 0x0000002a95d7fd22 in poll ()
> from /lib64/tls/libc.so.6
> 1 Thread 182904123744 (LWP 5237) 0x0000002a95d51d65 in __nanosleep_nocancel
> () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182904123744 (LWP 5237))]#0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x0000002a95d51d65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1 0x0000002a95d82368 in usleep () from /lib64/tls/libc.so.6
> #2 0x0000002a9578a05e in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
> #3 0x0000000000405ba1 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 5267))]#0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x0000002a95d7fd22 in poll () from /lib64/tls/libc.so.6
> #1 0x0000002a95bbb94d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2 0x0000002a95bbba6d in umad_recv (portid=Variable "portid" is not available.
> ) at src/umad.c:805
> #3 0x0000002a959ae68b in umad_receiver (p_ptr=0x5c3000)
> at osm_vendor_ibumad.c:266
> #4 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5c3070) at cl_thread.c:61
> #5 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #6 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #7 0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 5266))]#0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95783b4b in cl_event_wait_on (p_event=0x5887b8,
> wait_us=10000000, interruptible=1) at cl_event.c:181
> #2 0x000000000043630c in __osm_sm_sweeper ()
> #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x588898) at cl_thread.c:61
> #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 5265))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a278,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x000000000044d7a1 in __osm_vl15_poller ()
> #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58a2e8) at cl_thread.c:61
> #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 5264))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
> at cl_threadpool.c:71
> #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x5900e0) at cl_thread.c:61
> #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 5263))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
> at cl_threadpool.c:71
> #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x590010) at cl_thread.c:61
> #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 5262))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
> at cl_threadpool.c:71
> #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58ff40) at cl_thread.c:61
> #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 5261))]#0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589f8da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95783ab9 in cl_event_wait_on (p_event=0x58a560,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a9578a10a in __cl_thread_pool_routine (context=0x58a488)
> at cl_threadpool.c:71
> #3 0x0000002a95789f7a in __cl_thread_wrapper (arg=0x58b760) at cl_thread.c:61
> #4 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 5253))]#0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000002a9589facf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9578a9dd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
> #2 0x0000002a9589d0aa in start_thread () from /lib64/tls/libpthread.so.0
> #3 0x0000002a95d88b43 in clone () from /lib64/tls/libc.so.6
> #4 0x0000000000000000 in ?? ()
> (gdb) thread 10
> Thread ID 10 not known.
> (gdb)
> Thread ID 10 not known.
> (gdb) q
> The program is running. Quit anyway (and detach it)? (y or n) y
> Detaching from program: /usr/bin/opensm, process 5237
> [root at vortex3l-83 ~]#
>
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
More information about the general
mailing list