[openib-general] Unreliable OpenSM failover
Hal Rosenstock
halr at voltaire.com
Fri Dec 8 17:38:48 PST 2006
On Fri, 2006-12-08 at 20:03, Venkatesh Babu wrote:
> Now I hit another instance of the problem. Now I have more information.
Was this the same scenario or something different ?
> Node1:
> ======
>
> [root at vortex3l-71 ~]# ibv_devinfo
> hca_id: mthca0
> fw_ver: 5.1.400
> node_guid: 0050:4501:4a5a:0000
So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
that right ?
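For reference, a minimal sketch of how I am reading the OUI out of the GUID, assuming the usual EUI-64 layout in which the top 24 bits carry the OUI. guid_oui is a hypothetical helper for illustration, not something from OpenSM or libibverbs:

```c
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): return the 24-bit OUI,
 * i.e. the top three bytes of a 64-bit EUI-64 style GUID.
 */
static uint32_t guid_oui(uint64_t guid)
{
	return (uint32_t)(guid >> 40);
}
```

With the node_guid above, guid_oui(0x005045014a5a0000ULL) gives 0x005045. The node 3 GUID further down (0002:c902:0020:ed58) gives 0x0002c9, which lines up with the vendor_id 0x02c9 that ibv_devinfo reports.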
> sys_image_guid: 0050:4501:4a5a:0003
> vendor_id: 0x02c9
> vendor_part_id: 25218
> hw_ver: 0xA0
> board_id: ARM0020000001
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 2
> port_lid: 7
> port_lmc: 0x00
>
> port: 2
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 4
> port_lid: 4
> port_lmc: 0x00
>
> [root at vortex3l-71 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.3/FAQ
> root 6774 0.0 0.0 92844 1684 ? Sl Dec07 0:06
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f
> /var/log/opensm1.log
> root 21537 0.0 0.4 64556 9276 ttyS0 S+ 16:48 0:00 gdb
> /usr/local/ofed/bin/opensm 6787
> root 6787 0.0 0.0 92844 1728 ? Tl Dec07 0:05
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f
> /var/log/opensm2.log
> root 22566 0.0 0.0 51072 692 pts/0 S+ 16:53 0:00 grep open
> [root at vortex3l-71 ~]# tail /var/log/opensm2.log
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00
>
> Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error
> on MAD sized umad (Interrupted system call)
> Dec 07 11:29:14 625421 [0000] -> Exiting SM
Does this correspond to when the node 2 SM goes down, when that SM comes
back up, or something else ?
Not sure why OpenSM decides to exit (due to this error which should be
recoverable). It then fails to exit (hangs) as the other threads are not
terminated.
Is osm_exit_flag set ? I presume it is but would like verification.
What are the thread_state values of the various threads ?
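To illustrate what I mean by recoverable: "Interrupted system call" is errno EINTR, and a wait that is interrupted by a signal can normally just be restarted. A minimal sketch, assuming a plain retry is acceptable here. poll_retry_eintr is a hypothetical helper, not the actual libibumad or OpenSM code:

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not part of libibumad or OpenSM): wait for input
 * on fd, restarting poll() whenever a signal interrupts it.
 * Returns poll()'s result: >0 if fd is ready, 0 on timeout, -1 on an
 * error other than EINTR.
 * Note: each retry waits for the full timeout_ms again.
 */
static int poll_retry_eintr(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	return ret;
}
```

A real receiver thread would presumably also want to check the exit flag before retrying, so that a signal meant to shut the SM down is not swallowed by the loop.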
> [root at vortex3l-71 ~]#
> [root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>
> Attaching to program: /usr/local/ofed/bin/opensm, process 6787
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182899544416 (LWP 6787)]
> [New Thread 1157658976 (LWP 6797)]
> [New Thread 1147169120 (LWP 6796)]
> [New Thread 1136679264 (LWP 6795)]
> [New Thread 1126189408 (LWP 6794)]
> [New Thread 1115699552 (LWP 6793)]
> [New Thread 1105209696 (LWP 6792)]
> [New Thread 1094719840 (LWP 6791)]
> [New Thread 1084229984 (LWP 6789)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
> Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
> Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1 0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
> #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at
> cl_thread.c:125
> #3 0x0000000000405b71 in main ()
> (gdb) info threads
> 9 Thread 1084229984 (LWP 6789) 0x0000003858c088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 8 Thread 1094719840 (LWP 6791) 0x0000003858c088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 7 Thread 1105209696 (LWP 6792) 0x0000003858c088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 6 Thread 1115699552 (LWP 6793) 0x0000003858c088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 5 Thread 1126189408 (LWP 6794) 0x0000003858c088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 4 Thread 1136679264 (LWP 6795) 0x0000003858c088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 3 Thread 1147169120 (LWP 6796) 0x0000003858c08acf in
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 2 Thread 1157658976 (LWP 6797) 0x0000003857fbcd22 in poll ()
> from /lib64/tls/libc.so.6
> 1 Thread 182899544416 (LWP 6787) 0x0000003857f8ed65 in
> __nanosleep_nocancel
> () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0
> 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1 0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
> #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at
> cl_thread.c:125
> #3 0x0000000000405b71 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0
> 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
> #1 0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2 0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not
> available.
> ) at src/umad.c:805
> #3 0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50)
> at osm_vendor_ibumad.c:266
> #4 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at
> cl_thread.c:61
> #5 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #6 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #7 0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0
> 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
> wait_us=10000000, interruptible=1) at cl_event.c:181
> #2 0x00000000004362dc in __osm_sm_sweeper ()
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at
> cl_thread.c:61
> #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x000000000044d771 in __osm_vl15_poller ()
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at
> cl_thread.c:61
> #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at
> cl_thread.c:61
> #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at
> cl_thread.c:61
> #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at
> cl_thread.c:61
> #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at
> cl_thread.c:61
> #4 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157
> #2 0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #3 0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #4 0x0000000000000000 in ?? ()
> (gdb)
>
>
> Node 2:
> ======
Is this when node 2 comes back up and SM is restarted on both ports or
is it after the SM is stopped on port 2 ?
> [root at localhost ~]# ibv_devinfo
> hca_id: mthca0
> fw_ver: 5.1.400
> node_guid: 0050:4501:4a9e:0000
> sys_image_guid: 0050:4501:4a9e:0003
> vendor_id: 0x02c9
> vendor_part_id: 25218
> hw_ver: 0xA0
> board_id: ARM0020000001
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 2
> port_lid: 2
> port_lmc: 0x00
>
> port: 2
> state: PORT_INIT (2)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 4
This port still points at the SM on node 1, right ?
> port_lid: 2
> port_lmc: 0x00
>
> [root at localhost ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.3/FAQ
> root 6854 0.0 0.0 92844 1648 ? Sl 16:12 0:00
> /usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f
> /var/log/opensm1.log
> root 14005 0.0 0.4 64632 9312 ttyS0 S+ 16:46 0:00 gdb
> /var/log/opensm2.log 6867
> root 6867 0.0 0.0 92844 1536 ? Tl 16:12 0:00
> /usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f
> /var/log/opensm2.log
> root 16223 0.0 0.0 51060 680 pts/0 S+ 16:56 0:00 grep open
> [root at localhost ~]# tail /var/log/opensm2.log
> Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table:
> BFS through all port guids in the subnet ]
> Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop
> Tables configured on all switches
> Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25:
> Received an invalid delete request on MGID: 0xff12401bffff0000 :
> 0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002
> Dec 07 05:15:07 677004 [0000] -> SUBNET UP
>
> Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002
> Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x0050450148ba0002
> Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> 0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
> Dec 07 11:29:03 817752 [0000] -> Exiting SM
You stopped this SM, right ?
> [root at localhost ~]#
> [root at localhost ~]# gdb /var/log/opensm2.log 6867
Why gdb this node's SM ? I'm not following you.
Also, gdb should be pointed at the executable, not the log.
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as
> "x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable
> format: File format not recognized
>
> Attaching to process 6867
> Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols
> found)...done.
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182899548512 (LWP 6867)]
> [New Thread 1157658976 (LWP 6884)]
> [New Thread 1147169120 (LWP 6883)]
> [New Thread 1136679264 (LWP 6882)]
> [New Thread 1126189408 (LWP 6881)]
> [New Thread 1115699552 (LWP 6880)]
> [New Thread 1105209696 (LWP 6879)]
> [New Thread 1094719840 (LWP 6878)]
> [New Thread 1084229984 (LWP 6869)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
> Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
> Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x00000032eec8ed65 in __nanosleep_nocancel ()
> from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1 0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
> #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at
> cl_thread.c:125
> #3 0x0000000000405b71 in main ()
> (gdb) info threads
> 9 Thread 1084229984 (LWP 6869) 0x00000032ef908acf in
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 8 Thread 1094719840 (LWP 6878) 0x00000032ef9088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 7 Thread 1105209696 (LWP 6879) 0x00000032ef9088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 6 Thread 1115699552 (LWP 6880) 0x00000032ef9088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 5 Thread 1126189408 (LWP 6881) 0x00000032ef9088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 4 Thread 1136679264 (LWP 6882) 0x00000032ef9088da in
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 3 Thread 1147169120 (LWP 6883) 0x00000032ef908acf in
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> 2 Thread 1157658976 (LWP 6884) 0x00000032eecbcd22 in poll ()
> from /lib64/tls/libc.so.6
> 1 Thread 182899548512 (LWP 6867) 0x00000032eec8ed65 in
> __nanosleep_nocancel
> () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0
> 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1 0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
> #2 0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at
> cl_thread.c:125
> #3 0x0000000000405b71 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0
> 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
> #1 0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2 0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not
> available.
> ) at src/umad.c:805
> #3 0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50)
> at osm_vendor_ibumad.c:266
> #4 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at
> cl_thread.c:61
> #5 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #6 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #7 0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0
> 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
> wait_us=10000000, interruptible=1) at cl_event.c:181
> #2 0x00000000004362dc in __osm_sm_sweeper ()
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at
> cl_thread.c:61
> #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x000000000044d771 in __osm_vl15_poller ()
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at
> cl_thread.c:61
> #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at
> cl_thread.c:61
> #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at
> cl_thread.c:61
> #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at
> cl_thread.c:61
> #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
> wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2 0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
> at cl_threadpool.c:71
> #3 0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at
> cl_thread.c:61
> #4 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6 0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0
> 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
> from /lib64/tls/libpthread.so.0
> #1 0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
> #2 0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #3 0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #4 0x0000000000000000 in ?? ()
> (gdb)
>
>
> Node 3:
> ======
>
> [root at devsunj ~]# ibv_devinfo
> hca_id: mthca0
> fw_ver: 5.1.400
> node_guid: 0002:c902:0020:ed58
> sys_image_guid: 0002:c902:0020:ed5b
> vendor_id: 0x02c9
> vendor_part_id: 25218
> hw_ver: 0xA0
> board_id: MT_0150000001
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 2
> port_lid: 1
> port_lmc: 0x00
>
> port: 2
> state: PORT_INIT (2)
> max_mtu: 2048 (4)
> active_mtu: 512 (2)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
>
> [root at devsunj ~]#
>
>
>
>
> Hal Rosenstock wrote:
>
> >On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
> >
> >
> >>Hal Rosenstock wrote:
> >>
> >>
> >>
> >>>And the two switches are not connected to each other, right ?
> >>>
> >>>
> >>>
> >>>
> >> Yes, the switches are not connected.
> >>
> >>
> >>
> >>>Do you set a different subnet prefix (other than the default on one) ?
> >>>Not sure if this matters yet in OpenIB but it might.
> >>>
> >>>
> >>>
> >>>
> >> I don't know how to set subnet prefix.
> >>
> >>
> >
> >In opensm.opts file:
> >
> ># Subnet prefix used on this subnet
> >subnet_prefix 0xfe80000000000000
> >
> >(that's the default one)
> >
> >
> >
> >> So it may be default one.
> >>
> >>
> >>
> >>>That's the main thread. It's in the following loop:
> >>>
> >>>    while( !osm_exit_flag ) {
> >>>        if (opt.console)
> >>>            osm_console(&osm);
> >>>        else
> >>>            cl_thread_suspend( 10000 );
> >>>
> >>>        if (osm_hup_flag) {
> >>>            osm_hup_flag = 0;
> >>>            /* a HUP signal should only start a new heavy sweep */
> >>>            osm.subn.force_immediate_heavy_sweep = TRUE;
> >>>            osm_opensm_sweep( &osm );
> >>>        }
> >>>    }
> >>>
> >>>What about the other threads ? What are they doing ?
> >>>
> >>>
> >>>
> >>>
> >> Yes. I got this. It was in this loop. I didn't realize there are
> >>other OpenSM threads running. I need to find that out.
> >>
> >>
> >
> >OK.
> >
> >
> >
> >>>I wouldn't expect that given the problem you're hitting. The SUBNET UP
> >>>only occurs once the heavy sweep is completed. That's not happening.
> >>>
> >>>-- Hal
> >>>
> >>>
> >>>
> >>>
> >> Is the heavy sweep supposed to happen after the failover ?
> >>
> >>
> >
> >The standby after determining that the master is non responsive will go
> >back to discovering but in your configuration will find no other SM and
> >will go to master. I think it got that far.
> >
> >Once it transitions to master, it does a heavy sweep to configure the
> >subnet. Something is stopping that from completing. I'm not sure what is
> >going wrong.
> >
> >
> >
> >> Is there any documentation on the OpenSM architecture and design ?
> >>
> >>
> >
> >Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
> >for what an SM is supposed to do.
> >
> >-- Hal
> >
> >
> >
> >> VBabu
> >>
> >>
> >
> >
> >