[openib-general] Unreliable OpenSM failover

Hal Rosenstock halr at voltaire.com
Fri Dec 8 17:38:48 PST 2006


On Fri, 2006-12-08 at 20:03, Venkatesh Babu wrote:
> Now I hit another instance of the problem. Now I have more information.

Was this the same scenario or something different ?

> Node1:
> ======
> 
> [root at vortex3l-71 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0050:4501:4a5a:0000

So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
that right ?
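
For reference, the OUI is the top 24 bits of the 64-bit GUID. A minimal
sketch in C of pulling it out of a node GUID as printed by ibv_devinfo
(the function name guid_to_oui is made up for this example and is not part
of OpenSM or libibverbs):

```c
#include <stdint.h>

/* Hypothetical helper: return the 24-bit OUI held in the top three
 * bytes of a 64-bit GUID. */
uint32_t guid_to_oui(uint64_t guid)
{
    return (uint32_t)(guid >> 40) & 0xffffffu;
}
```

Passing 0x005045014a5a0000 (the node_guid above with the colons removed)
gives 0x005045.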

>         sys_image_guid:                 0050:4501:4a5a:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 2
>                         port_lid:               7
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 4
>                         port_lid:               4
>                         port_lmc:               0x00
> 
> [root at vortex3l-71 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See 
> /usr/share/doc/procps-3.2.3/FAQ
> root      6774  0.0  0.0 92844 1684 ?        Sl   Dec07   0:06 
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f 
> /var/log/opensm1.log
> root     21537  0.0  0.4 64556 9276 ttyS0    S+   16:48   0:00 gdb 
> /usr/local/ofed/bin/opensm 6787
> root      6787  0.0  0.0 92844 1728 ?        Tl   Dec07   0:05 
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f 
> /var/log/opensm2.log
> root     22566  0.0  0.0 51072  692 pts/0    S+   16:53   0:00 grep open
> [root at vortex3l-71 ~]# tail /var/log/opensm2.log
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
> Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error 
> on MAD sized umad (Interrupted system call)
> Dec 07 11:29:14 625421 [0000] -> Exiting SM

Does this correspond to when node 2 SM goes down, SM comes up, or
something else ? 

I'm not sure why OpenSM decides to exit because of this error, which should
be recoverable. It then fails to exit (hangs) because the other threads are
not terminated. 
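
As an illustration of why "Interrupted system call" is normally
recoverable, here is a minimal sketch in C of retrying poll() when a signal
interrupts it. The function name is made up for this example, and this is
not the actual umad_receiver or libibumad code, only the general pattern:

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical wrapper: call poll() and retry when a signal interrupts
 * it (errno == EINTR) instead of reporting the interruption as a
 * receive error.  Each retry restarts the full timeout, which is
 * harmless for an infinite wait (timeout_ms < 0) but stretches a
 * finite one. */
int poll_retry_on_eintr(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int n;

    do {
        n = poll(fds, nfds, timeout_ms);
    } while (n < 0 && errno == EINTR);

    return n;   /* >0: ready descriptors, 0: timeout, <0: real error */
}
```

A real receiver thread would presumably also check an exit flag (such as
osm_exit_flag) before retrying, so that a shutdown signal is not swallowed.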

Is osm_exit_flag set ? I presume it is but would like verification.
What are the thread_state values of the various threads ?

> [root at vortex3l-71 ~]#
> [root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> 
> Attaching to program: /usr/local/ofed/bin/opensm, process 6787
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182899544416 (LWP 6787)]
> [New Thread 1157658976 (LWP 6797)]
> [New Thread 1147169120 (LWP 6796)]
> [New Thread 1136679264 (LWP 6795)]
> [New Thread 1126189408 (LWP 6794)]
> [New Thread 1115699552 (LWP 6793)]
> [New Thread 1105209696 (LWP 6792)]
> [New Thread 1094719840 (LWP 6791)]
> [New Thread 1084229984 (LWP 6789)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
> Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
> Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) info threads
>   9 Thread 1084229984 (LWP 6789)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   8 Thread 1094719840 (LWP 6791)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   7 Thread 1105209696 (LWP 6792)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   6 Thread 1115699552 (LWP 6793)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   5 Thread 1126189408 (LWP 6794)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   4 Thread 1136679264 (LWP 6795)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   3 Thread 1147169120 (LWP 6796)  0x0000003858c08acf in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   2 Thread 1157658976 (LWP 6797)  0x0000003857fbcd22 in poll ()
>    from /lib64/tls/libc.so.6
>   1 Thread 182899544416 (LWP 6787)  0x0000003857f8ed65 in 
> __nanosleep_nocancel
>     () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0  
> 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0  
> 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
> #1  0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2  0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not 
> available.
> ) at src/umad.c:805
> #3  0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50)
>     at osm_vendor_ibumad.c:266
> #4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
> cl_thread.c:61
> #5  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #6  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #7  0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0  
> 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
>     wait_us=10000000, interruptible=1) at cl_event.c:181
> #2  0x00000000004362dc in __osm_sm_sweeper ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x000000000044d771 in __osm_vl15_poller ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157
> #2  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #3  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #4  0x0000000000000000 in ?? ()
> (gdb)
> 
> 
> Node 2:
> ======

Is this when node 2 comes back up and SM is restarted on both ports or
is it after the SM is stopped on port 2 ?

> [root at localhost ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0050:4501:4a9e:0000
>         sys_image_guid:                 0050:4501:4a9e:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 2
>                         port_lid:               2
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 4

This port still points at the SM on node 1, right ?

>                         port_lid:               2
>                         port_lmc:               0x00
> 
> [root at localhost ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See 
> /usr/share/doc/procps-3.2.3/FAQ
> root      6854  0.0  0.0 92844 1648 ?        Sl   16:12   0:00 
> /usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f 
> /var/log/opensm1.log
> root     14005  0.0  0.4 64632 9312 ttyS0    S+   16:46   0:00 gdb 
> /var/log/opensm2.log 6867
> root      6867  0.0  0.0 92844 1536 ?        Tl   16:12   0:00 
> /usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f 
> /var/log/opensm2.log
> root     16223  0.0  0.0 51060  680 pts/0    S+   16:56   0:00 grep open
> [root at localhost ~]# tail /var/log/opensm2.log
> Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table: 
> BFS through all port guids in the subnet ]
> Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop 
> Tables configured on all switches
> Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25: 
> Received an invalid delete request on MGID: 0xff12401bffff0000 : 
> 0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002
> Dec 07 05:15:07 677004 [0000] -> SUBNET UP
> 
> Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, scope_state = 0x1, component mask = 
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002
> Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, scope_state = 0x1, component mask = 
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x0050450148ba0002
> Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, scope_state = 0x1, component mask = 
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> 0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
> Dec 07 11:29:03 817752 [0000] -> Exiting SM

You stopped this SM, right ?

> [root at localhost ~]#
> [root at localhost ~]# gdb /var/log/opensm2.log 6867

Why gdb this node's SM ? I'm not following you.

Also, gdb should point at the executable, not the log.

> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as 
> "x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable 
> format: File format not recognized
> 
> Attaching to process 6867
> Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols 
> found)...done.
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182899548512 (LWP 6867)]
> [New Thread 1157658976 (LWP 6884)]
> [New Thread 1147169120 (LWP 6883)]
> [New Thread 1136679264 (LWP 6882)]
> [New Thread 1126189408 (LWP 6881)]
> [New Thread 1115699552 (LWP 6880)]
> [New Thread 1105209696 (LWP 6879)]
> [New Thread 1094719840 (LWP 6878)]
> [New Thread 1084229984 (LWP 6869)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
> Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
> Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x00000032eec8ed65 in __nanosleep_nocancel ()
>    from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) info threads
>   9 Thread 1084229984 (LWP 6869)  0x00000032ef908acf in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   8 Thread 1094719840 (LWP 6878)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   7 Thread 1105209696 (LWP 6879)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   6 Thread 1115699552 (LWP 6880)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   5 Thread 1126189408 (LWP 6881)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   4 Thread 1136679264 (LWP 6882)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   3 Thread 1147169120 (LWP 6883)  0x00000032ef908acf in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   2 Thread 1157658976 (LWP 6884)  0x00000032eecbcd22 in poll ()
>    from /lib64/tls/libc.so.6
>   1 Thread 182899548512 (LWP 6867)  0x00000032eec8ed65 in 
> __nanosleep_nocancel
>     () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0  
> 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0  
> 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
> #1  0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2  0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not 
> available.
> ) at src/umad.c:805
> #3  0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50)
>     at osm_vendor_ibumad.c:266
> #4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
> cl_thread.c:61
> #5  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #6  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #7  0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0  
> 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
>     wait_us=10000000, interruptible=1) at cl_event.c:181
> #2  0x00000000004362dc in __osm_sm_sweeper ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x000000000044d771 in __osm_vl15_poller ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0  
> 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
> #2  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #3  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #4  0x0000000000000000 in ?? ()
> (gdb)
> 
> 
> Node 3:
> ======
> 
> [root at devsunj ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0002:c902:0020:ed58
>         sys_image_guid:                 0002:c902:0020:ed5b
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       MT_0150000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 2
>                         port_lid:               1
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
> [root at devsunj ~]#
> 
> 
> 
> 
> Hal Rosenstock wrote:
> 
> >On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
> >  
> >
> >>Hal Rosenstock wrote:
> >>
> >>    
> >>
> >>>And the two switches are not connected to each other, right ?
> >>> 
> >>>
> >>>      
> >>>
> >>  Yes, the switches are not connected.
> >>
> >>    
> >>
> >>>Do you set a different subnet prefix (other than the default) on one of the subnets ?
> >>>Not sure if this matters yet in OpenIB but it might.
> >>> 
> >>>
> >>>      
> >>>
> >> I don't know how to set subnet prefix.
> >>    
> >>
> >
> >In opensm.opts file:
> >
> ># Subnet prefix used on this subnet
> >subnet_prefix 0xfe80000000000000
> >
> >(that's the default one)
> >
> >  
> >
> >> So it may be default one.
> >>
> >>    
> >>
> >>>That's the main thread. It's in the following loop:
> >>>
> >>>   while( !osm_exit_flag ) {
> >>>     if (opt.console)
> >>>       osm_console(&osm);
> >>>     else
> >>>       cl_thread_suspend( 10000 );
> >>>
> >>>     if (osm_hup_flag) {
> >>>       osm_hup_flag = 0;
> >>>       /* a HUP signal should only start a new heavy sweep */
> >>>       osm.subn.force_immediate_heavy_sweep = TRUE;
> >>>       osm_opensm_sweep( &osm );
> >>>     }
> >>>   }
> >>>
> >>>What about the other threads ? What are they doing ?
> >>> 
> >>>
> >>>      
> >>>
> >>  Yes. I got this. It was in this loop. I didn't realize there are 
> >>other OpenSM threads running. I need to find that out.
> >>    
> >>
> >
> >OK.
> >
> >  
> >
> >>>I wouldn't expect that given the problem you're hitting. The SUBNET UP
> >>>only occurs once the heavy sweep is completed. That's not happening.
> >>>
> >>>-- Hal
> >>> 
> >>>
> >>>      
> >>>
> >>   Is the heavy sweep supposed to happen after the failover ?
> >>    
> >>
> >
> >The standby after determining that the master is non responsive will go
> >back to discovering but in your configuration will find no other SM and
> >will go to master. I think it got that far.
> >
> >Once it transitions to master, it does a heavy sweep to configure the
> >subnet. Something is stopping that from completing. I'm not sure what is
> >going wrong.
> >
> >  
> >
> >>   Is there any documentation on the OpenSM architecture and design ?
> >>    
> >>
> >
> >Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
> >for what an SM is supposed to do.
> >
> >-- Hal
> >
> >  
> >
> >> VBabu
> >>    
> >>
> >
> >  
> >
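
The failover sequence described in the quoted text above (standby sees the
master is non-responsive, goes back to discovering, finds no other SM,
becomes master, then does a heavy sweep) can be sketched as a small state
function in C. This is a simplified illustration only: the enum and
function names are assumptions made for this sketch, not OpenSM
identifiers, and it ignores priorities, GUID comparison and handover.

```c
/* Illustrative only: simplified SM states and events. */
enum sm_state { SM_DISCOVERING, SM_STANDBY, SM_MASTER };

enum sm_event {
    EV_MASTER_UNRESPONSIVE,  /* standby: master no longer answers polls */
    EV_NO_OTHER_SM_FOUND     /* discovering: sweep found no other SM */
};

/* Return the next state; any combination not described in the text
 * is left unchanged. */
enum sm_state sm_next_state(enum sm_state cur, enum sm_event ev)
{
    if (cur == SM_STANDBY && ev == EV_MASTER_UNRESPONSIVE)
        return SM_DISCOVERING;
    if (cur == SM_DISCOVERING && ev == EV_NO_OTHER_SM_FOUND)
        return SM_MASTER;
    return cur;
}

/* A heavy sweep is expected right after a transition into master. */
int sm_heavy_sweep_due(enum sm_state prev, enum sm_state next)
{
    return prev != SM_MASTER && next == SM_MASTER;
}
```

In the reported hang the SM appears to have reached the master state, but
the heavy sweep that follows never completes.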




