[openib-general] Unreliable OpenSM failover

Venkatesh Babu venkatesh.babu at 3leafnetworks.com
Fri Dec 8 17:03:30 PST 2006


I hit another instance of the problem, and this time I have more information.

Node 1:
======

[root at vortex3l-71 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0050:4501:4a5a:0000
        sys_image_guid:                 0050:4501:4a5a:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               7
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               4
                        port_lmc:               0x00

[root at vortex3l-71 ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root      6774  0.0  0.0 92844 1684 ?        Sl   Dec07   0:06 /usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f /var/log/opensm1.log
root     21537  0.0  0.4 64556 9276 ttyS0    S+   16:48   0:00 gdb /usr/local/ofed/bin/opensm 6787
root      6787  0.0  0.0 92844 1728 ?        Tl   Dec07   0:05 /usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f /var/log/opensm2.log
root     22566  0.0  0.0 51072  692 pts/0    S+   16:53   0:00 grep open
[root at vortex3l-71 ~]# tail /var/log/opensm2.log

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error on MAD sized umad (Interrupted system call)
Dec 07 11:29:14 625421 [0000] -> Exiting SM

[root at vortex3l-71 ~]#
[root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/local/ofed/bin/opensm, process 6787
Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182899544416 (LWP 6787)]
[New Thread 1157658976 (LWP 6797)]
[New Thread 1147169120 (LWP 6796)]
[New Thread 1136679264 (LWP 6795)]
[New Thread 1126189408 (LWP 6794)]
[New Thread 1115699552 (LWP 6793)]
[New Thread 1105209696 (LWP 6792)]
[New Thread 1094719840 (LWP 6791)]
[New Thread 1084229984 (LWP 6789)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) info threads
  9 Thread 1084229984 (LWP 6789)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  8 Thread 1094719840 (LWP 6791)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  7 Thread 1105209696 (LWP 6792)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  6 Thread 1115699552 (LWP 6793)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  5 Thread 1126189408 (LWP 6794)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  4 Thread 1136679264 (LWP 6795)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  3 Thread 1147169120 (LWP 6796)  0x0000003858c08acf in 
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 6797)  0x0000003857fbcd22 in poll ()
   from /lib64/tls/libc.so.6
  1 Thread 182899544416 (LWP 6787)  0x0000003857f8ed65 in 
__nanosleep_nocancel
    () from /lib64/tls/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0  
0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0  
0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
#1  0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available.
) at src/umad.c:775
#2  0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not 
available.
) at src/umad.c:805
#3  0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50)
    at osm_vendor_ibumad.c:266
#4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
cl_thread.c:61
#5  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0  
0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
    wait_us=10000000, interruptible=1) at cl_event.c:181
#2  0x00000000004362dc in __osm_sm_sweeper ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x000000000044d771 in __osm_vl15_poller ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 7
[Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157
#2  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#3  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb)


Node 2:
======

[root at localhost ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0050:4501:4a9e:0000
        sys_image_guid:                 0050:4501:4a9e:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               2
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               2
                        port_lmc:               0x00

[root at localhost ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root      6854  0.0  0.0 92844 1648 ?        Sl   16:12   0:00 /usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f /var/log/opensm1.log
root     14005  0.0  0.4 64632 9312 ttyS0    S+   16:46   0:00 gdb /var/log/opensm2.log 6867
root      6867  0.0  0.0 92844 1536 ?        Tl   16:12   0:00 /usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f /var/log/opensm2.log
root     16223  0.0  0.0 51060  680 pts/0    S+   16:56   0:00 grep open
[root at localhost ~]# tail /var/log/opensm2.log
Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table: BFS through all port guids in the subnet ]
Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop Tables configured on all switches
Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25: Received an invalid delete request on MGID: 0xff12401bffff0000 : 0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002
Dec 07 05:15:07 677004 [0000] -> SUBNET UP

Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002
Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x0050450148ba0002
Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
Dec 07 11:29:03 817752 [0000] -> Exiting SM

[root at localhost ~]#
[root at localhost ~]# gdb /var/log/opensm2.log 6867
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as 
"x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable 
format: File format not recognized

Attaching to process 6867
Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols 
found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182899548512 (LWP 6867)]
[New Thread 1157658976 (LWP 6884)]
[New Thread 1147169120 (LWP 6883)]
[New Thread 1136679264 (LWP 6882)]
[New Thread 1126189408 (LWP 6881)]
[New Thread 1115699552 (LWP 6880)]
[New Thread 1105209696 (LWP 6879)]
[New Thread 1094719840 (LWP 6878)]
[New Thread 1084229984 (LWP 6869)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00000032eec8ed65 in __nanosleep_nocancel ()
   from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) info threads
  9 Thread 1084229984 (LWP 6869)  0x00000032ef908acf in 
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  8 Thread 1094719840 (LWP 6878)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  7 Thread 1105209696 (LWP 6879)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  6 Thread 1115699552 (LWP 6880)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  5 Thread 1126189408 (LWP 6881)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  4 Thread 1136679264 (LWP 6882)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  3 Thread 1147169120 (LWP 6883)  0x00000032ef908acf in 
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 6884)  0x00000032eecbcd22 in poll ()
   from /lib64/tls/libc.so.6
  1 Thread 182899548512 (LWP 6867)  0x00000032eec8ed65 in 
__nanosleep_nocancel
    () from /lib64/tls/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0  
0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0  
0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
#1  0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available.
) at src/umad.c:775
#2  0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not 
available.
) at src/umad.c:805
#3  0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50)
    at osm_vendor_ibumad.c:266
#4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
cl_thread.c:61
#5  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#6  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0  
0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
    wait_us=10000000, interruptible=1) at cl_event.c:181
#2  0x00000000004362dc in __osm_sm_sweeper ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x000000000044d771 in __osm_vl15_poller ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 7
[Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0  
0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
#2  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#3  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb)


Node 3:
======

[root at devsunj ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0002:c902:0020:ed58
        sys_image_guid:                 0002:c902:0020:ed5b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       MT_0150000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               1
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

[root at devsunj ~]#
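
A note on the ERR 5404 line in the first node's opensm2.log: "Interrupted system call" is the error string for errno EINTR, and the thread 2 backtraces show umad_receiver blocked in poll() by way of umad_recv and dev_poll. Nothing in this thread establishes whether OpenSM should retry at that point or treat the interruption as part of shutdown. Purely as a general illustration, a receive loop that wants to ride out signal interruptions can restart the call. The helper below uses plain poll() and a hypothetical name (poll_retry_on_eintr); it is not the libibumad API and not OpenSM code.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper: wait on fds like poll(), but restart the call when a
 * signal interrupts it (errno == EINTR).
 *
 * Each restart waits for the full timeout again, so the total wait can
 * exceed timeout_ms. Returns poll()'s result; -1 means an error other than
 * EINTR.
 */
static int poll_retry_on_eintr(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int rc;

    do {
        rc = poll(fds, nfds, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    return rc;
}
```

A daemon that uses signals to request shutdown would normally check its exit flag before restarting the call, rather than looping unconditionally.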




Hal Rosenstock wrote:

>On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
>
>>Hal Rosenstock wrote:
>>
>>>And the two switches are not connected to each other, right ?
>>
>>  Yes, the switches are not connected.
>>
>>>Do you set a different subnet prefix (other than the default on one) ?
>>>Not sure if this matters yet in OpenIB but it might.
>>
>> I don't know how to set the subnet prefix.
>
>In opensm.opts file:
>
># Subnet prefix used on this subnet
>subnet_prefix 0xfe80000000000000
>
>(that's the default one)
>
>> So it may be the default one.
>>
>>>That's the main thread. It's in the following loop:
>>>
>>>   while( !osm_exit_flag ) {
>>>     if (opt.console)
>>>       osm_console(&osm);
>>>     else
>>>       cl_thread_suspend( 10000 );
>>>
>>>     if (osm_hup_flag) {
>>>       osm_hup_flag = 0;
>>>       /* a HUP signal should only start a new heavy sweep */
>>>       osm.subn.force_immediate_heavy_sweep = TRUE;
>>>       osm_opensm_sweep( &osm );
>>>     }
>>>   }
>>>
>>>What about the other threads ? What are they doing ?
>>
>>  Yes. I got this. It was in this loop. I didn't realize there are
>>other OpenSM threads running. I need to find that out.
>
>OK.
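
The loop quoted above polls two flags between sleeps, which is why thread 1 shows up in nanosleep on both nodes. How the flags get set is not shown in this thread; the sketch below assumes signal handlers set them, and every name in it (exit_flag, hup_flag, on_signal, main_loop_pass, heavy_sweeps_requested) is a stand-in for illustration, not an OpenSM identifier.

```c
#include <signal.h>
#include <stdbool.h>

static volatile sig_atomic_t exit_flag; /* stands in for osm_exit_flag */
static volatile sig_atomic_t hup_flag;  /* stands in for osm_hup_flag */
static int heavy_sweeps_requested;      /* hypothetical counter, illustration only */

/* Signal handler: only sets a flag, which is async-signal-safe. */
static void on_signal(int signo)
{
    if (signo == SIGHUP)
        hup_flag = 1;
    else
        exit_flag = 1;
}

/* Assumption: HUP requests a sweep, INT and TERM request exit. */
static void install_handlers(void)
{
    signal(SIGHUP, on_signal);
    signal(SIGINT, on_signal);
    signal(SIGTERM, on_signal);
}

/*
 * One pass of the loop body, without the sleep.
 * Returns false once an exit has been requested.
 */
static bool main_loop_pass(void)
{
    if (exit_flag)
        return false;

    if (hup_flag) {
        hup_flag = 0;
        /* The quoted loop forces a heavy sweep here; we only count it. */
        heavy_sweeps_requested++;
    }
    return true;
}
```

A caller would run something like `while (main_loop_pass()) sleep(10);`, so a main thread parked in a sleep call is the expected idle state and says little about why the sweep stalls.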
>
>>>I wouldn't expect that given the problem you're hitting. The SUBNET UP
>>>only occurs once the heavy sweep is completed. That's not happening.
>>>
>>>-- Hal
>>
>>   Is the heavy sweep supposed to happen after the failover ?
>
>The standby after determining that the master is non responsive will go
>back to discovering but in your configuration will find no other SM and
>will go to master. I think it got that far.
>
>Once it transitions to master, it does a heavy sweep to configure the
>subnet. Something is stopping that from completing. I'm not sure what is
>going wrong.
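
The failover sequence Hal describes can be sketched as a small state machine. This is a simplified model with hypothetical names (sm_state, sm_on_master_poll, and so on), not the OpenSM implementation; in particular it assumes that finding any other SM sends this one back to standby, whereas the real election rules are in the IBA specification.

```c
#include <stdbool.h>

/* Hypothetical state names that mirror the wording of the reply above. */
enum sm_state { SM_DISCOVERING, SM_STANDBY, SM_MASTER };

struct sm {
    enum sm_state state;
    bool heavy_sweep_pending; /* set on promotion to master */
    bool subnet_up;           /* set only when the heavy sweep completes */
};

/* A standby polls the master; a silent master sends it back to discovering. */
static void sm_on_master_poll(struct sm *sm, bool master_responded)
{
    if (sm->state == SM_STANDBY && !master_responded)
        sm->state = SM_DISCOVERING;
}

/* Discovery finished: with no other SM found, become master and schedule
 * the heavy sweep. Simplification: any other SM means standby. */
static void sm_on_discovery_done(struct sm *sm, bool other_sm_found)
{
    if (sm->state != SM_DISCOVERING)
        return;

    if (other_sm_found) {
        sm->state = SM_STANDBY;
    } else {
        sm->state = SM_MASTER;
        sm->heavy_sweep_pending = true;
        sm->subnet_up = false;
    }
}

/* SUBNET UP is reported only after the heavy sweep has completed. */
static void sm_on_heavy_sweep_done(struct sm *sm)
{
    if (sm->state == SM_MASTER && sm->heavy_sweep_pending) {
        sm->heavy_sweep_pending = false;
        sm->subnet_up = true;
    }
}
```

In terms of this model, Hal's reading is that the SM reached SM_MASTER but the equivalent of sm_on_heavy_sweep_done never ran, so SUBNET UP was never logged after the failover.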
>
>>   Is there any documentation on the OpenSM architecture and design ?
>
>Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
>for what an SM is supposed to do.
>
>-- Hal
>
>> VBabu



