[openfabrics-ewg] Re: [openib-general] OpenSM segmentation fault on RC5

Hal Rosenstock halr at voltaire.com
Fri May 26 10:43:23 PDT 2006


Hi Don,

On Fri, 2006-05-26 at 13:34, Don.Albert at Bull.com wrote:
> Hal,
> 
> > Hi again Paul,
> 
> Since your last message was addressed to Paul, and you said my problem
> was completely different, I don't know if a backtrace would help in my
> case, but here it is anyway, just in case. (See below.)
> 
> > 
> > Would you rebuild OpenSM with debug:
> > ./configure --enable-debug && make clean && make && make install
> > 
> > and then run opensm under gdb and provide the backtrace after the
> > failure?
> > 
> > Thanks.
> > 
> > -- Hal
> 
> I can also rebuild with --enable-debug if it would be useful.
> 
>         -Don Albert-
> 
> Backtrace of segfault in SM:
> 
> [koa] (ib) ib> gdb /usr/local/ofed/bin/opensm
> GNU gdb Red Hat Linux (6.3.0.0-1.96rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and
> you are
> welcome to change it and/or distribute copies of it under certain
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for
> details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging
> symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> 
> (gdb) run
> Starting program: /usr/local/ofed/bin/opensm
> [Thread debugging using libthread_db enabled]
> [New Thread 47576487182656 (LWP 8030)]
> [New Thread 1082132832 (LWP 8033)]
> -------------------------------------------------
> OpenSM Rev:openib-1.2.0
> Based on OpenIB svn Exported revision
> Command Line Arguments:
>  Log File: /var/log/osm.log
> -------------------------------------------------
> OpenSM Rev:openib-1.2.0 OpenIB svn Exported revision
> 
> [New Thread 1090525536 (LWP 8034)]
> [New Thread 1098918240 (LWP 8035)]
> [New Thread 1107310944 (LWP 8036)]
> [New Thread 1115703648 (LWP 8037)]
> [New Thread 1124096352 (LWP 8038)]
> [New Thread 1132489056 (LWP 8039)]
> Using default GUID 0x2c90200216dc5
> [New Thread 1140881760 (LWP 8040)]
> Entering MASTER state
> 
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1090525536 (LWP 8034)]
> 0x000000000040b5bb in osm_physp_is_valid ()
> (gdb) bt
> #0  0x000000000040b5bb in osm_physp_is_valid ()
> #1  0x000000000040b555 in __osm_lid_mgr_set_remote_pi_state_to_init ()
> #2  0x000000000040babf in __osm_lid_mgr_set_physp_pi ()
> #3  0x000000000040c065 in __osm_lid_mgr_process_our_sm_node ()
> #4  0x000000000040c151 in osm_lid_mgr_process_sm ()
> #5  0x000000000043a2b9 in osm_state_mgr_process ()
> #6  0x000000000043aefc in __osm_state_mgr_ctrl_disp_callback ()
> #7  0x00002b454359db27 in __cl_disp_worker (context=0x57ca20) at cl_dispatcher.c:108
> #8  0x00002b45435a6025 in __cl_thread_pool_routine (context=0x57ca98) at cl_threadpool.c:78
> #9  0x00002b45435a5e6e in __cl_thread_wrapper (arg=0x57d7d0) at cl_thread.c:61
> #10 0x0000003a80f0610a in start_thread () from /lib64/tls/libpthread.so.0
> #11 0x0000003a806c6003 in clone () from /lib64/tls/libc.so.6
> #12 0x0000000000000000 in ?? ()
> (gdb)

Yes, that is very useful. I had been trying to work out what the
problem was, and this narrows it down to something I suspected might be
going on.

It looks like you are running back-to-back HCAs, right?

It also looks to me like your remote (from OpenSM's point of view) CA
node is not responding to SMA requests such as SubnGet(NodeInfo), yet
the link is active. Can you describe what state that node is in (what
modules are loaded, etc.)? Can you run ibstat/ibstatus on that node?

Can you try this patch to see if it gets you further, and let me know?
Note that this is only a potential workaround for now.

Thanks.

-- Hal

Index: opensm/osm_lid_mgr.c
===================================================================
--- opensm/osm_lid_mgr.c	(revision 7412)
+++ opensm/osm_lid_mgr.c	(working copy)
@@ -932,6 +932,9 @@ __osm_lid_mgr_set_remote_pi_state_to_ini
 
   CL_ASSERT(p_rem_physp);
 
+  if ( p_rem_physp == NULL )
+    return;
+
   if (osm_physp_is_valid( p_rem_physp ))
   {
     p_pi = osm_physp_get_port_info_ptr( p_rem_physp );
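
As a rough illustration of why the patch adds an explicit NULL check
right after the existing CL_ASSERT: the sketch below assumes that the
assertion macro expands to nothing in a non-debug build, as debug-only
assertion macros usually do, so a NULL remote port pointer would
otherwise be dereferenced. Every name in it (SKETCH_ASSERT,
SKETCH_DEBUG, struct sketch_physp, sketch_set_remote_state_to_init) is
hypothetical and heavily simplified; it is not OpenSM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a debug-only assertion macro: it checks
 * the expression only when SKETCH_DEBUG is defined and otherwise
 * expands to nothing. */
#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(expr) assert(expr)
#else
#define SKETCH_ASSERT(expr) ((void)0)
#endif

/* Hypothetical, simplified physical-port record (not the OpenSM type). */
struct sketch_physp {
    bool valid;
    int  port_state;
};

/* Illustrative port-state values for this sketch. */
enum { SKETCH_STATE_INIT = 2, SKETCH_STATE_ACTIVE = 4 };

/* Reset the remote port's state to INIT.
 * Returns true if the state was changed, false if there was no usable
 * remote port to act on. */
static bool sketch_set_remote_state_to_init(struct sketch_physp *p_rem_physp)
{
    /* Catches the problem in a debug build only. */
    SKETCH_ASSERT(p_rem_physp);

    /* Guard for non-debug builds, where the assertion above is a no-op
     * and the dereference below would otherwise fault on NULL. */
    if (p_rem_physp == NULL)
        return false;

    if (!p_rem_physp->valid)
        return false;

    p_rem_physp->port_state = SKETCH_STATE_INIT;
    return true;
}
```

With SKETCH_DEBUG undefined, calling
sketch_set_remote_state_to_init(NULL) simply returns false instead of
faulting, while a valid port has its state reset to INIT. The guard
only avoids the crash; it does not explain why the remote port pointer
is NULL in the first place.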