[openib-general] opensm segfault?

Hal Rosenstock halr at voltaire.com
Tue May 16 13:00:32 PDT 2006


Hi Troy,

On Tue, 2006-05-16 at 15:41, Troy Benjegerdes wrote:
> I got this after an indeterminate amount of time running opensm..
> 
> 
> (gdb) bt
> #0  0x00002b90b0dbebf3 in cl_memcpy (p_dest=0x2aaaaac88850, p_src=0x0,
                                                              ^^^^^^^^^
This is the problem: p_src is NULL, so the 64-byte copy faults on the
source pointer. Not sure why it is NULL yet.

>     count=64) at cl_memory_osd.c:87
> #1  0x0000000000415053 in osm_pkey_tbl_sync_new_blocks (
>     p_pkey_tbl=0x2aaaaad99228) at osm_pkey.c:127
> #2  0x0000000000416687 in osm_pkey_mgr_process (p_osm=0x580e40)
>     at osm_pkey_mgr.c:407
> #3  0x000000000043bb22 in osm_state_mgr_process (p_mgr=0x581ad8, signal=3)
>     at osm_state_mgr.c:2243
> #4  0x000000000043c88f in __osm_state_mgr_ctrl_disp_callback (
>     context=0x5819e8, p_data=0x3) at osm_state_mgr_ctrl.c:70
> #5  0x00002b90b0db9437 in __cl_disp_worker (context=0x5831f0)
>     at cl_dispatcher.c:108
> #6  0x00002b90b0dc1ca3 in __cl_thread_pool_routine (context=0x583268)
>     at cl_threadpool.c:78
> #7  0x00002b90b0dc1ae2 in __cl_thread_wrapper (arg=0x584750)
>     at cl_thread.c:61
> #8  0x00002b90b0fe3b1c in start_thread () from /lib/libpthread.so.0
> #9  0x00002b90b12c8273 in clone () from /lib/libc.so.6
> 
> 
> 
> And why the heck is "cl_memcpy" just a call to 'memcpy' anyway? This
> just seems like excessive unneeded abstraction.

It's part of the component library (complib), which is an OS abstraction
layer. The wrapper is there for portability, even where it maps straight
onto memcpy.
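
To illustrate the idea of such a wrapper, and how a NULL source like the
one in frame #0 could be caught rather than dereferenced, here is a
minimal sketch. checked_copy and the status codes are hypothetical names
made up for this example; this is not the complib code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch; not OpenSM's own. */
enum copy_status { COPY_OK = 0, COPY_NULL_ARG = -1 };

/*
 * Hypothetical wrapper, NOT the complib cl_memcpy: it hides memcpy
 * behind a portable name and adds a NULL check, so a missing source
 * block is reported to the caller instead of being dereferenced.
 */
static enum copy_status checked_copy(void *p_dest, const void *p_src,
                                     size_t count)
{
    if (p_dest == NULL || p_src == NULL)
        return COPY_NULL_ARG;
    memcpy(p_dest, p_src, count);
    return COPY_OK;
}

/* Usage: a 64-byte copy like the one in frame #0, without and with a source. */
static void checked_copy_example(void)
{
    unsigned char src[64], dest[64];

    memset(src, 0xab, sizeof(src));
    memset(dest, 0, sizeof(dest));

    /* NULL source: rejected, destination left untouched. */
    assert(checked_copy(dest, NULL, sizeof(dest)) == COPY_NULL_ARG);
    assert(dest[0] == 0);

    /* Valid source: copied byte for byte. */
    assert(checked_copy(dest, src, sizeof(dest)) == COPY_OK);
    assert(memcmp(dest, src, sizeof(dest)) == 0);
}
```

A check like this only turns the crash into an error return. Why the
block pointer is NULL in osm_pkey_tbl_sync_new_blocks still needs to be
tracked down.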

> I'm running opensm from subversion rev 7091.. 
> 
> May 10 16:27:53 145969 [0000] -> OpenSM Rev:openib-1.2.0 OpenIB svn 6251:7091M
> 
> the only local changes are as follows:
> 
> troy@opteron1:/usr/src/openib-src/userspace/management$ svn diff
> Index: osm/opensm/osm_port_info_rcv.c
> ===================================================================
> --- osm/opensm/osm_port_info_rcv.c      (revision 7091)
> +++ osm/opensm/osm_port_info_rcv.c      (working copy)
> @@ -469,9 +469,14 @@
>        goto Exit;
>      }
> 
> +#if 0
>      /* Check for IBM eHCA firmware defect in reporting partition
>  * enforcement cap */
>      if (cl_ntoh32(ib_node_info_get_vendor_id(&p_node->node_info)) == IBM_VENDOR_ID)
>        p_switch->switch_info.enforce_cap = 0;
> +#endif
> +    /* Check for busted divergenet switch on ameslab network */
> +    if (cl_ntoh64(p_node->node_info.node_guid) == 0x00084e0000000152)
> +           p_switch->switch_info.enforce_cap = 0;
> 
>      /* Bail out if this is a switch with no partition enforcement
>  * capability */
>      if (cl_ntoh16(p_switch->switch_info.enforce_cap) == 0)

Yes, that's fine.
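
The GUID check works because node_guid is kept in network byte order and
cl_ntoh64 converts it to host order before the comparison with the
constant. A standalone sketch of the same comparison follows;
be64_to_host and is_quirk_switch are hypothetical names for
illustration, not OpenSM functions, and the byte-array input is an
assumption made to keep the example independent of host endianness.

```c
#include <assert.h>
#include <stdint.h>

/* GUID taken from the local patch above, written in host order. */
#define QUIRK_SWITCH_GUID 0x00084e0000000152ULL

/*
 * Hypothetical stand-in for cl_ntoh64: assembles 8 network-order
 * (big-endian) bytes into a host-order 64-bit value.
 */
static uint64_t be64_to_host(const unsigned char bytes[8])
{
    uint64_t value = 0;
    int i;

    for (i = 0; i < 8; i++)
        value = (value << 8) | bytes[i];
    return value;
}

/* Returns 1 if the network-order GUID is the switch singled out above. */
static int is_quirk_switch(const unsigned char guid_be[8])
{
    return be64_to_host(guid_be) == QUIRK_SWITCH_GUID;
}

/* Usage: the patched GUID matches, a GUID differing in one byte does not. */
static void guid_check_example(void)
{
    const unsigned char match[8] = {0x00, 0x08, 0x4e, 0x00,
                                    0x00, 0x00, 0x01, 0x52};
    const unsigned char other[8] = {0x00, 0x08, 0x4e, 0x00,
                                    0x00, 0x00, 0x01, 0x53};

    assert(is_quirk_switch(match));
    assert(!is_quirk_switch(other));
}
```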

-- Hal



