[ofa-general] opensm dumps core when using LASH for routing

Sasha Khapyorsky sashak at voltaire.com
Sun Jan 13 00:40:01 PST 2008


Hi Max,

On 22:07 Fri 11 Jan     , Max Matveev wrote:
> 
> I've got opensm 3.0.3 from OFED 1.2 dying on startup when using LASH
> for routing.  Here is the trace:
> 
> #0  0x0000000000459abf in get_lash_id (p_sw=0x5fbf40) at osm_ucast_lash.c:1124
> #1  0x000000000045a704 in osm_get_lash_sl (p_osm=0x7fffa0aba4d0,
>     p_src_port=0x2aab06971400, p_dst_port=0x61ade0) at
>     osm_ucast_lash.c:1450
> #2  0x000000000042e80d in __osm_pr_rcv_get_path_parms(p_rcv=0x7fffa0abba80,
>     p_pr=0x2aab0661add0, p_src_port=0x2aab06971400,
>     p_dest_port=0x61ade0,
>     dest_lid_ho=2, comp_mask=580964351930793984, p_parms=0x649eef20)
>     at osm_sa_path_record.c:685
> #3  0x000000000042f02b in __osm_pr_rcv_get_lid_pair_path (
>     p_rcv=0x7fffa0abba80, p_pr=0x2aab0661add0,
>     p_src_port=0x2aab06971400,
>     p_dest_port=0x61ade0, p_dgid=0x649ef0a0, src_lid_ho=1,
>     dest_lid_ho=2,
>     comp_mask=580964351930793984, preference=0 '\0')
>     at osm_sa_path_record.c:852
> #4  0x000000000042f5d6 in __osm_pr_rcv_get_port_pair_paths (
>     p_rcv=0x7fffa0abba80, p_madw=0x6ecbb0, p_req_port=0x2aab06971400,
>     p_src_port=0x2aab06971400, p_dest_port=0x61ade0,
>     p_dgid=0x649ef0a0,
>     comp_mask=580964351930793984, p_list=0x649ef0b0)
>     at osm_sa_path_record.c:1072
> #5  0x000000000042fdc5 in __osm_pr_rcv_process_half(p_rcv=0x7fffa0abba80,
>     p_madw=0x6ecbb0, requester_port=0x2aab06971400,
>     p_src_port=0x2aab06971400,
>     p_dest_port=0x0, p_dgid=0x649ef0a0, comp_mask=580964351930793984,
>     p_list=0x649ef0b0) at osm_sa_path_record.c:1437
> #6  0x0000000000430c6f in osm_pr_rcv_process (context=0x7fffa0abba80,
>     data=0x6ecbb0) at osm_sa_path_record.c:2003
> #7  0x00002b110a54ef57 in __cl_disp_worker (context=0x7fffa0abcb30)
>     at cl_dispatcher.c:102
> #8  0x00002b110a5563b7 in __cl_thread_pool_routine(context=0x7fffa0abcba8)
>     at cl_threadpool.c:74
> #9  0x00002b110a55620a in __cl_thread_wrapper (arg=0x5a4a40) at cl_thread.c:58
> #10 0x00002b110a21b143 in start_thread () from /lib64/libpthread.so.0
> #11 0x00002b110a82774d in clone () from /lib64/libc.so.6
> #12 0x0000000000000000 in ?? ()
> 
> This is the switch:
> 
> (gdb) p *( osm_switch_t *)0x5fbf40
> $3 = {map_item = {pool_item = {list_item = {p_next = 0x2aab065b4ed0,
>         p_prev = 0x2aab069850f0}}, p_left = 0x2aab069850f0,
>     p_right = 0x2aab065b4ed0, p_up = 0x2aab06587cd0, color =
>     CL_MAP_BLACK,
>     key = 17582052945261297672}, p_node = 0x608c80, switch_info = {
>     lin_cap = 192, rand_cap = 0, mcast_cap = 4, lin_top = 32769,
>     def_port = 0 '\0', def_mcast_pri_port = 0 '\0',
>     def_mcast_not_port = 0 '\0', life_state = 144 '\220',
>     lids_per_port = 0,
>     enforce_cap = 8192, flags = 240 ''}, max_lid_ho = 0,
>   num_ports = 25 '\031', num_hops = 0, hops = 0x0, p_prof = 0x5fbff0,
>   fwd_tbl = {p_rnd_tbl = 0x0, p_lin_tbl = 0x7dd0f0}, mcast_tbl = {
>     num_ports = 25 '\031', max_position = 1 '\001', max_block = 31,
>     max_block_in_use = -1, num_entries = 1024, max_mlid_ho = 50176,
>     p_mask_tbl = 0x7e9100}, discovery_count = 3, priv = 0x0}
> 
> As you can see, the priv pointer is NULL; get_lash_id() follows it and
> dies.
> 
> There is an obvious fix: check priv in osm_get_lash_sl() and return
> OSM_DEFAULT_SL. It already does this when checking src_id but not
> dst_id. I'm not sure it's the right fix, though, because I cannot
> quite understand how priv got to be NULL - it was reset in
> lash_cleanup()

I suspect that the failure scenario is different. This switch was just
connected and discovered by OpenSM (it still has hops = 0x0, which
indicates that it has not yet passed the lid matrix generation stage), so
it is probably still uninitialized by LASH. If that is really the case,
checking ->priv for NULL looks like a valid fix.
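
A minimal sketch of such a check, assuming simplified stand-in types
rather than the real osm_switch_t and LASH context. All names below
(sw_t, lash_switch_t, get_lash_id_checked, get_sl_checked, DEFAULT_SL)
are hypothetical and are not taken from osm_ucast_lash.c; the point is
only that both the source and the destination switch are tested before
->priv is followed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for OSM_DEFAULT_SL. */
#define DEFAULT_SL 0

/* Hypothetical per-switch LASH data, normally reached through ->priv. */
typedef struct lash_switch {
	unsigned id;
} lash_switch_t;

/* Hypothetical, stripped-down switch object; only priv matters here. */
typedef struct sw {
	void *priv;
} sw_t;

/* Return the LASH id, or -1 if the switch has no LASH data yet. */
static int get_lash_id_checked(const sw_t *p_sw)
{
	if (!p_sw || !p_sw->priv)
		return -1;
	return (int)((const lash_switch_t *)p_sw->priv)->id;
}

/*
 * Look up the SL for a src/dst switch pair in a flat n x n table.
 * Falls back to DEFAULT_SL when either switch is not initialized by
 * LASH (priv == NULL) or its id is outside the table.
 */
static uint8_t get_sl_checked(const sw_t *p_src, const sw_t *p_dst,
			      const uint8_t *sl_table, unsigned n)
{
	int src_id = get_lash_id_checked(p_src);
	int dst_id = get_lash_id_checked(p_dst);

	if (src_id < 0 || dst_id < 0)
		return DEFAULT_SL;
	if ((unsigned)src_id >= n || (unsigned)dst_id >= n)
		return DEFAULT_SL;
	return sl_table[(unsigned)src_id * n + (unsigned)dst_id];
}
```

With this shape a freshly discovered switch (priv still NULL) on either
side of the path yields the default SL instead of a crash.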

Is this failure reproducible?

Sasha

> but I don't see any threads which are inside
> discover_network_properties() and I would've thought that when opensm
> gets out of there, all switches must be initialized properly.
> 
> I'm also not sure who initiated the PATH_RECORD query - it does not look
> like opensm would do it to itself, yet the requester_port was on the
> same HCA. It could be another process running on the same host for
> all I know.
> 
> max


