[ofa-general] opensm dumps core when using LASH for routing
Sasha Khapyorsky
sashak at voltaire.com
Sun Jan 13 00:40:01 PST 2008
Hi Max,
On 22:07 Fri 11 Jan, Max Matveev wrote:
>
> I've got opensm 3.0.3 from OFED 1.2 dying on startup when using LASH
> for routing. Here is the trace:
>
> #0 0x0000000000459abf in get_lash_id (p_sw=0x5fbf40) at osm_ucast_lash.c:1124
> #1 0x000000000045a704 in osm_get_lash_sl (p_osm=0x7fffa0aba4d0,
> p_src_port=0x2aab06971400, p_dst_port=0x61ade0) at
> osm_ucast_lash.c:1450
> #2 0x000000000042e80d in __osm_pr_rcv_get_path_parms (p_rcv=0x7fffa0abba80,
> p_pr=0x2aab0661add0, p_src_port=0x2aab06971400,
> p_dest_port=0x61ade0,
> dest_lid_ho=2, comp_mask=580964351930793984, p_parms=0x649eef20)
> at osm_sa_path_record.c:685
> #3 0x000000000042f02b in __osm_pr_rcv_get_lid_pair_path (
> p_rcv=0x7fffa0abba80, p_pr=0x2aab0661add0,
> p_src_port=0x2aab06971400,
> p_dest_port=0x61ade0, p_dgid=0x649ef0a0, src_lid_ho=1,
> dest_lid_ho=2,
> comp_mask=580964351930793984, preference=0 '\0')
> at osm_sa_path_record.c:852
> #4 0x000000000042f5d6 in __osm_pr_rcv_get_port_pair_paths (
> p_rcv=0x7fffa0abba80, p_madw=0x6ecbb0, p_req_port=0x2aab06971400,
> p_src_port=0x2aab06971400, p_dest_port=0x61ade0,
> p_dgid=0x649ef0a0,
> comp_mask=580964351930793984, p_list=0x649ef0b0)
> at osm_sa_path_record.c:1072
> #5 0x000000000042fdc5 in __osm_pr_rcv_process_half (p_rcv=0x7fffa0abba80,
> p_madw=0x6ecbb0, requester_port=0x2aab06971400,
> p_src_port=0x2aab06971400,
> p_dest_port=0x0, p_dgid=0x649ef0a0, comp_mask=580964351930793984,
> p_list=0x649ef0b0) at osm_sa_path_record.c:1437
> #6 0x0000000000430c6f in osm_pr_rcv_process (context=0x7fffa0abba80,
> data=0x6ecbb0) at osm_sa_path_record.c:2003
> #7 0x00002b110a54ef57 in __cl_disp_worker (context=0x7fffa0abcb30)
> at cl_dispatcher.c:102
> #8 0x00002b110a5563b7 in __cl_thread_pool_routine (context=0x7fffa0abcba8)
> at cl_threadpool.c:74
> #9 0x00002b110a55620a in __cl_thread_wrapper (arg=0x5a4a40) at cl_thread.c:58
> #10 0x00002b110a21b143 in start_thread () from /lib64/libpthread.so.0
> #11 0x00002b110a82774d in clone () from /lib64/libc.so.6
> #12 0x0000000000000000 in ?? ()
>
> This is the switch:
>
> (gdb) p *( osm_switch_t *)0x5fbf40
> $3 = {map_item = {pool_item = {list_item = {p_next = 0x2aab065b4ed0,
> p_prev = 0x2aab069850f0}}, p_left = 0x2aab069850f0,
> p_right = 0x2aab065b4ed0, p_up = 0x2aab06587cd0, color =
> CL_MAP_BLACK,
> key = 17582052945261297672}, p_node = 0x608c80, switch_info = {
> lin_cap = 192, rand_cap = 0, mcast_cap = 4, lin_top = 32769,
> def_port = 0 '\0', def_mcast_pri_port = 0 '\0',
> def_mcast_not_port = 0 '\0', life_state = 144 '\220',
> lids_per_port = 0,
> enforce_cap = 8192, flags = 240 '\360'}, max_lid_ho = 0,
> num_ports = 25 '\031', num_hops = 0, hops = 0x0, p_prof = 0x5fbff0,
> fwd_tbl = {p_rnd_tbl = 0x0, p_lin_tbl = 0x7dd0f0}, mcast_tbl = {
> num_ports = 25 '\031', max_position = 1 '\001', max_block = 31,
> max_block_in_use = -1, num_entries = 1024, max_mlid_ho = 50176,
> p_mask_tbl = 0x7e9100}, discovery_count = 3, priv = 0x0}
>
> As you can see, the priv pointer is NULL; get_lash_id() follows it
> and dies.
>
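For reference, get_lash_id() in osm_ucast_lash.c is essentially a single
dereference through the switch's priv pointer, so a NULL priv kills it on
the spot. A minimal sketch, paraphrased from the 3.0.x source as far as I
can tell (switch_t being LASH's private per-switch struct):

        static int get_lash_id(osm_switch_t * p_sw)
        {
                /* priv is set when LASH processes the switch; if the
                 * switch has not been through LASH setup yet, priv is
                 * NULL and this dereference faults */
                return ((switch_t *) p_sw->priv)->id;
        }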
> There is an obvious fix: check priv in osm_get_lash_sl() and return
> OSM_DEFAULT_SL if it is NULL - the function already does this when
> looking up src_id, but not for dst_id. I'm not sure it's the right
> fix, though, because I cannot quite understand how priv got to be
> NULL - it was reset in lash_cleanup()
I suspect that the failure scenario is different. This switch was just
connected/discovered by OpenSM (it still has hops = 0x0, which indicates
that it has not yet passed the lid matrix generation stage) and is still
uninitialized by LASH. If that is really the case, checking ->priv for
NULL looks like a valid fix.
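A minimal sketch of such a guard, mirroring the src_id check that
osm_get_lash_sl() already performs - note that the port-to-switch lookup
helper named here is an assumption about that file, not a confirmed
signature:

        /* inside osm_get_lash_sl(): hypothetical guard on both lookups */
        p_sw = get_osm_switch_from_port(p_dst_port);
        if (!p_sw || !p_sw->priv)       /* switch not processed by LASH yet */
                return OSM_DEFAULT_SL;  /* fall back to the default SL */
        dst_id = get_lash_id(p_sw);

        p_sw = get_osm_switch_from_port(p_src_port);
        if (!p_sw || !p_sw->priv)       /* same guard on the source side */
                return OSM_DEFAULT_SL;
        src_id = get_lash_id(p_sw);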
Is this a reproducible failure?
Sasha
> but I don't see any threads inside discover_network_properties(), and
> I would have thought that once opensm gets out of there, all switches
> would be properly initialized.
>
> I'm also not sure who initiated the PATH_RECORD query - it does not
> look like something opensm would do to itself, yet the requester_port
> was on the same HCA. For all I know, it could be another process
> running on the same host.
>
> max