[ofa-general] opensm dumps core when using LASH for routing

Max Matveev makc at sgi.com
Fri Jan 11 03:07:39 PST 2008


I've got opensm 3.0.3 from OFED 1.2 dying on startup when using LASH
for routing.  Here is the trace:

#0  0x0000000000459abf in get_lash_id (p_sw=0x5fbf40) at osm_ucast_lash.c:1124
#1  0x000000000045a704 in osm_get_lash_sl (p_osm=0x7fffa0aba4d0,
    p_src_port=0x2aab06971400, p_dst_port=0x61ade0) at
    osm_ucast_lash.c:1450
#2  0x000000000042e80d in __osm_pr_rcv_get_path_parms(p_rcv=0x7fffa0abba80,
    p_pr=0x2aab0661add0, p_src_port=0x2aab06971400,
    p_dest_port=0x61ade0,
    dest_lid_ho=2, comp_mask=580964351930793984, p_parms=0x649eef20)
    at osm_sa_path_record.c:685
#3  0x000000000042f02b in __osm_pr_rcv_get_lid_pair_path (
    p_rcv=0x7fffa0abba80, p_pr=0x2aab0661add0,
    p_src_port=0x2aab06971400,
    p_dest_port=0x61ade0, p_dgid=0x649ef0a0, src_lid_ho=1,
    dest_lid_ho=2,
    comp_mask=580964351930793984, preference=0 '\0')
    at osm_sa_path_record.c:852
#4  0x000000000042f5d6 in __osm_pr_rcv_get_port_pair_paths (
    p_rcv=0x7fffa0abba80, p_madw=0x6ecbb0, p_req_port=0x2aab06971400,
    p_src_port=0x2aab06971400, p_dest_port=0x61ade0,
    p_dgid=0x649ef0a0,
    comp_mask=580964351930793984, p_list=0x649ef0b0)
    at osm_sa_path_record.c:1072
#5  0x000000000042fdc5 in __osm_pr_rcv_process_half(p_rcv=0x7fffa0abba80,
    p_madw=0x6ecbb0, requester_port=0x2aab06971400,
    p_src_port=0x2aab06971400,
    p_dest_port=0x0, p_dgid=0x649ef0a0, comp_mask=580964351930793984,
    p_list=0x649ef0b0) at osm_sa_path_record.c:1437
#6  0x0000000000430c6f in osm_pr_rcv_process (context=0x7fffa0abba80,
    data=0x6ecbb0) at osm_sa_path_record.c:2003
#7  0x00002b110a54ef57 in __cl_disp_worker (context=0x7fffa0abcb30)
    at cl_dispatcher.c:102
#8  0x00002b110a5563b7 in __cl_thread_pool_routine(context=0x7fffa0abcba8)
    at cl_threadpool.c:74
#9  0x00002b110a55620a in __cl_thread_wrapper (arg=0x5a4a40) at cl_thread.c:58
#10 0x00002b110a21b143 in start_thread () from /lib64/libpthread.so.0
#11 0x00002b110a82774d in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

This is the switch:

(gdb) p *( osm_switch_t *)0x5fbf40
$3 = {map_item = {pool_item = {list_item = {p_next = 0x2aab065b4ed0,
        p_prev = 0x2aab069850f0}}, p_left = 0x2aab069850f0,
    p_right = 0x2aab065b4ed0, p_up = 0x2aab06587cd0, color =
    CL_MAP_BLACK,
    key = 17582052945261297672}, p_node = 0x608c80, switch_info = {
    lin_cap = 192, rand_cap = 0, mcast_cap = 4, lin_top = 32769,
    def_port = 0 '\0', def_mcast_pri_port = 0 '\0',
    def_mcast_not_port = 0 '\0', life_state = 144 '\220',
    lids_per_port = 0,
    enforce_cap = 8192, flags = 240 ''}, max_lid_ho = 0,
  num_ports = 25 '\031', num_hops = 0, hops = 0x0, p_prof = 0x5fbff0,
  fwd_tbl = {p_rnd_tbl = 0x0, p_lin_tbl = 0x7dd0f0}, mcast_tbl = {
    num_ports = 25 '\031', max_position = 1 '\001', max_block = 31,
    max_block_in_use = -1, num_entries = 1024, max_mlid_ho = 50176,
    p_mask_tbl = 0x7e9100}, discovery_count = 3, priv = 0x0}

As you can see, the priv pointer is NULL; get_lash_id() follows it and
dies.

There is an obvious fix: simply check for priv in osm_get_lash_sl()
and return OSM_DEFAULT_SL. It already does that when checking for
src_id but not for dst_id. But I'm not sure it's the right fix, because
I cannot quite understand how priv got to be NULL. It was reset in
lash_cleanup(), but I don't see any threads which are inside
discover_network_properties(), and I would've thought that by the time
opensm gets out of there, all switches must be initialized properly.
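
To make the idea concrete, here is a rough sketch of the check I have in
mind. It is not the actual osm_ucast_lash.c code: sw_t, lash_switch_t,
get_lash_id_checked(), lookup_sl(), the NUM_SW table and the DEFAULT_SL
value are all simplified stand-ins I made up for illustration. The only
point is that both the source and the destination id are validated
before they are used as indices.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SW     2   /* hypothetical fabric size for this sketch */
#define DEFAULT_SL 0   /* stand-in for OSM_DEFAULT_SL; value assumed */

/* Hypothetical, trimmed-down stand-ins for the real opensm types. */
typedef struct lash_switch { int id; } lash_switch_t;
typedef struct sw { void *priv; } sw_t;

/* Return the LASH id of a switch, or -1 if priv was never set
 * (or has already been reset by a cleanup pass). */
static int get_lash_id_checked(const sw_t *p_sw)
{
    const lash_switch_t *p = NULL;

    if (p_sw != NULL)
        p = (const lash_switch_t *)p_sw->priv;
    return (p != NULL) ? p->id : -1;
}

/* Look up the SL for a src/dst switch pair, falling back to the
 * default SL when either side has no usable LASH id. */
static uint8_t lookup_sl(const uint8_t table[NUM_SW][NUM_SW],
                         const sw_t *p_src, const sw_t *p_dst)
{
    int src_id = get_lash_id_checked(p_src);
    int dst_id = get_lash_id_checked(p_dst);

    if (src_id < 0 || src_id >= NUM_SW ||
        dst_id < 0 || dst_id >= NUM_SW)
        return DEFAULT_SL;
    return table[src_id][dst_id];
}
```

This only papers over the crash; it says nothing about why priv is NULL
at the time the query arrives.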

I'm also not sure who initiated the PATH_RECORD query. It does not look
like opensm would send it to itself, yet the requester_port was on the
same HCA. It could be another process running on the same host, for
all I know.

max


