[ofa-general] opensm dumps core when using LASH for routing
Max Matveev
makc at sgi.com
Fri Jan 11 03:07:39 PST 2008
I've got opensm 3.0.3 from OFED 1.2 dying on startup when using LASH
for routing. Here is the trace:
#0 0x0000000000459abf in get_lash_id (p_sw=0x5fbf40) at osm_ucast_lash.c:1124
#1 0x000000000045a704 in osm_get_lash_sl (p_osm=0x7fffa0aba4d0,
p_src_port=0x2aab06971400, p_dst_port=0x61ade0) at
osm_ucast_lash.c:1450
#2 0x000000000042e80d in __osm_pr_rcv_get_path_parms(p_rcv=0x7fffa0abba80,
p_pr=0x2aab0661add0, p_src_port=0x2aab06971400,
p_dest_port=0x61ade0,
dest_lid_ho=2, comp_mask=580964351930793984, p_parms=0x649eef20)
at osm_sa_path_record.c:685
#3 0x000000000042f02b in __osm_pr_rcv_get_lid_pair_path (
p_rcv=0x7fffa0abba80, p_pr=0x2aab0661add0,
p_src_port=0x2aab06971400,
p_dest_port=0x61ade0, p_dgid=0x649ef0a0, src_lid_ho=1,
dest_lid_ho=2,
comp_mask=580964351930793984, preference=0 '\0')
at osm_sa_path_record.c:852
#4 0x000000000042f5d6 in __osm_pr_rcv_get_port_pair_paths (
p_rcv=0x7fffa0abba80, p_madw=0x6ecbb0, p_req_port=0x2aab06971400,
p_src_port=0x2aab06971400, p_dest_port=0x61ade0,
p_dgid=0x649ef0a0,
comp_mask=580964351930793984, p_list=0x649ef0b0)
at osm_sa_path_record.c:1072
#5 0x000000000042fdc5 in __osm_pr_rcv_process_half(p_rcv=0x7fffa0abba80,
p_madw=0x6ecbb0, requester_port=0x2aab06971400,
p_src_port=0x2aab06971400,
p_dest_port=0x0, p_dgid=0x649ef0a0, comp_mask=580964351930793984,
p_list=0x649ef0b0) at osm_sa_path_record.c:1437
#6 0x0000000000430c6f in osm_pr_rcv_process (context=0x7fffa0abba80,
data=0x6ecbb0) at osm_sa_path_record.c:2003
#7 0x00002b110a54ef57 in __cl_disp_worker (context=0x7fffa0abcb30)
at cl_dispatcher.c:102
#8 0x00002b110a5563b7 in __cl_thread_pool_routine(context=0x7fffa0abcba8)
at cl_threadpool.c:74
#9 0x00002b110a55620a in __cl_thread_wrapper (arg=0x5a4a40) at cl_thread.c:58
#10 0x00002b110a21b143 in start_thread () from /lib64/libpthread.so.0
#11 0x00002b110a82774d in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
This is the switch:
(gdb) p *( osm_switch_t *)0x5fbf40
$3 = {map_item = {pool_item = {list_item = {p_next = 0x2aab065b4ed0,
p_prev = 0x2aab069850f0}}, p_left = 0x2aab069850f0,
p_right = 0x2aab065b4ed0, p_up = 0x2aab06587cd0, color =
CL_MAP_BLACK,
key = 17582052945261297672}, p_node = 0x608c80, switch_info = {
lin_cap = 192, rand_cap = 0, mcast_cap = 4, lin_top = 32769,
def_port = 0 '\0', def_mcast_pri_port = 0 '\0',
def_mcast_not_port = 0 '\0', life_state = 144 '\220',
lids_per_port = 0,
enforce_cap = 8192, flags = 240 ''}, max_lid_ho = 0,
num_ports = 25 '\031', num_hops = 0, hops = 0x0, p_prof = 0x5fbff0,
fwd_tbl = {p_rnd_tbl = 0x0, p_lin_tbl = 0x7dd0f0}, mcast_tbl = {
num_ports = 25 '\031', max_position = 1 '\001', max_block = 31,
max_block_in_use = -1, num_entries = 1024, max_mlid_ho = 50176,
p_mask_tbl = 0x7e9100}, discovery_count = 3, priv = 0x0}
As you can see, the priv pointer is NULL; get_lash_id() follows it and
dies.
There is an obvious fix: simply check for priv in osm_get_lash_sl()
and return OSM_DEFAULT_SL. It already does that when checking for src_id
but not for dst_id. I'm not sure it's the right fix, though, because I
cannot quite understand how priv got to be NULL. It was reset in
lash_cleanup(), but I don't see any threads which are inside
discover_network_properties(), and I would have thought that by the time
opensm gets out of there, all switches must be initialized properly.
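
To make the proposed check concrete, here is a minimal sketch of the guard,
not the actual opensm code. The types, the sketch_* names, the value of
DEFAULT_SL and the SL table are all hypothetical stand-ins; only the idea
(return the default SL when priv is NULL for either the source or the
destination switch) comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for OSM_DEFAULT_SL; the value 0 is an assumption. */
#define DEFAULT_SL 0

/* Hypothetical, pared-down stand-ins for the opensm structures. */
typedef struct lash_switch { unsigned id; } lash_switch_t;
typedef struct sketch_switch { void *priv; } sketch_switch_t;

/* Placeholder SL table indexed by [src_id][dst_id]; contents are made up. */
static const uint8_t sketch_sl_table[2][2] = { { 0, 2 }, { 2, 0 } };

/* Like get_lash_id() in the trace: follows priv without checking it. */
static unsigned sketch_get_lash_id(const sketch_switch_t *p_sw)
{
    return ((const lash_switch_t *)p_sw->priv)->id;
}

/* Guard both ends before dereferencing priv. */
static uint8_t sketch_get_sl(const sketch_switch_t *p_src_sw,
                             const sketch_switch_t *p_dst_sw)
{
    unsigned src_id, dst_id;

    if (!p_src_sw || !p_src_sw->priv)
        return DEFAULT_SL;
    if (!p_dst_sw || !p_dst_sw->priv)   /* the check missing for dst */
        return DEFAULT_SL;

    src_id = sketch_get_lash_id(p_src_sw);
    dst_id = sketch_get_lash_id(p_dst_sw);
    if (src_id >= 2 || dst_id >= 2)     /* stay inside the toy table */
        return DEFAULT_SL;

    return sketch_sl_table[src_id][dst_id];
}
```

This only hides the crash; it does not explain why a switch reachable from
the path record handler has priv == NULL in the first place.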
I'm also not sure who initiated the PATH_RECORD query: it does not look
like opensm would do it to itself, yet the requester_port was on the
same HCA. It could be another process running on the same host, for
all I know.
max