[ofa-general] Re: [PATCH] opensm: enforce routing paths rebalancing on switch reconnection
Sasha Khapyorsky
sashak at voltaire.com
Wed Mar 5 10:22:12 PST 2008
On 09:10 Wed 05 Mar , Al Chu wrote:
>
> I can't restart opensm on that cluster at this time. I don't recall any
> port errors. However, I do recall seeing this output from
> __osm_state_mgr_light_sweep_start():
>
>   OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>           "ERR 0108: "
>           "Unknown remote side for node 0x%016" PRIx64
>           "(%s) port %u. Adding to light sweep sampling list\n",
>           cl_ntoh64(osm_node_get_node_guid(p_node)),
>           p_node->print_desc, port_num);
>
> leading to a call to __osm_state_mgr_get_remote_port_info(), leading to
> what I fixed in osm_pi_rcv_process().
Yes, this is a valid (handled) scenario.
What I cannot understand is why it doesn't reach
__osm_pi_rcv_process_switch_port() (where the ignore_existing_lfts flag
should be enforced in accordance with port state) after querying ports
with "unknown" remotes during a light sweep.
I did some experiments with ibsim and am still not able to reproduce
this. I'm afraid there could be some hidden bug that I have not been
able to catch yet.
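To make the expected path concrete, here is a rough C sketch of the kind of check meant above. It is not the real opensm code: the struct, the function name process_switch_port() and the exact condition are hypothetical, and it assumes that "in accordance with port state" means "a switch port that was DOWN now reports INIT or better".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Logical port states, numbered as in the InfiniBand PortInfo:PortState field. */
enum port_state { PORT_DOWN = 1, PORT_INIT = 2, PORT_ARMED = 3, PORT_ACTIVE = 4 };

/* Hypothetical, stripped-down stand-in for the subnet object. */
struct subnet {
	bool ignore_existing_lfts;
};

/*
 * Hypothetical stand-in for the switch-port branch of PortInfo processing.
 * Assumption: a port that was DOWN and now reports INIT or better means a
 * link has appeared, so the LFTs already in the switches should not be reused.
 */
static void process_switch_port(struct subnet *subn,
				enum port_state old_state,
				enum port_state new_state)
{
	assert(subn != NULL);

	if (new_state == PORT_DOWN)
		return;	/* no link, nothing to rebalance */

	if (old_state == PORT_DOWN)
		subn->ignore_existing_lfts = true;	/* force routing recalculation */
}
```

In this sketch a port that stays ACTIVE leaves the flag untouched, while a DOWN to INIT transition sets it.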
> My original assumption was that the remote side for some ports wasn't
> known b/c the remote side ports were down. Is it possible for opensm to
> not know about a remote side even if that remote side port is up/active?
I think yes, some ports could be DOWN during initial discovery and become
INIT later, during LID assignment and/or link state setup. Normally (as in
your scenario) the next light sweep catches this and forces a heavy sweep.
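As an illustration only, the light sweep decision can be sketched in C roughly like this. The sampled_port struct and light_sweep_needs_heavy_sweep() are hypothetical names, not opensm API, and the sketch assumes the sampling list holds the ports whose remote side was unknown after discovery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Logical port states, numbered as in the InfiniBand PortInfo:PortState field. */
enum port_state { PORT_DOWN = 1, PORT_INIT = 2, PORT_ARMED = 3, PORT_ACTIVE = 4 };

/* Hypothetical entry of the light sweep sampling list. */
struct sampled_port {
	enum port_state state;	/* state reported by the latest PortInfo query */
	bool remote_known;	/* was a remote side recorded during discovery? */
};

/*
 * Return true when a heavy sweep should be forced.
 * Assumption: a port with an unknown remote side that is no longer DOWN
 * has a link the SM has not discovered yet.
 */
static bool light_sweep_needs_heavy_sweep(const struct sampled_port *ports,
					  size_t n)
{
	assert(ports != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!ports[i].remote_known && ports[i].state != PORT_DOWN)
			return true;
	}
	return false;
}
```

A port that was DOWN during discovery and is still DOWN does not trigger anything here; once it reports INIT, the function returns true.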
Sasha