[ofa-general] Re: [PATCH] opensm: enforce routing paths rebalancing on switch reconnection

Sasha Khapyorsky sashak at voltaire.com
Wed Mar 5 10:22:12 PST 2008


On 09:10 Wed 05 Mar     , Al Chu wrote:
> 
> I can't restart opensm on that cluster at this time.  I don't recall any
> port errors.  However, I do recall seeing this output from
> __osm_state_mgr_light_sweep_start():
> 
> OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>         "ERR 0108: "
>         "Unknown remote side for node 0x%016"
>         PRIx64
>         "(%s) port %u. Adding to light sweep sampling list\n",
>         cl_ntoh64(osm_node_get_node_guid
>                   (p_node)),
>         p_node->print_desc, port_num);
> 
> leading to a call to __osm_state_mgr_get_remote_port_info(), leading to
> what I fixed in osm_pi_rcv_process().

Yes, this is a valid (handled) scenario.

What I cannot understand is why it doesn't reach
__osm_pi_rcv_process_switch_port() (where the ignore_existing_lfts flag
should be enforced in accordance with the port state) after querying
ports with "unknown" remotes during a light sweep.
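
To make the expectation concrete, here is a minimal sketch of the rule I
have in mind. It is not the actual OpenSM code: the struct, the enum and
process_switch_port() are hypothetical stand-ins, and the assumed rule is
that a switch port which was DOWN and is now anything else should cause
the existing LFTs to be ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Link states, numbered as in the InfiniBand PortInfo PortState field. */
enum link_state {
    LINK_DOWN = 1,
    LINK_INIT = 2,
    LINK_ARMED = 3,
    LINK_ACTIVE = 4
};

/* Hypothetical stand-in for the subnet-wide flag discussed above. */
struct subnet_model {
    bool ignore_existing_lfts;
};

/*
 * Assumed rule (illustration only): when a switch port that was DOWN
 * is reported in any other state, the fabric has changed, so the
 * existing LFTs should be ignored and routing recalculated.
 */
void process_switch_port(struct subnet_model *subn,
                         enum link_state old_state,
                         enum link_state new_state)
{
    assert(subn != NULL);
    if (old_state == LINK_DOWN && new_state != LINK_DOWN)
        subn->ignore_existing_lfts = true;
}
```

With this model, a DOWN -> INIT transition sets the flag, while a port
that stays ACTIVE leaves it untouched.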

I did some experiments with ibsim and still have not been able to
reproduce this. I'm afraid there could be some hidden bug which I have
not been able to catch yet.

> My original assumption was that the remote side for some ports wasn't
> known b/c the remote side ports were down.  Is it possible for opensm to
> not know about a remote side even if that remote side port is up/active?

I think yes, some ports could be DOWN during initial discovery and become
INIT later, during LID assignment and/or link state setup. Normally (as in
your scenario) the next light sweep catches this and forces a heavy sweep.
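
The escalation described above can be sketched roughly as follows. This
is a simplified model, not the real state manager: struct sampled_port
and light_sweep_needs_heavy() are hypothetical names, and the only
assumption encoded is that a port with an unknown remote side which is
no longer DOWN is enough to request a heavy sweep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Port states, numbered as in the InfiniBand PortInfo PortState field. */
enum port_state {
    PORT_DOWN = 1,
    PORT_INIT = 2,
    PORT_ARMED = 3,
    PORT_ACTIVE = 4
};

/* Hypothetical record for a port sampled during a light sweep. */
struct sampled_port {
    unsigned port_num;
    bool remote_known;      /* false: the SM has no remote side recorded */
    enum port_state state;  /* state from the latest PortInfo query */
};

/*
 * Returns true when the light sweep should escalate to a heavy sweep:
 * some port without a known remote side is reported as not DOWN, so
 * there is a link the SM has not discovered yet.
 */
bool light_sweep_needs_heavy(const struct sampled_port *ports, size_t n)
{
    assert(ports != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!ports[i].remote_known && ports[i].state != PORT_DOWN)
            return true;
    }
    return false;
}
```

A port that is still DOWN with an unknown remote is simply sampled again
on the next light sweep in this model.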

Sasha


