[ofa-general] Re: [PATCH] opensm: enforce routing paths rebalancing on switch reconnection

Al Chu chu11 at llnl.gov
Wed Mar 5 11:20:12 PST 2008


On Wed, 2008-03-05 at 18:22 +0000, Sasha Khapyorsky wrote:
> On 09:10 Wed 05 Mar     , Al Chu wrote:
> > 
> > I can't restart opensm on that cluster at this time.  I don't recall any
> > port errors.  However, I do recall seeing this output from
> > __osm_state_mgr_light_sweep_start():
> > 
> > OSM_LOG(sm->p_log, OSM_LOG_ERROR,
> >         "ERR 0108: "
> >         "Unknown remote side for node 0x%016"
> >         PRIx64
> >         "(%s) port %u. Adding to light sweep sampling list\n",
> >         cl_ntoh64(osm_node_get_node_guid
> >                   (p_node)),
> >         p_node->print_desc, port_num);
> > 
> > leading to a call to __osm_state_mgr_get_remote_port_info(), leading to
> > what I fixed in osm_pi_rcv_process().
> 
> Yes, this is a valid (handled) scenario.
> 
> What I cannot understand is why it doesn't reach
> __osm_pi_rcv_process_switch_port() (where the ignore_existing_lfts flag
> should be enforced in accordance with port state) after querying ports
> with "unknown" remotes during a light sweep.
>
> I did some experiments with ibsim and am still not able to reproduce
> this. I'm afraid there could be some hidden bug which I'm not able to
> catch yet.
> 
> > My original assumption was that the remote side for some ports wasn't
> > known b/c the remote side ports were down.  Is it possible for opensm to
> > not know about a remote side even if that remote side port is up/active?
> 
> I think yes, some ports could be DOWN during initial discovery and become
> INIT later during LID assignment and/or link state setup. Normally (as in
> your scenario) the next light sweep catches this and enforces a heavy sweep.

Perhaps it does "reach __osm_pi_rcv_process_switch_port", but the
need_update flag is just not set?  Is it possible for those remote side
ports to be at ARMED or ACTIVE before the 2nd heavy sweep?  If so, then
that remote side port would have its need_update flag cleared, and
thus ignore_existing_lfts wouldn't be set in
__osm_pi_rcv_process_switch_port().
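
To make the scenario concrete, here is a minimal sketch of the ordering I
have in mind.  It is only a toy model, not the opensm code: every name in
it (model_port, model_subn, model_replay, the LINK_* values) is made up,
and the two rules it encodes (need_update is set when a port is seen in
INIT and cleared when it is seen at ARMED or ACTIVE, and
ignore_existing_lfts is set only while need_update is set) are my
assumptions about the behaviour, not something I have verified.

```c
#include <stdbool.h>

/* Hypothetical link states, ordered like the IB port states. */
enum link_state { LINK_DOWN = 1, LINK_INIT, LINK_ARMED, LINK_ACTIVE };

/* Hypothetical stand-ins for the physical port and subnet objects. */
struct model_port {
	enum link_state state;
	bool need_update;
};

struct model_subn {
	bool ignore_existing_lfts;
};

/*
 * Assumed rule: a port observed in INIT is marked need_update; a port
 * observed at ARMED or ACTIVE has the flag cleared; DOWN leaves it alone.
 */
static void model_port_info_received(struct model_port *p, enum link_state s)
{
	p->state = s;
	if (s == LINK_INIT)
		p->need_update = true;
	else if (s > LINK_INIT)
		p->need_update = false;
}

/* Assumed rule: rebalancing is forced only if need_update is still set. */
static void model_process_switch_port(struct model_subn *subn,
				      const struct model_port *p)
{
	if (p->need_update)
		subn->ignore_existing_lfts = true;
}

/*
 * Replay the states the SM observes for one port, in order, and report
 * whether ignore_existing_lfts ended up set.
 */
static bool model_replay(const enum link_state *seq, int n)
{
	struct model_port port = { LINK_DOWN, false };
	struct model_subn subn = { false };

	for (int i = 0; i < n; i++) {
		model_port_info_received(&port, seq[i]);
		model_process_switch_port(&subn, &port);
	}
	return subn.ignore_existing_lfts;
}
```

Under these assumptions, a port observed as DOWN and then INIT forces the
rebalancing, while a port observed as DOWN and then already ACTIVE does
not, because the SM never sees it in INIT.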

Al

> Sasha
-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


