[ofa-general] Re: [OpenSM] updn routing performance fix???
Albert Chu
chu11 at llnl.gov
Fri Feb 29 09:58:23 PST 2008
Hey Sasha,
> Why not keep an is_bad flag on osm_physp_t itself - it would save some
> comparison loops?
Oh, that's a lot simpler :-)
> Here I added an 'ignore_existing_lfts' flag per switch too. What do you
> think?
What you're trying to do is set "ignore_existing_lfts" when the port
trap is received, rather than later on during routing? Logically it looks
fine. I initially tried to make a fix from the "trap side" instead of the
"routing side" too, but I didn't see a clean way to do it (obviously I
don't know the code as well as you do). I'll try it out when I get a
chance.
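For reference, the rough idea of my patch was the following. This is just
a stand-alone sketch of the approach, not the actual attached code - names
like 'was_bad' and 'port_up' are illustrative:

#include <stdint.h>

#define MAX_PORTS 36

struct sw_state {
        unsigned num_ports;
        uint8_t port_up[MAX_PORTS + 1];   /* port state this sweep */
        uint8_t was_bad[MAX_PORTS + 1];   /* saved at last configuration */
        unsigned ignore_existing_lfts;    /* per-switch flag */
};

/* after configuring a switch, remember which ports were down */
static void save_bad_ports(struct sw_state *sw)
{
        unsigned i;
        for (i = 1; i <= sw->num_ports; i++)
                sw->was_bad[i] = !sw->port_up[i];
}

/* on the next heavy resweep: if a previously "bad" port is now up,
 * rebuild this switch's forwarding table from scratch */
static void check_bad_ports(struct sw_state *sw)
{
        unsigned i;
        for (i = 1; i <= sw->num_ports; i++)
                if (sw->was_bad[i] && sw->port_up[i]) {
                        sw->ignore_existing_lfts = 1;
                        break;
                }
}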
(FYI, I noticed
+ if (p_physp->need_update)
should probably be:
+ if (p_physp->need_update && p_node->sw)
given the code a few lines above?
)
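In context the hunk would then look something like this (untested):

        if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw)
                p_node->sw->need_update = 0;

        /* guard against non-switch nodes, as in the check just above */
        if (p_physp->need_update && p_node->sw)
                p_node->sw->ignore_existing_lfts = 1;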
> Regardless of this, it could also be useful to add a console command to
> set p_subn->ignore_existing_lfts manually.
Yeah, like you said above, this would especially be needed when a new
switch is added to the network. I'll work with Ira on this.
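Something like this, maybe - a hypothetical handler just to sketch the
idea; the real command table and token parsing live in osm_console.c and
may differ:

static void ignore_lfts_parse(char **p_last, osm_opensm_t *p_osm,
                              FILE *out)
{
        char *p_cmd = next_token(p_last);       /* "on" or "off" */

        if (p_cmd && !strcmp(p_cmd, "on"))
                p_osm->subn.ignore_existing_lfts = TRUE;
        else if (p_cmd && !strcmp(p_cmd, "off"))
                p_osm->subn.ignore_existing_lfts = FALSE;
        else
                fprintf(out, "usage: ignore_lfts [on|off]\n");
}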
> Hmm, interesting... Are you running mpibench during the heavy sweep? If
> so, could the degradation be due to path migration and potential packet
> drops?
Afraid not, it was after the heavy sweeps. I ran opensm in the foreground
and saw nothing going on besides the occasional light sweep.
I've seen similar "inconsistencies" in performance when I've run ~120 node
jobs on this cluster, so I personally think the variations are due to
randomness in which nodes get selected. I don't think anything can be
definitive until a 140+ node job is run (which I don't know if I can
do :-().
Thanks,
Al
> Hi Al,
>
> On 20:17 Thu 28 Feb, Albert Chu wrote:
>>
>> After some investigation, I found out that after the initial heavy sweep
>> is done, some of the ports on some switches are down (I assume hardware
>> racing during bringup), and thus opensm does not route through those
>> ports. When opensm does a heavy resweep later on (I assume b/c some
>> traps
>> are received when those down ports come up), opensm keeps the same old
>> forwarding tables from before b/c ignore_existing_lfts is FALSE and b/c
>> the least hops are the same (other ports on the switch go to the same
>> parent). Thus, we get healthy ports not forwarding to a parent switch.
>
> I see the problem. Actually I think it is even worse - for example, if a
> new switch (or switches) is connected to the fabric, routing will not be
> rebalanced on the existing ones.
>
>> There are multiple ways to deal with this. I made the attached patch,
>> which solved the problem on one of our test clusters. It's pretty
>> simple: store all of the "bad ports" that were found during a switch
>> configuration. During the next heavy resweep, if some of those "bad
>> ports" are now up, I set ignore_existing_lfts to TRUE for just that
>> switch, leading to a completely new forwarding table for that switch.
>
> Why not keep an is_bad flag on osm_physp_t itself - it would save some
> comparison loops?
>
> Hmm, thinking more about this - currently we are tracking port state
> transitions to INIT during subnet discovery, in order to keep port tables
> up to date. I think it could be used for the 'ignore_existing_lfts' update
> as well. Something like this (not tested):
>
> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> index e2fe86d..567ff6f 100644
> --- a/opensm/include/opensm/osm_switch.h
> +++ b/opensm/include/opensm/osm_switch.h
> @@ -110,6 +110,7 @@ typedef struct _osm_switch {
>         osm_mcast_tbl_t mcast_tbl;
>         uint32_t discovery_count;
>         unsigned need_update;
> +       unsigned ignore_existing_lfts;
>         void *priv;
> } osm_switch_t;
> /*
> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
> index ecac2a8..a1b547e 100644
> --- a/opensm/opensm/osm_port_info_rcv.c
> +++ b/opensm/opensm/osm_port_info_rcv.c
> @@ -316,6 +316,9 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm,
>
>         if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw)
>                 p_node->sw->need_update = 0;
> +
> +       if (p_physp->need_update)
> +               p_node->sw->ignore_existing_lfts = 1;
>
>         if (port_num == 0)
>                 pi_rcv_check_and_fix_lid(sm->p_log, p_pi, p_physp);
> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
> index 38b2c4e..dec1d0a 100644
> --- a/opensm/opensm/osm_state_mgr.c
> +++ b/opensm/opensm/osm_state_mgr.c
> @@ -148,6 +148,7 @@ __osm_state_mgr_reset_switch_count(IN cl_map_item_t * const p_map_item,
>
>         p_sw->discovery_count = 0;
>         p_sw->need_update = 1;
> +       p_sw->ignore_existing_lfts = 0;
> }
>
> /**********************************************************************
> diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
> index d74cb6c..67223e5 100644
> --- a/opensm/opensm/osm_switch.c
> +++ b/opensm/opensm/osm_switch.c
> @@ -101,6 +101,7 @@ osm_switch_init(IN osm_switch_t * const p_sw,
>         p_sw->switch_info = *p_si;
>         p_sw->num_ports = num_ports;
>         p_sw->need_update = 1;
> +       p_sw->ignore_existing_lfts = 1;
>
>         status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si);
>         if (status != IB_SUCCESS)
> @@ -303,7 +304,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
>            3. the physical port has a remote port (the link is up)
>            4. the port has min-hops to the target (avoid loops)
>          */
> -       if (!ignore_existing) {
> +       if (!ignore_existing && !p_sw->ignore_existing_lfts) {
>                 port_num = osm_fwd_tbl_get(&p_sw->fwd_tbl, lid_ho);
>
>                 if (port_num != OSM_NO_PATH) {
>
>
> Here I added an 'ignore_existing_lfts' flag per switch too. What do you
> think?
>
> Regardless of this, it could also be useful to add a console command to
> set p_subn->ignore_existing_lfts manually.
>
>> During my performance testing of this patch, performance with a few
>> mpibench tests is actually worse by a few percent. I am only using 120
>> of 144 nodes on this cluster. It's not a big cluster; it has two levels
>> worth of switches (24-port switches going up to a 288-port switch -
>> yup, the cluster is not "filled out" yet :-)). So there is some
>> randomness in which specific nodes run the job and in whether the lid
>> routing layout is better/worse for that specific set of nodes.
>>
>> Intuitively, we think this will be better as a whole even though my
>> current testing can't show it. Can you think of anything that would
>> make
>> this patch worse for performance as a whole? Could you see some side
>> effect leading to a lot more traffic on the network?
>
> Hmm, interesting... Are you running mpibench during the heavy sweep? If
> so, could the degradation be due to path migration and potential packet
> drops?
>
> Sasha
>
--
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory