[ofa-general] [PATCH] opensm: Parallelize (Stripe) LFT sets across switches

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Wed Jul 22 13:48:46 PDT 2009


On Wed, Jul 22, 2009 at 03:40:50PM -0400, Hal Rosenstock wrote:

> > Doing this without also using LID routing to the target switch is just
> > going to overload the SMAs in the intermediate switches with too many
> > DR SMPs.
> 
> The "processing" time of LR (LID routing) v. DR forwarding (direct
> routed) v. set/get of a forwarding table block is implementation
> dependent. The dominant factor is the block set/get rather than
> whether it is DR or LR forwarded.

I would be very surprised if any implementation had a significant
overhead for the actual set operation compared to the packet handling
path. Certainly in our products the incremental cost of a set vs
processing a DR is negligible. I expect similar results from any
switch. As soon as an SMP goes into the CPU for DR or other processing
there is an enormous hit.

IIRC, when I last looked it was reasonable to expect about 1us per hop
for DR vs about 100ns per hop for LID. If you are 5 switches deep, that
is 5us vs 500ns!!
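
A rough sketch of that arithmetic, taking the per-hop figures above as
recalled ballpark numbers rather than measurements. The function and
constant names are made up for illustration:

```python
# Per-hop forwarding costs as recalled above (ballpark, not measured).
DR_NS_PER_HOP = 1000   # ~1us per hop for a direct-routed SMP
LID_NS_PER_HOP = 100   # ~100ns per hop for a LID-routed SMP


def one_way_forwarding_ns(hops, ns_per_hop):
    """Forwarding delay across `hops` switches, ignoring the set/get itself."""
    return hops * ns_per_hop


hops = 5
dr_ns = one_way_forwarding_ns(hops, DR_NS_PER_HOP)
lid_ns = one_way_forwarding_ns(hops, LID_NS_PER_HOP)
print(f"{hops} hops: DR {dr_ns} ns vs LID {lid_ns} ns")
```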

> The proposed algorithm reduces the potential VL15 overload on
> intermediate switches relative to the current algorithm for two
> reasons: the SMPs are spread across the switches first rather than
> blasting each switch in turn and there is a limit on the number of
> SMPs per node (not as good but less expensive than the concurrency
> limit you propose).

But you overload the switch the SM is connected to with processing
N*limit DR SMPs rather than just 'limit' SMPs. That is what concerns
me.

> >  1) Within the BFS consider things as layers (number of hops from the
> >     SM). All switches in a layer can proceed largely in parallel,
> >     within the capability of the SM to capture the MAD replies at
> >     least. Advancing to the next layer must wait until the prior
> >     layer is fully programmed.
> 
> The premise of this patch is to spread the blocks across the switches
> first rather than populate an individual switch entirely before
> proceeding with the next. This is due to the handling of the blocks
> being significantly more time consuming than any of the forwarding. I
> think that principle should apply to this algorithm as well.

Same as what I propose: an entire level can be done in parallel within
the two concurrency limits.

The reason for the BFS is to enable LID routed packets which I view as
the biggest gain. You have to progress outward building up paths back
to the SM to do this.

Randomly programming all switches does not enable LID routing.

[Though do note the BFS here is a special kind that traverses the
 LID routing graph for the SM LID to guarantee the above. A LID route
 that is not shortest-path will not work with a straight topological
 BFS]
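
To make the distinction concrete, here is a minimal sketch of layering
by the routing graph instead of by topology. It assumes a hypothetical
map `next_hop_to_sm` that gives, for each switch, the neighbour its LFT
entry for the SM LID points at. This is not OpenSM code and all names
are invented:

```python
from collections import defaultdict


def routing_layers(next_hop_to_sm, root):
    """Group switches into layers along the SM-LID routing tree.

    next_hop_to_sm: {switch: neighbour its SM-LID route forwards to}
                    (hypothetical input, derived from the computed LFTs).
    root: the switch the SM is attached to.
    """
    # Invert the routes: a switch's children are those that route via it.
    children = defaultdict(list)
    for sw, parent in next_hop_to_sm.items():
        children[parent].append(sw)

    layers, frontier, seen = [], [root], {root}
    while frontier:
        layers.append(sorted(frontier))
        nxt = []
        for sw in frontier:
            for child in children[sw]:
                if child not in seen:
                    seen.add(child)
                    nxt.append(child)
        frontier = nxt
    return layers


# sw4 may be cabled next to sw0, but its SM-LID route goes via sw3, so it
# can only be reached by LID after sw3 has been programmed.
routes = {"sw1": "sw0", "sw2": "sw0", "sw3": "sw1", "sw4": "sw3"}
print(routing_layers(routes, "sw0"))
```

A switch lands in the layer after the one holding its routing parent, so
every switch on its return path to the SM is programmed before it is.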

> >  2) At each switch send at most two partial DR MADs - where the MAD is
> >     LID routed up to the parent switch in the BFS, and then direct routed
> >     one step to the target switch. The two MADs would program the LFT
> >     block for the SM LID and for the switch LID.
> 
> Combined LR/DR routing is not a good idea IMO. Some switches don't support
> this although a requirement. Full DR routing could be used here
> rather

I first implemented an algorithm like this for switches based on Gamla
chips, and then for Anafa. If something doesn't support it, it is
very uncommon.

Thus, a simple global off switch for parallelism seems reasonable if
you think there is a worry. Frankly, I'm not very concerned about such
significant non-compliance.

> than the combined DR routing although it would be less efficient in
> terms of forwarding compared with the combined DR (with LR direct to
> the previous level of switch).

The problem with this is it makes managing the concurrency a lot
harder, and you hit a bottleneck sooner, since you now have to track
every step along the DR path, not just the parent.
 
> >  3) Sent LID routed mads directly to the target switch to fill in the
> >     rest of the LFT entries.
> 
> These should be striped across the switches at a given "level".

Yes, I imagine this algorithm running in parallel for all switches at
a level.
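
Roughly, the send order within one level could look like the sketch
below: block 0 to every switch, then block 1, and so on, rather than
finishing one switch before starting the next. The function name and
the `blocks_per_switch` input are hypothetical, and concurrency limits
are left out here:

```python
def striped_order(switches, blocks_per_switch):
    """Return (switch, block) pairs striped across the switches of a level.

    switches: switches at one level, in any fixed order.
    blocks_per_switch: {switch: number of LFT blocks still to be set}.
    """
    order = []
    max_blocks = max((blocks_per_switch[sw] for sw in switches), default=0)
    for block in range(max_blocks):
        for sw in switches:
            if block < blocks_per_switch[sw]:
                order.append((sw, block))
    return order


print(striped_order(["a", "b"], {"a": 2, "b": 3}))
```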

> > Step 2 needs to respect the concurrency limit for the parent switch,
> > and #3 needs to respect the concurrency limit for the target switch.
> 
> This is the harder part and also more expensive in terms of
> computation. This limit might also be overly conservative.

I don't see how it is more computational: you know the parent switch
because you are computing a DR path to it. A simple per-switch counter
is all that is needed.
 
> Also, IMO there would be one configuration item for this limit rather
> than a per switch configuration.

Yes, two configurables would be excellent. Something like 20 for DR
and 4 for Get/Set sounds reasonable to me.
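
As a sketch of how little bookkeeping this needs: one counter per
switch for each of the two limits. The class and method names are
invented, and the defaults are just the guesses above:

```python
from collections import Counter


class SmpLimiter:
    """Per-switch outstanding-SMP counters for two limits (hypothetical)."""

    def __init__(self, dr_limit=20, set_limit=4):
        self.dr_limit = dr_limit      # DR SMPs in flight per parent switch
        self.set_limit = set_limit    # Get/Set SMPs in flight per target
        self.dr_inflight = Counter()
        self.set_inflight = Counter()

    def try_send_partial_dr(self, parent):
        """Step 2: charge the parent switch that does the final DR hop."""
        if self.dr_inflight[parent] >= self.dr_limit:
            return False
        self.dr_inflight[parent] += 1
        return True

    def try_send_lid_set(self, target):
        """Step 3: charge the target switch that processes the Set."""
        if self.set_inflight[target] >= self.set_limit:
            return False
        self.set_inflight[target] += 1
        return True

    def dr_reply(self, parent):
        """A partial-DR reply (or timeout) frees a slot on the parent."""
        self.dr_inflight[parent] -= 1

    def set_reply(self, target):
        """A LID-routed reply (or timeout) frees a slot on the target."""
        self.set_inflight[target] -= 1
```

The sender calls a `try_send_*` method before each MAD and queues the
MAD if it returns False. Each reply or timeout releases one slot.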

> > Eliminating DR hops will significantly improve MAD round trip time
> > and give you more possible parallelism before the SMAs in intermediate
> > switches become overloaded.
> 
> I can see the MAD round trip time improvement based on the virtual
> reduction in number of forwarding hops but I don't see the increase in
> parallelism. SMPs are not flow controlled so I would think the
> parallelism is the same except that the transaction rate is somewhat
> higher using LR rather than DR SMPs.

In real switch implementations today, the DR path has a much lower
forwarding rate than the LID path. The LID path is wire speed, DR is
something like 4-20 packets in a burst at best - due to the CPU
involvement. If you ask 1000 switches to do a set via DR then the
switch the SM is hooked up to will probably start dropping. If you do
the same via LID, then no problems.

This is not any sort of IBA requirement, just a reflection of the
reality of all current implementations.

So, LID routing offloads the SMP processing in the switches closest to
the SM and lets you push more SMPs through them. You go from, say, a
global 20 parallel DR SMP limit at the SM's switch to a
20*N_Switches_At_Level limit, which is much higher.
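
Spelled out as a back-of-envelope calculation. The limit of 20 is the
example figure above, the 36 switches are an arbitrary illustration,
and the function name is made up:

```python
def parallel_smp_budget(per_switch_dr_limit, switches_at_level, lid_routed):
    """SMPs that can be in flight before any one switch hits its DR limit."""
    if lid_routed:
        # The DR work is spread over the switches at the level; the
        # switches nearer the SM only forward by LID.
        return per_switch_dr_limit * switches_at_level
    # Full DR: every SMP is DR-processed by the SM's own switch.
    return per_switch_dr_limit


print(parallel_smp_budget(20, 36, lid_routed=False))
print(parallel_smp_budget(20, 36, lid_routed=True))
```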

> > into the above process and then all other switch programming MADs can
> > simply happen via LID routed packets. 
> Sure; it could be done for many other SMPs but the most important
> thing in speeding up config time are the LFT/MFT block sets.

Once you get LID routing set up there is no reason to keep using DR
after that point. For instance, just converting MFT programming to use
LID only would probably result in noticeable gains on big fabrics.

Jason


