[ofa-general] [PATCH] opensm: Parallelize (Stripe) LFT sets across switches

Hal Rosenstock hal.rosenstock at gmail.com
Wed Jul 22 12:40:50 PDT 2009


On Tue, Jul 21, 2009 at 2:44 PM, Jason Gunthorpe
<jgunthorpe at obsidianresearch.com> wrote:
> On Tue, Jul 21, 2009 at 02:03:12PM -0400, Hal Rosenstock wrote:
>
>> Currently, MADs are pipelined to a single switch at a time which
>> effectively serializes these requests due to processing at the SMA.
>> This patch pipelines (stripes) them across the switches first before
>> proceeding with successive blocks. As a result of this striping,
>> multiple switches can process the set and respond concurrently
>> which results in an improvement to the subnet initialization time.
>
> Doing this without also using LID routing to the target switch is just
> going to overload the SMAs in the intermediate switches with too many
> DR SMPs.

The relative "processing" time of LR (LID routed) forwarding, DR (direct
routed) forwarding, and the set/get of a forwarding table block is
implementation dependent. The dominant factor is the block set/get
rather than whether the SMP is DR or LR forwarded.

The proposed algorithm reduces the potential VL15 overload on
intermediate switches relative to the current algorithm for two
reasons: the SMPs are spread across the switches first rather than
blasting each switch in turn, and there is a limit on the number of
SMPs per node (not as good as, but less expensive than, the concurrency
limit you propose).
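
To make the ordering difference concrete, here is a minimal sketch in
Python (not the OpenSM code; the function names and the
blocks-per-switch dictionary are made up for illustration). It only
shows the order in which LFT block sets would be issued, not the MAD
handling itself:

```python
def serial_order(blocks_per_switch):
    """Switch-major order: every block for one switch, then the next switch."""
    return [(sw, blk)
            for sw, n in blocks_per_switch.items()
            for blk in range(n)]


def striped_order(blocks_per_switch):
    """Block-major order: block 0 to every switch, then block 1, and so on."""
    max_blocks = max(blocks_per_switch.values(), default=0)
    return [(sw, blk)
            for blk in range(max_blocks)
            for sw, n in blocks_per_switch.items()
            if blk < n]


if __name__ == "__main__":
    # Hypothetical fabric: three switches needing 3, 2 and 3 LFT blocks.
    blocks = {"sw1": 3, "sw2": 2, "sw3": 3}
    print("serial :", serial_order(blocks))
    print("striped:", striped_order(blocks))
```

In the striped order, consecutive SMPs go to different switches, so
each SMA can work on its block while the others do the same.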

> The most efficient approach is to program LFTs using a breadth first
> search of the connectivity graph starting from the SM end port:

A BFS ordered switch list is pretty straightforward.
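
As a rough sketch of what that could look like (Python, purely
illustrative; the adjacency dictionary and node names are assumptions,
not OpenSM data structures), grouping the nodes into layers by hop
count from the SM:

```python
from collections import deque


def bfs_layers(adjacency, sm_node):
    """Group nodes by hop count from sm_node; layers[i] holds nodes i hops away."""
    hops = {sm_node: 0}
    queue = deque([sm_node])
    layers = [[sm_node]]
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                # Open a new layer the first time a node at this depth is seen.
                if hops[neighbor] == len(layers):
                    layers.append([])
                layers[hops[neighbor]].append(neighbor)
                queue.append(neighbor)
    return layers


if __name__ == "__main__":
    # Hypothetical fabric: the SM port hangs off s1; s4 is reachable via s2 or s3.
    fabric = {
        "sm": ["s1"],
        "s1": ["sm", "s2", "s3"],
        "s2": ["s1", "s4"],
        "s3": ["s1", "s4"],
        "s4": ["s2", "s3"],
    }
    print(bfs_layers(fabric, "sm"))
```

Flattening the layers in order gives the BFS ordered switch list.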

>  1) Within the BFS consider things as layers (number of hops from the
>    SM). All switches in a layer can proceed largely in parallel,
>    within the capability of the SM to capture the MAD replies at
>    least. Advancing to the next layer must wait until the prior
>    layer is fully programmed.

The premise of this patch is to spread the blocks across the switches
first rather than populate an individual switch entirely before
proceeding with the next. This is due to the handling of the blocks
being significantly more time consuming than any of the forwarding. I
think that principle should apply to this algorithm as well.

>  2) At each switch send at most two partial DR MADs - where the MAD is
>    LID routed up to the parent switch in the BFS, and then direct routed
>    one step to the target switch. The two MADs would program the LFT
>    block for the SM LID and for the switch LID.

Combined LR/DR routing is not a good idea IMO. Some switches don't
support it even though it is a requirement. Full DR routing could be
used here instead of the combined LR/DR routing, although it would be
less efficient in terms of forwarding than the combined approach (with
LR direct to the previous level of switch).

>  3) Send LID routed MADs directly to the target switch to fill in the
>    rest of the LFT entries.

These should be striped across the switches at a given "level".

> Step 2 needs to respect the concurrency limit for the parent switch,
> and #3 needs to respect the concurrency limit for the target switch.

This is the harder part and also more expensive in terms of
computation. This limit might also be overly conservative.

Also, IMO there would be one configuration item for this limit rather
than a per switch configuration.
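
A simplified sketch of how one global limit could be combined with the
striping (Python, illustrative only; the names are hypothetical, and
it assumes every SMP in a round is answered before the next round
starts, which a real SM would not need to wait for):

```python
from collections import deque


def schedule(blocks_per_switch, max_per_switch):
    """Return rounds of (switch, block) sends.

    No switch receives more than max_per_switch SMPs in a single round.
    The same limit applies to every switch (one configuration item).
    """
    if max_per_switch < 1:
        raise ValueError("max_per_switch must be at least 1")
    pending = {sw: deque(range(n)) for sw, n in blocks_per_switch.items()}
    rounds = []
    while any(pending.values()):
        this_round = []
        # Stripe across the switches, up to the limit for each switch.
        for _ in range(max_per_switch):
            for sw, queue in pending.items():
                if queue:
                    this_round.append((sw, queue.popleft()))
        rounds.append(this_round)
    return rounds


if __name__ == "__main__":
    # Hypothetical example: sw1 needs 3 blocks, sw2 needs 1, limit of 2.
    for i, sends in enumerate(schedule({"sw1": 3, "sw2": 1}, 2)):
        print("round", i, sends)
```

Tracking the limit per parent switch as well as per target switch, as
you describe, would need more bookkeeping than this.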

> As you go out the BFS there are more parent switches and target
> switches available to run in parallel.

> Eliminating DR hops will significantly improve MAD round trip time
> and give you more possible parallelism before the SMAs in intermediate
> switches become overloaded.

I can see the MAD round trip time improvement based on the virtual
reduction in number of forwarding hops but I don't see the increase in
parallelism. SMPs are not flow controlled so I would think the
parallelism is the same except that the transaction rate is somewhat
higher using LR rather than DR SMPs.

> Overall, optimizing things so that as few DR MADs are used as possible
> is desirable. The NodeInfo write to set the switch LID can also be put
                    ^^^^^^^^
                    PortInfo
> into the above process and then all other switch programming MADs can
> simply happen via LID routed packets.

Sure; it could be done for many other SMPs, but the most important
factor in speeding up config time is the LFT/MFT block sets.

-- Hal

> Jason
