[ofa-general] [PATCH] opensm: Parallelize (Stripe) LFT sets across switches

Hal Rosenstock hal.rosenstock at gmail.com
Wed Jul 22 17:28:25 PDT 2009


On Wed, Jul 22, 2009 at 4:48 PM, Jason
Gunthorpe<jgunthorpe at obsidianresearch.com> wrote:
> On Wed, Jul 22, 2009 at 03:40:50PM -0400, Hal Rosenstock wrote:
>
>> > Doing this without also using LID routing to the target switch is just
>> > going to overload the SMAs in the intermediate switches with too many
>> > DR SMPs.
>>
>> The "processing" time of LR (LID routing) v. DR forwarding (direct
>> routed) v. set/get of a forwarding table block is implementation
>> dependent. The dominant factor is the block set/get rather than
>> whether it is DR or LR forwarded.
>
> I would be very surprised if any implementation had a significant
> overhead for the actual set operation compared to the packet handling
> path. Certainly in our products the incremental cost of a set vs
> processing a DR is negligible. I expect similar results from any
> switch. As soon as a SMP goes into the CPU for DR or other processing
> there is an enormous hit.
>
> IIRC when I last looked, it is reasonable to expect a 1us per hop for
> DR vs a 100ns per hop for LID. If you are 5 switches deep that is 5us
> vs 500ns!!
>
>> The proposed algorithm reduces the potential VL15 overload on
>> intermediate switches relative to the current algorithm for two
>> reasons: the SMPs are spread across the switches first rather than
>> blasting each switch in turn and there is a limit on the number of
>> SMPs per node (not as good but less expensive than the concurrency
>> limit you propose).
>
> But you overload the switch the SM is connected to with processing
> N*limit DR SMPs rather than just 'limit' SMPs. That is what concerns
> me.

As I said, the current algorithm is worse: it sends N * (no limit) DR
SMPs, where "no limit" means as many blocks as each switch needs. I'm
not aware of VL15 drops having been traced to this. So I think this
improves on what has been deployed and has seemingly worked in OpenSM
for quite some time now.
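
Roughly, the striping idea looks like the following (a simplified sketch
only; struct switch_info, outstanding_smps(), send_lft_block() and
wait_for_some_completions() are stand-ins for the usual transaction
bookkeeping, not code from the actual patch):

/*
 * Walk the switches round-robin, sending each one its next LFT block,
 * and skip any switch that already has max_smps_per_node SMPs in
 * flight; it gets revisited on the next sweep.
 */
struct switch_info {
        unsigned next_block;            /* next LFT block to send */
        unsigned num_lft_blocks;        /* ceil((LinearFDBTop + 1) / 64) */
};

/* assumed wrappers around the request/transaction bookkeeping */
extern unsigned outstanding_smps(struct switch_info *sw);
extern void send_lft_block(struct switch_info *sw, unsigned block);
extern void wait_for_some_completions(void);

static void stripe_lft_sets(struct switch_info *sw, unsigned nsw,
                            unsigned max_smps_per_node)
{
        unsigned done, i;

        do {
                done = 0;
                for (i = 0; i < nsw; i++) {
                        if (sw[i].next_block >= sw[i].num_lft_blocks) {
                                done++;         /* fully queued */
                                continue;
                        }
                        if (outstanding_smps(&sw[i]) >= max_smps_per_node)
                                continue;       /* per node limit hit */
                        send_lft_block(&sw[i], sw[i].next_block++);
                }
                if (done < nsw)
                        wait_for_some_completions();    /* let counts drain */
        } while (done < nsw);
}

The point is simply that block N goes out to every eligible switch
before block N+1 does, and the per node cap bounds how many SMPs any
one SMA sees at once.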

>> >  1) Within the BFS consider things as layers (number of hops from the
>> >     SM). All switches in a layer can proceed largely in parallel,
>> >     within the capability of the SM to capture the MAD replies at
>> >     least. Advancing to the next layer must wait until the prior
>> >     layer is fully programmed.
>>
>> The premise of this patch is to spread the blocks across the switches
>> first rather than populate an individual switch entirely before
>> proceeding with the next. This is due to the handling of the blocks
>> being significantly more time consuming than any of the forwarding. I
>> think that principle should apply to this algorithm as well.
>
> As is what I propose, an entire level can be done in parallel within
> the two concurrency limits.
>
> The reason for the BFS is to enable LID routed packets which I view as
> the biggest gain. You have to progress outward building up paths back
> to the SM to do this.
>
> Randomly programming all switches does not enable LID routing.
>
> [Though do note the BFS here is a special kind that traverses the
>  LID routing graph for the SM LID to guarantee the above. A LID route
>  that is not shortest-path will not work with a straight topological
>  BFS]
>
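
For concreteness, that kind of level assignment might look roughly like
the following, over a hypothetical simplified switch record rather than
OpenSM's osm_switch_t; the only property it relies on is that a switch
is placed only after the next hop its LFT will use toward the SM's LID
has been placed:

#define MAX_PORTS 36

struct sw {
        struct sw *nbr[MAX_PORTS + 1];  /* remote switch per port, or NULL */
        unsigned port_to_sm;            /* egress port toward the SM's LID */
        int level;                      /* -1 until placed */
};

static int build_levels(struct sw **sws, int nsw, struct sw *root)
{
        int i, placed = 1, levels = 1, progress;

        for (i = 0; i < nsw; i++)
                sws[i]->level = -1;
        root->level = 0;                /* switch the SM hangs off */

        do {
                progress = 0;
                for (i = 0; i < nsw; i++) {
                        struct sw *s = sws[i];
                        struct sw *parent;

                        if (s->level >= 0)
                                continue;
                        parent = s->nbr[s->port_to_sm];
                        if (!parent || parent->level < 0)
                                continue;       /* return path not placed yet */
                        s->level = parent->level + 1;
                        if (s->level >= levels)
                                levels = s->level + 1;
                        placed++;
                        progress = 1;
                }
        } while (progress && placed < nsw);

        return levels;  /* program switches level by level, lowest first */
}

Since the level follows the routing rather than the topological hop
count, a non-shortest-path route back to the SM still ends up behind
the switches it depends on.
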
>> >  2) At each switch send at most two partial DR MADs - where the MAD is
>> >     LID routed up to the parent switch in the BFS, and then direct routed
>> >     one step to the target switch. The two MADs would program the LFT
>> >     block for the SM LID and for the switch LID.
>>
>> Combined LR/DR routing is not a good idea IMO. Some switches don't support
>> this even though it is a requirement. Full DR routing could be used here
>> rather
>
> I first implemented an algorithm like this for switches based on Gamla
> chips, and then for Anafa. If something doesn't support it, it is
> very uncommon.

I'm aware of at least two very different switches where this is the case.

> Thus, a simple global off switch for parallelism seems reasonable if
> you think there is a worry. Frankly, I'm not very concerned about such
> significant non-compliance.

There are no compliance tests for this so it's a relatively untested
feature (other than ibportstate, which uses it). Although you are not
concerned, I don't want to orphan those deployed switches, and
supporting multiple modes complicates matters somewhat.

>> than the combined LR/DR routing, although it would be less efficient in
>> terms of forwarding compared with the combined LR/DR (with LR direct to
>> the previous level of switch).
>
> The problem with this is it makes managing the concurrency a lot
> harder, and you hit a bottleneck sooner, since you now have to track
> every step along the DR path, not just the parent.

Understood; that's what I meant when I wrote below that it's harder
and more computationally expensive. I also think it's overly
pessimistic, so the limit might need to be set artificially higher,
based on experience that these SMPs can be pipelined quite a bit more
than such a limit would allow.

In short, the current algorithm is the most optimistic/aggressive, the
proposed one is less so, and yours is the most conservative/robust.

>> >  3) Send LID routed MADs directly to the target switch to fill in the
>> >     rest of the LFT entries.
>>
>> These should be striped across the switches at a given "level".
>
> Yes, I imagine this algorithm running in parallel for all switches at
> a level.
>
>> > Step 2 needs to respect the concurrency limit for the parent switch,
>> > and #3 needs to respect the concurrency limit for the target switch.
>>
>> This is the harder part and also more expensive in terms of
>> computation. This limit might also be overly conservative.
>
> I don't see how it is more computational, you know the parent switch
> because you are computing a DR path to it. A simple per-switch counter
> is all that is needed.

I was referring to the case where all hops are tracked not just the parent.
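
To illustrate the bookkeeping difference (hypothetical types, not
OpenSM structures): parent-only accounting checks one counter per SMP,
while per-hop accounting has to check, bump, and later decrement a
counter for every switch the SMP traverses:

struct sw_ctr {
        unsigned outstanding;   /* SMPs currently in flight through this switch */
};

struct smp_path {
        struct sw_ctr *parent;          /* DR parent of the target switch */
        struct sw_ctr *hop[64];         /* every switch the SMP passes through */
        unsigned hop_cnt;
};

static int can_send_parent_only(const struct smp_path *p, unsigned limit)
{
        return p->parent->outstanding < limit;
}

static int can_send_per_hop(const struct smp_path *p, unsigned limit)
{
        unsigned i;

        for (i = 0; i < p->hop_cnt; i++)
                if (p->hop[i]->outstanding >= limit)
                        return 0;       /* an intermediate switch is saturated */
        return 1;
}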

>> Also, IMO there would be one configuration item for this limit rather
>> than a per switch configuration.
>
> Yes, two configurables would be excellent. Something like 20 for DR
> and 4 for Get/Set sounds reasonable to me.
>
>> > Eliminating DR hops will significantly improve MAD round trip time
>> > and give you more possible parallelism before the SMAs in intermediate
>> > switches become overloaded.
>>
>> I can see the MAD round trip time improvement based on the virtual
>> reduction in number of forwarding hops but I don't see the increase in
>> parallelism. SMPs are not flow controlled so I would think the
>> parallelism is the same except that the transaction rate is somewhat
>> higher using LR rather than DR SMPs.
>
> In real switch implementations today, the DR path has a much lower
> forwarding rate than the LID path. The LID path is wire speed, DR is
> something like 4-20 packets in a burst at best - due to the CPU
> involvement. If you ask 1000 switches to do a set via DR then the
> switch the SM is hooked up to will probably start dropping. If you do
> the same via LID, then no problems.
>
> This is not any sort of IBA requirement, just a reflection of the
> reality of all current implementations.
>
> So, LID routing offloads the SMP processing in the switches closest to
> the SM and lets you push more SMPs through them. You go from, say, a
> global 20 parallel DR SMP limit at the SM's switch to a
> 20*N_Switches_At_Level limit, which is much higher.
>
>> > into the above process and then all other switch programming MADs can
>> > simply happen via LID routed packets.
>> Sure; it could be done for many other SMPs, but the most important
>> things in speeding up config time are the LFT/MFT block sets.
>
> Once you get LID routing set up there is no reason to keep using DR
> after that point. For instance, just converting MFT programming to use
> LID only would probably result in noticeable gains on big fabrics.

I was talking more about the LFT gain, which still depends on the size
of the network (number of end ports). The MFT gains are smaller and
depend on the number of multicast groups in use and whether or not
MLID overloading is being used.

-- Hal

> Jason
>


