[ofa-general] [PATCH] opensm: Parallelize (Stripe) LFT sets across switches

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Thu Jul 23 05:53:07 PDT 2009


Jason,

Jason Gunthorpe wrote:
> On Wed, Jul 22, 2009 at 03:40:50PM -0400, Hal Rosenstock wrote:
> 
>>> Doing this without also using LID routing to the target switch is just
>>> going to overload the SMAs in the intermediate switches with too many
>>> DR SMPs.
>> The "processing" time of LR (LID routing) v. DR forwarding (direct
>> routed) v. set/get of a forwarding table block is implementation
>> dependent. The dominant factor is the block set/get rather than
>> whether it is DR or LR forwarded.
> 
> I would be very surprised if any implementation had a significant
> overhead for the actual set operation compared to the packet handling
> path. Certainly in our products the incremental cost of a set vs
> processing a DR is negligible. I expect similar results from any
> switch. As soon as a SMP goes into the CPU for DR or other processing
> there is an enormous hit.

Whether DR packet forwarding is done in HW, FW or SW is
implementation dependent. I agree with you that if the same
entity handles both the LFT block set and the DR MAD forwarding,
then the difference between the two operations would not be very
big (having said that, I'm not sure it's negligible - again,
it's implementation dependent).

I don't know about all the IB switches, but in InfiniScale IV
(and any InfiniScale IV-based switch out there) DR packet
forwarding is done in HW, so it is significantly faster than
having the SMA process an LFT block set.

The same goes for any future IB switch that improves its
handling of DR MAD forwarding.

-- Yevgeny



> IIRC when I last looked, it is reasonable to expect a 1us per hop for
> DR vs a 100ns per hop for LID. If you are 5 switches deep that is 5us
> vs 500ns!!
> 
>> The proposed algorithm reduces the potential VL15 overload on
>> intermediate switches relative to the current algorithm for two
>> reasons: the SMPs are spread across the switches first rather than
>> blasting each switch in turn and there is a limit on the number of
>> SMPs per node (not as good but less expensive than the concurrency
>> limit you propose).
> 
> But you overload the switch the SM is connected to with processing
> N*limit DR SMPs rather than just 'limit' SMPs. That is what concerns
> me.
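
To keep the discussion concrete, this is roughly how I read the striping
in the patch - simplified, illustrative C; the names (sw_t,
send_lft_block_set() and so on) are invented for the example and are not
the actual OpenSM code:

#include <stdint.h>

#define MAX_SMPS_PER_SWITCH 4           /* per-node limit on outstanding SMPs */

typedef struct {
        int num_lft_blocks;             /* LFT blocks this switch needs */
        int outstanding_smps;           /* Sets sent but not yet acked  */
} sw_t;

/* stand-ins for the real MAD send / completion handling */
void send_lft_block_set(sw_t *sw, int block);
void wait_for_a_completion(void);       /* decrements outstanding_smps */

static void stripe_lft_blocks(sw_t *sw[], int num_switches, int max_blocks)
{
        /* outer loop over the block index, inner loop over switches:
         * every switch gets block N before any switch gets block N+1 */
        for (int block = 0; block < max_blocks; block++)
                for (int i = 0; i < num_switches; i++) {
                        if (block >= sw[i]->num_lft_blocks)
                                continue;
                        while (sw[i]->outstanding_smps >= MAX_SMPS_PER_SWITCH)
                                wait_for_a_completion();
                        send_lft_block_set(sw[i], block);  /* a DR SMP today */
                        sw[i]->outstanding_smps++;
                }
}
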
> 
>>>  1) Within the BFS consider things as layers (number of hops from the
>>>     SM). All switches in a layer can proceed largely in parallel,
>>>     within the capability of the SM to capture the MAD replies at
>>>     least. Advancing to the next layer must wait until the prior
>>>     layer is fully programmed.
>> The premise of this patch is to spread the blocks across the switches
>> first rather than populate an individual switch entirely before
>> proceeding with the next. This is due to the handling of the blocks
>> being significantly more time consuming than any of the forwarding. I
>> think that principle should apply to this algorithm as well.
> 
> As in what I propose, an entire level can be done in parallel within
> the two concurrency limits.
> 
> The reason for the BFS is to enable LID routed packets which I view as
> the biggest gain. You have to progress outward building up paths back
> to the SM to do this.
> 
> Randomly programming all switches does not enable LID routing.
> 
> [Though do note the BFS here is a special kind that traverses the
>  LID routing graph for the SM LID to guarantee the above. A LID route
>  that is not shortest-path will not work with a straight topological
>  BFS]
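
A sketch of the layered BFS idea, again in made-up C just to show its
shape - program_switch_lft(), next_layer() and friends are invented
names, not OpenSM functions:

/* Layered programming: a switch's "parent" is the switch its LFT entry
 * for the SM LID points at, so the BFS follows the LID routing graph
 * back to the SM rather than plain topology.  When a layer is reached,
 * everything between it and the SM already forwards the SM LID, and
 * LID-routed SMPs can be used from that point on. */

#define MAX_SWITCHES 4096

typedef struct sw sw_t;
typedef struct {
        sw_t *sw[MAX_SWITCHES];
        int count;
} layer_t;

void    program_switch_lft(sw_t *sw);    /* LFT Sets, within the limits   */
void    wait_for_layer(layer_t *layer);  /* all Sets in the layer acked   */
layer_t next_layer(layer_t *layer);      /* switches whose parent is here */

static void program_by_layers(layer_t first)
{
        layer_t layer = first;           /* layer 0: the SM's own switch */

        while (layer.count > 0) {
                /* all switches in a layer proceed in parallel */
                for (int i = 0; i < layer.count; i++)
                        program_switch_lft(layer.sw[i]);

                /* the next layer must wait: its LID path back to the SM
                 * goes through this layer's freshly written entries */
                wait_for_layer(&layer);
                layer = next_layer(&layer);
        }
}
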
> 
>>>  2) At each switch send at most two partial DR MADs - where the MAD is
>>>     LID routed up to the parent switch in the BFS, and then direct routed
>>>     one step to the target switch. The two MADs would program the LFT
>>>     block for the SM LID and for the switch LID.
>> Combined LR/DR routing is not a good idea IMO. Some switches don't support
>> this although it is a requirement. Full DR routing could be used here
>> rather
> 
> I first implemented an algorithm like this for switches based on Gamla
> chips, and then for Anafa. If something doesn't support it, it is
> very uncommon.
> 
> Thus, a simple global off switch for parallelism seems reasonable if
> you think there is a worry. Frankly, I'm not very concerned about such
> significant non-compliance.
> 
>> than the combined DR routing although it would be less efficient in
>> terms of forwarding compared with the combined DR (with LR direct to
>> the previous level of switch).
> 
> The problem with this is it makes managing the concurrency a lot
> harder, and you hit a bottleneck sooner, since you now have to track
> every step along the DR path, not just the parent.
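
For reference, the combined route Jason describes is the one from IBA
14.2.2: the SMP is LID routed as far as the parent switch and then
directed routed the last hop. A rough sketch, with the field names
simplified rather than copied from any header - this is not ib_smp_t:

#include <stdint.h>

struct combined_route {
        uint16_t lrh_dlid;        /* LRH DLID: the parent switch - the LR part */
        uint16_t dr_slid;         /* the SM's LID: marks where LR ends         */
        uint16_t dr_dlid;         /* 0xFFFF (permissive): route ends with DR   */
        uint8_t  hop_cnt;         /* 1: a single DR hop                        */
        uint8_t  hop_ptr;         /* 0 on send                                 */
        uint8_t  initial_path[2]; /* [1] = parent's port toward the new switch */
};

static void build_one_hop_route(struct combined_route *r,
                                uint16_t parent_lid, uint16_t sm_lid,
                                uint8_t exit_port)
{
        r->lrh_dlid = parent_lid;       /* LID routed up to the parent    */
        r->dr_slid  = sm_lid;
        r->dr_dlid  = 0xFFFF;
        r->hop_cnt  = 1;                /* then one DR step to the target */
        r->hop_ptr  = 0;
        r->initial_path[1] = exit_port;
}
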
>  
>>>  3) Send LID routed MADs directly to the target switch to fill in the
>>>     rest of the LFT entries.
>> These should be striped across the switches at a given "level".
> 
> Yes, I imagine this algorithm running in parallel for all switches at
> a level.
> 
>>> Step 2 needs to respect the concurrency limit for the parent switch,
>>> and #3 needs to respect the concurrency limit for the target switch.
>> This is the harder part and also more expensive in terms of
>> computation. This limit might also be overly conservative.
> 
> I don't see how it is more computational, you know the parent switch
> because you are computing a DR path to it. A simple per-switch counter
> is all that is needed.
>  
>> Also, IMO there would be one configuration item for this limit rather
>> than a per switch configuration.
> 
> Yes, two configurables would be excellent. Something like 20 for DR
> and 4 for Get/Set sounds reasonable to me.
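
Something like the following would be all the bookkeeping needed - a
per-switch counter or two plus the two configurable limits; the names
are invented for illustration and are not existing opensm options:

#define MAX_DR_SMPS_PER_SW   20   /* DR forwarding a switch is asked to do */
#define MAX_SETS_PER_SW       4   /* Gets/Sets a target SMA is processing  */

typedef struct sw {
        struct sw *parent;        /* next hop toward the SM in the BFS     */
        int outstanding_sets;     /* Sets in flight to this switch's SMA   */
        int outstanding_dr;       /* DR SMPs in flight through this switch */
} sw_t;

/* may one more SMP of this switch's programming be sent right now? */
static int can_send_to(const sw_t *target)
{
        if (target->outstanding_sets >= MAX_SETS_PER_SW)
                return 0;
        if (target->parent &&
            target->parent->outstanding_dr >= MAX_DR_SMPS_PER_SW)
                return 0;
        return 1;
}
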
> 
>>> Eliminating DR hops will significantly improve MAD round trip time
>>> and give you more possible parallelism before the SMAs in intermediate
>>> switches become overloaded.
>> I can see the MAD round trip time improvement based on the virtual
>> reduction in number of forwarding hops but I don't see the increase in
>> parallelism. SMPs are not flow controlled so I would think the
>> parallelism is the same except that the transaction rate is somewhat
>> higher using LR rather than DR SMPs.
> 
> In real switch implementations today, the DR path has a much lower
> forwarding rate than the LID path. The LID path is wire speed, DR is
> something like 4-20 packets in a burst at best - due to the CPU
> involvement. If you ask 1000 switches to do a set via DR then the
> switch the SM is hooked up to will probably start dropping. If you do
> the same via LID, then no problems.
> 
> This is not any sort of IBA requirement, just a reflection of the
> reality of all current implementations.
> 
> So, LID routing offloads the SMP processing in the switches closest to
> the SM and lets you push more SMPs through them. You go from, say, a
> global 20 parallel DR SMP limit at the SM's switch to a
> 20*N_Switches_At_Level limit, which is much higher.
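
(For example, with 30 switches at a level that would be 20*30 = 600 SMPs
in flight across the fabric at once, instead of 20 squeezing through the
SM's own switch.)
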
> 
>>> into the above process and then all other switch programming MADs can
>>> simply happen via LID routed packets. 
>> Sure; it could be done for many other SMPs but the most important
>> thing in speeding up config time are the LFT/MFT block sets.
> 
> Once you get LID routing setup there is no reason to keep using DR
> after that point. For instance, just converting MFT programming to use
> LID only would probably result in noticeable gains on big fabrics.
> 
> Jason



