[ofw] [RFC] Remove path query from IPoIB

Hal Rosenstock hal.rosenstock at gmail.com
Wed Aug 6 10:03:43 PDT 2008


Hi again Fab,

On Wed, Aug 6, 2008 at 11:39 AM, Fab Tillier
<ftillier at windows.microsoft.com> wrote:
> Hi Hal,
>
> Thanks for taking a look and responding.
>
>> Hi Fab,
>>
>> On Tue, Aug 5, 2008 at 6:25 PM, Fab Tillier
>> <ftillier at windows.microsoft.com> wrote:
>>> I wanted to get this out for comments before I complete testing.
>>> This change removes path queries from unicast traffic in IPoIB.  It
>>> makes use of information from receive work completions (SLID, SGID) and
>>> the broadcast group (SL, flow label, hop limit, traffic class, static
>>> rate) to form the address vectors.
>>
>> I don't think it's required that the flow label, hop limit, and
>> traffic class are the same for unicast and broadcast traffic. (I
>> forget about whether the same is true for SL). Also, static rate might
>> be pessimistic for unicast.
>
> Can the flow label, hop limit, traffic class, static rate, and service level from the MC group be wrong for unicast traffic, versus just suboptimal?  I'm OK with things not being optimal as long as they're not broken, because in the common case today these settings are the same for unicast and multicast (at least from my investigation into OpenSM).

As far as static rate goes, it is just suboptimal in a subnet with
non-homogeneous rates. For SL, I believe IPoIB in Linux uses the
partition SL (broadcast group SL) for unicast. It is the GRH
parameters, which are not needed for intrasubnet traffic, where there
is less certainty. Currently, within OpenSM, HopLimit just depends on
whether things are intrasubnet or not (intersubnet just uses max
HopLimit, but configuration could be added). I don't think traffic
class is really used. The use of these will be better defined by the
routing spec when it is completed.
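
To make the proposal under discussion concrete, here is a minimal sketch
in C of forming a unicast address vector from a receive completion plus
the broadcast group parameters. All type, field, and function names here
are hypothetical and for illustration only; this is not the WinOF/IBAL
or verbs API, and the need_grh flag is an assumption standing in for
however a driver decides whether the destination is off-subnet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; not a real IB stack API. */
struct sketch_gid { uint8_t raw[16]; };

/* What the sender learns from a receive work completion. */
struct rx_completion {
    uint16_t slid;            /* source LID of the received packet */
    struct sketch_gid sgid;   /* source GID from the GRH */
};

/* Parameters of the IPoIB broadcast (MC) group. */
struct bcast_group_params {
    uint8_t  sl;
    uint32_t flow_label;      /* 20 bits are meaningful in the GRH */
    uint8_t  hop_limit;
    uint8_t  traffic_class;
    uint8_t  static_rate;     /* may be pessimistic for unicast */
};

struct addr_vector {
    uint16_t dlid;
    uint8_t  sl;
    uint8_t  static_rate;
    bool     use_grh;
    struct sketch_gid dgid;
    uint32_t flow_label;
    uint8_t  hop_limit;
    uint8_t  traffic_class;
};

/*
 * Build a unicast address vector without an SA path query.
 * need_grh is an assumed caller-supplied flag: the GRH fields are
 * only filled in when a GRH is actually required.
 */
static struct addr_vector
build_unicast_av(const struct rx_completion *wc,
                 const struct bcast_group_params *bc,
                 bool need_grh)
{
    struct addr_vector av;

    assert(wc != NULL && bc != NULL);
    memset(&av, 0, sizeof av);

    av.dlid        = wc->slid;        /* reply to where the packet came from */
    av.sl          = bc->sl;          /* reuse the broadcast group SL */
    av.static_rate = bc->static_rate; /* works, but possibly suboptimal */
    av.use_grh     = need_grh;

    if (need_grh) {
        av.dgid          = wc->sgid;
        av.flow_label    = bc->flow_label & 0xFFFFFu; /* keep 20 bits */
        av.hop_limit     = bc->hop_limit;
        av.traffic_class = bc->traffic_class;
    }
    return av;
}
```

In the intrasubnet case (need_grh false) only the DLID, SL, and static
rate matter, which are the fields with the least doubt above; the GRH
fields copied from the broadcast group are the uncertain part.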

> The theory here is that if the MC group can reach everyone in the broadcast group, these values must be at least good enough to reach everyone via unicast packets.  Is this incorrect?

MC parameters are for the group, which is different from any unicast
communication. I'm not sure that the parameters by which a source can
reach the broadcast (MC) group are guaranteed to work from any source
to any destination in that group (unicast). It may be a matter of
implementation, but it seems that, short of limitations on the GRH
fields, this should work AFAICT right now (along with the possibly
suboptimal rate).

> I have issues today running MPI jobs in 64-node clusters because the SM can't keep up.  I don't think any tweaks to the SM can give me an order of magnitude better performance.

How fast/beefy was your SM node?

<soapbox>
I think this is perhaps both an implementation issue (with SMs/OpenSM)
as they need to scale better in SA performance with the number of
cores and also perhaps an architectural one for the IBTA. I don't
think it's a good thing to keep subverting the architecture (for
possibly valid reasons) as new issues are introduced by doing this.
</soapbox>

-- Hal

> I tried exponential back-off for SA queries, but that just moved the problem up the stack to the ARP requests timing out because the ARP responses were waiting on the SM for a path.  Things only get worse as the number of cores in a server increases.

> -Fab
>


