Re: [ofw] [RFC] Locally generated path records

Hal Rosenstock hal.rosenstock at gmail.com
Mon Jul 21 13:51:29 PDT 2008


Hi Fab,

On Mon, Jul 21, 2008 at 4:02 PM, Fab Tillier
<ftillier at windows.microsoft.com> wrote:
> Hi Hal,
>
>> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
>> Sent: Monday, July 21, 2008 11:40 AM
>>
>> Hi Fab,
>>
>> On Tue, Jul 15, 2008 at 3:23 PM, Fab Tillier
>> <ftillier at windows.microsoft.com> wrote:
>>> How smart is the SM when it generates path records?
>>
>> Not sure what you mean by "smart". Also, is the issue below with
>> OpenSM and if so, which version (Windows or Linux) ?
>
> This is Windows OpenSM, built from the tip of the trunk.

svn trunk is very old and moldy for OpenSM.

>
>> If this is
>> Windows OpenSM, there are improvements in SA performance in more
>> recent versions which would benefit here. This is also an area where
>> more performance can be extracted, although that is a work item which
>> has been among the top priorities AFAIK.
>
> Yeah, it's a shame the maintainers for OpenSM on Windows are MIA.  It's probably been a year since Windows OpenSM was last synchronized with Linux OpenSM.

I think it's more like 2 years now.

>>>  I keep running into issues where the SM can't handle the PR query
>>> traffic when running large MPI jobs.
>>
>> How large is large ?
>
> 512 processes, all trying to connect to one another simultaneously (all point-to-point transfers rather than using some all-to-all MPI operation). So you have an O(n^2) problem.
>
>>>  In many cases, it even falls over and dies.
>>
>> What falls over and dies?
>
> OpenSM.  Crashes.

What's the crash? It may already have been solved, though.

>>> Not a pretty situation.  SM transaction rate (or lack thereof) is
>>> what causes IBAT to fail, and sometimes causes IPoIB to fail.
>>>
>>> The failure sequence for IPoIB goes something like this:
>>>
>>> 1. IPoIB gets an ARP request, reports it to Windows.
>>> 2. Windows sends an ARP response to IPoIB.
>>> 3. IPoIB needs to create an AV to send the ARP response.  It queries
>>>    the SM for a PR so that it can fill the AV attributes.
>>> 4. PR query times out, IPoIB tells Windows it's hung.
>>> 5. Windows resets IPoIB.
>>> 6. IPoIB does a port info query to the SA.
>>> 7. Port info query times out.
>>> 8. IPoIB gives up, logs an event to the event log, and goes into a
>>>    'cable disconnected' state.  It remains in this state until it gets
>>>    a SM reregister request.
>>
>> I'm not sure a PR timeout should result in the end node's stack causing
>> a "cable disconnect", which likely makes things even worse. I forget how
>> this is dealt with in Linux.
>
> Yes, it does indeed make things worse. :)  I experimented with increasing the timeout and changing it to an exponential backoff, and while path queries didn't time out anymore, the network stack timed out waiting for an ARP response, because the ARP response was delayed waiting for a path query.

The PR timeout must be less than the ARP timeout as the PR lookup is
buried in the latter.
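
To put rough numbers on it (the constants here are illustrative, not
what the Windows IPoIB driver actually uses), the sum of all PR query
retries has to fit inside whatever ARP window the OS enforces on top
of us, e.g.:

/* Hypothetical sketch: size PR query retries so the total worst-case
 * delay stays under the ARP timeout the OS enforces above us.
 * All constants are illustrative, not taken from the Windows stack. */
#include <stdio.h>

#define ARP_TIMEOUT_MS   3000   /* assumed OS ARP retransmit window */
#define PR_FIRST_TO_MS    250   /* first PR query timeout           */
#define PR_BACKOFF_MULT     2   /* exponential backoff factor       */

int main(void)
{
    unsigned total = 0, timeout = PR_FIRST_TO_MS, retries = 0;

    /* Keep doubling until the next attempt would blow the ARP budget. */
    while (total + timeout <= ARP_TIMEOUT_MS) {
        total += timeout;
        timeout *= PR_BACKOFF_MULT;
        retries++;
    }

    /* With the numbers above: 250 + 500 + 1000 = 1750 ms over 3 tries;
     * a 4th try (2000 ms more) would exceed the 3000 ms ARP window.   */
    printf("%u PR attempts, %u ms worst case\n", retries, total);
    return 0;
}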

>>> At this point, the node is unusable for IB traffic.  You have to
>>> either: a). restart OpenSM so that every node gets a SM reregister
>>> request
>>
>> Shouldn't a cable disconnect cause an individual SM reregister to that
>> one node? Sounds like a bug in this old Windows OpenSM. I know there
>> were some fixes around this in later (Linux) OpenSMs.
>
> Well, the SM doesn't think there's anything wrong with the node.  The case where the node is too busy to respond to the SM and the SM takes the node out of the fabric will result in a SM reregister for that node once it becomes responsive to QP0 MADs.  The reverse, where the SM is too busy and the client times out, doesn't have an elegant solution that I can think of other than just waiting and trying again later (which IPoIB doesn't do beyond a 10 second timeout).

Afraid so.

>
>>> b). disable/enable IPoIB so that it tries again (hopefully with better
>>> luck)
>>>
>>> So I've been making some changes to minimize the dependency on the
>>> SM/SA at runtime, beyond initial configuration.
>>
>> There are some dangers going down this path. The Linux SA cache has
>> yet to be pushed upstream AFAIK.
>
> Right, I'm trying to identify and understand the dangers.  The SA as a single point of data (and failure) seems like a lousy architectural choice.

Well, part of that comment is based on a really old and poor
performing version of OpenSM. Also, more performance can be squeezed
out of the SA via more granular locking, multiprocessor support, etc.,
but it would still hit some transaction rate limit depending on
cluster size. Beyond that, nothing precludes a distributed SA, but
that is non-trivial to architect/design.

> Everyone trying to scale things ends up doing whatever they can to avoid the SM/SA because it limits scalability, whether it is static configuration, using sockets to establish connections, caching path queries, or what have you.  It all comes back to the SA not giving the performance that is needed.

See above for what I believe is the safest approach.

>>> Phase 1 of the change is to eliminate path record queries from IPoIB.
>>> Using the information from an ARP request or a work completion, along
>>> with information from the broadcast group, I can create address vectors
>>> without having to go chat with the SM.  This has shown good results so
>>> far.
>>
>> I don't think that works in the most general case.
>
> Is it just the GRH issue?  If something else, what?

It's the GRH issues (flow label, traffic class, hop limit) as well as
the possibly reduced static rate.

>>> Here's where I get the various AV attribute parameters:
>>> Service Level: broadcast group
>>> DLID: Work Completion
>>> GRH Hop Limit: broadcast group
>>> GRH Flow Label: broadcast group
>>> GRH traffic class: broadcast group
>>
>> The GRH fields above may not be right in the most general case.
>
> Can you explain in a bit more detail?  If the fields are wrong, are they broken wrong, or just sub-optimal wrong?

I think they could be wrong, but that remains a matter for the router
spec, which has yet to be completed.

>>> GRH destination GID: endpoint (from ARP request)
>>> GRH source GID: create subnet local GID using port GUID
>>> Static Rate: broadcast group
>>  That might be overly restrictive (it would still work at a reduced
>> rate).
>
> Yes, this does restrict all communication to the speed of the MC group.  Still, slow and reliable is better than fast and unstable.
>
>>> The source and destination GIDs are the only fields that are used
>>> identically as the current path record based mechanism.
>>>
>>> Can anyone find any pitfalls in doing this?  I understand that
>>> bandwidth will be limited to the speed of the broadcast group, and I
>>> think I'm OK with that because generally the fabric is homogeneous, and
>>> MC group rate problems are indicative of some configuration or fabric
>>> issue.  What happens to MC rate when you have IB routers in the mix?
>>
>> I think that's an interesting problem yet to be solved but I would
>> expect the SA to return the correct static rate (for multicast) even
>> in the case where subnets are crossed.
>
> So the MC group static rate would be the floor for communication for anyone in the group then.

I think so; if the path from anyone to the MC group is at some minimum
rate, then the unicast path between any two members would have to be
at least that minimum.

> That does what I need it to then.
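
FWIW, here is roughly what the Phase 1 AV construction above boils
down to.  This is only a sketch using the Linux libibverbs API since
that's the one I can write from memory (the Windows IBAL structures
differ, but the field mapping is the same); the bcast_* parameters
stand in for values cached from the broadcast group's MCMemberRecord,
remote_gid comes from the ARP payload, and wc is the receive
completion:

/* Sketch of building an AV from a work completion plus cached
 * broadcast group parameters, with no SA path record query. */
#include <infiniband/verbs.h>
#include <string.h>

struct ibv_ah *make_ah_from_wc(struct ibv_pd *pd, uint8_t port_num,
                               const struct ibv_wc *wc,
                               const union ibv_gid *remote_gid,
                               uint8_t bcast_sl, uint8_t bcast_rate,
                               uint8_t bcast_hop_limit,
                               uint32_t bcast_flow_label,
                               uint8_t bcast_tclass)
{
    struct ibv_ah_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.dlid        = wc->slid;     /* DLID: from the work completion */
    attr.sl          = bcast_sl;     /* SL: from the broadcast group   */
    attr.static_rate = bcast_rate;   /* rate: floor of the MC group    */
    attr.port_num    = port_num;

    attr.is_global         = 1;            /* always send a GRH        */
    attr.grh.dgid          = *remote_gid;  /* DGID: endpoint GID (ARP) */
    attr.grh.sgid_index    = 0;            /* assume index 0 holds the
                                              subnet-local GID built
                                              from the port GUID       */
    attr.grh.hop_limit     = bcast_hop_limit;
    attr.grh.flow_label    = bcast_flow_label;
    attr.grh.traffic_class = bcast_tclass;

    return ibv_create_ah(pd, &attr);        /* NULL on failure */
}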

>>> Now onto Phase 2: Making IPoIB generate path records for IBAT clients
>>> rather than going to the SM.  The problem here is that if you have
>>> enough clients all going to the SM at the same time, they all end up in
>>> a situation where their queries timeout and they retry.  I've gone to
>>> an exponential backoff for retries, and even with a maximum retry
>>> timeout of 2 minutes things never got past querying for path
>>> records.  A local
>>> PR cache would help here, and that's another option, but then you have
>>> issues with stale entries etc.
>>
Yes, that's what was implemented for Linux and not pushed upstream
AFAIK, though I think it's in OFED (but I could be wrong).
>
> Right, I think a cache is more complicated than rolling path records in IPoIB.

Yes, it's more complicated but can cover more cases.

>>> Reversible: Hard code to 1
>>
>> This is for MADs but not required for IPoIB (UD or RC or ...).
>
> So path records for UD and RC don't need to be reversible?  Is zero thus a better hard coded value?
>
>>> NumbPath: Hard code to 1
>>
>> Only needed for GetTable, not Get.
>
> When a client does a GET, what does the SA set this to?  I was just trying to find reasonable values rather than leaving things uninitialized.

Such components are usually set to 0 by first clearing the entire
request MAD, but it doesn't matter as the SA will ignore this field
with a Get (though not with a GetTable, where this component is
meaningful).
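
FWIW, here is a rough sketch of what a locally generated record could
look like, pulling together the components you listed (including the
ones quoted below).  The struct is a stripped-down stand-in rather
than the stack's actual path record type, and the bcast_*/port_*
inputs are the cached broadcast group and port values:

/* Hypothetical local PathRecord construction for Phase 2: answer an
 * IBAT path query from cached data instead of querying the SA. */
#include <stdint.h>
#include <string.h>

struct local_path_rec {
    uint8_t  dgid[16], sgid[16];
    uint16_t dlid, slid;
    uint16_t pkey;
    uint8_t  sl;
    uint8_t  mtu;           /* selector in the high 2 bits, as in a PR */
    uint8_t  rate;
    uint8_t  pkt_life;
    uint8_t  reversible;    /* hard coded to 1 per the proposal        */
    uint8_t  num_path;      /* 1; ignored by the SA on a Get anyway    */
};

void build_local_path_rec(struct local_path_rec *pr,
                          const uint8_t remote_gid[16],
                          const uint8_t local_gid[16],
                          uint16_t remote_lid, uint16_t local_lid,
                          uint16_t port_pkey,
                          uint8_t bcast_sl, uint8_t bcast_mtu,
                          uint8_t bcast_rate, uint8_t bcast_life)
{
    memset(pr, 0, sizeof(*pr));        /* clear, as a request MAD would be  */
    memcpy(pr->dgid, remote_gid, 16);  /* endpoint GID from the ARP request */
    memcpy(pr->sgid, local_gid, 16);   /* subnet-local GID from port GUID   */
    pr->dlid       = remote_lid;       /* from the work completion          */
    pr->slid       = local_lid;
    pr->pkey       = port_pkey;        /* same as the IPoIB port object     */
    pr->sl         = bcast_sl;         /* broadcast group                   */
    pr->mtu        = bcast_mtu;        /* broadcast group                   */
    pr->rate       = bcast_rate;       /* broadcast group (floor)           */
    pr->pkt_life   = bcast_life;       /* broadcast group                   */
    pr->reversible = 1;
    pr->num_path   = 1;
}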

-- Hal

>>> PKey: Same as IPoIB port object
>>> MTU: broadcast group
>>> Rate: broadcast group
>>> Packet Life: broadcast group
>>
>> Similar issue with Packet Life as with rate since the broadcast value
>> might not be same as unicast (PR).
>
> Right.
>
> Thanks for the feedback!
> -Fab
>


