Re: [ofw] [RFC] Locally generated path records

Hal Rosenstock hal.rosenstock at gmail.com
Mon Jul 21 11:39:34 PDT 2008


Hi Fab,

On Tue, Jul 15, 2008 at 3:23 PM, Fab Tillier
<ftillier at windows.microsoft.com> wrote:
> How smart is the SM when it generates path records?

Not sure what you mean by "smart". Also, is the issue below with
OpenSM, and if so, which version (Windows or Linux)? If this is the
Windows OpenSM, more recent versions include SA performance
improvements which would help here. This is also an area where more
performance can be extracted; AFAIK that work item has been among the
top priorities.

>  I keep running into issues where the SM can't handle the PR query traffic when running large MPI jobs.

How large is large?

>  In many cases, it even falls over and dies.

What falls over and dies?

> Not a pretty situation.  SM transaction rate (or lack thereof) is what causes IBAT to fail, and sometimes causes IPoIB to fail.

> The failure sequence for IPoIB goes something like this:
>
> 1. IPoIB gets an ARP request, reports it to Windows.
> 2. Windows sends an ARP response to IPoIB.
> 3. IPoIB needs to create an AV to send the ARP response.  It queries the SM for a PR so that it can fill the AV attributes.
> 4. PR query times out, IPoIB tells Windows it's hung.
> 5. Windows resets IPoIB
> 6. IPoIB does a port info query to the SA
> 7. Port info query times out
> 8. IPoIB gives up, logs an event to the event log, and goes into a 'cable disconnected' state.  It remains in this state until it gets a SM reregister request.

I'm not sure a PR timeout should result in the end stack declaring a
"cable disconnect"; that likely makes things even worse. I forget how
this is dealt with in Linux.
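
For illustration, the failure sequence above as a minimal state sketch
in C (the state names are hypothetical, not the actual Windows IPoIB
driver's):

/* Hypothetical link states mirroring steps 3-8 of the sequence above. */
enum ipoib_link_state {
    IPOIB_STATE_UP,           /* normal operation, ARP traffic flowing   */
    IPOIB_STATE_PR_QUERY,     /* step 3: waiting on the SA for a PR      */
    IPOIB_STATE_HUNG,         /* step 4: PR timed out, reported to OS    */
    IPOIB_STATE_RESETTING,    /* steps 5-7: reset plus port info query   */
    IPOIB_STATE_DISCONNECTED, /* step 8: waits for an SM reregister      */
};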

> At this point, the node is unusable for IB traffic.  You have to either:
> a). restart OpenSM so that every node gets a SM reregister request

Shouldn't a cable disconnect cause an individual SM reregister to that
one node? That sounds like a bug in this old Windows OpenSM. I know
there were some fixes around this in later (Linux) OpenSMs.

> b). disable/enable IPoIB so that it tries again (hopefully with better luck)
>
> So I've been making some changes to minimize the dependency on the SM/SA at runtime, beyond initial configuration.

There are some dangers going down this path. The Linux SA cache has
yet to be pushed upstream AFAIK.

> Phase 1 of the change is to eliminate path record queries from IPoIB.  Using the information from an ARP request or a work completion, along with information from the broadcast group, I can create address vectors without having to go chat with the SM.  This has shown good results so far.

I don't think that works in the most general case.

> Here's where I get the various AV attribute parameters:
> Service Level: broadcast group
> DLID: Work Completion
> GRH Hop Limit: broadcast group
> GRH Flow Label: broadcast group
> GRH traffic class: broadcast group

The GRH fields above may not be right in the most general case.

> GRH destination GID: endpoint (from ARP request)
> GRH source GID: create subnet local GID using port GUID
> Static Rate: broadcast group

That might be overly restrictive (it would still work at a reduced rate).
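
Putting the fields above together, a minimal sketch of the Phase 1 AV
construction, written against the Linux libibverbs types for
illustration (the Windows structures differ); struct bcast_info is a
hypothetical cache of the broadcast group's MCMemberRecord fields:

#include <infiniband/verbs.h>

/* Hypothetical cache of the broadcast group's MCMemberRecord fields. */
struct bcast_info {
    uint8_t  sl;            /* Service Level            */
    uint8_t  hop_limit;     /* GRH Hop Limit            */
    uint32_t flow_label;    /* GRH Flow Label           */
    uint8_t  traffic_class; /* GRH Traffic Class        */
    uint8_t  rate;          /* Static Rate (IB encoded) */
};

static struct ibv_ah *av_without_pr(struct ibv_pd *pd, uint8_t port_num,
                                    const struct ibv_wc *wc,
                                    const union ibv_gid *dgid,
                                    const struct bcast_info *bc)
{
    struct ibv_ah_attr attr = {
        .dlid        = wc->slid,  /* DLID from the work completion      */
        .sl          = bc->sl,    /* SL from the broadcast group        */
        .static_rate = bc->rate,  /* may be below the true unicast rate */
        .is_global   = 1,
        .port_num    = port_num,
        .grh = {
            .dgid          = *dgid,  /* from the ARP request            */
            .flow_label    = bc->flow_label,
            .hop_limit     = bc->hop_limit,
            .traffic_class = bc->traffic_class,
            .sgid_index    = 0,  /* subnet-local GID built on port GUID */
        },
    };

    return ibv_create_ah(pd, &attr);
}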

> The source and destination GIDs are the only fields that are used identically to the current path record based mechanism.
>
> Can anyone find any pitfalls in doing this?  I understand that bandwidth will be limited to the speed of the broadcast group, and I think I'm OK with that because generally the fabric is homogeneous, and MC group rate problems are indicative of some configuration or fabric issue.  What happens to MC rate when you have IB routers in the mix?

I think that's an interesting problem yet to be solved but I would
expect the SA to return the correct static rate (for multicast) even
in the case where subnets are crossed.

> Now onto Phase 2: Making IPoIB generate path records for IBAT clients rather than going to the SM.  The problem here is that if you have enough clients all going to the SM at the same time, they all end up in a situation where their queries time out and they retry.  I've gone to an exponential backoff for retries, and even with a maximum retry interval of 2 minutes things never got past querying for path records.  A local PR cache would help here, and that's another option, but then you have issues with stale entries etc.

Yes, that's what was implemented for Linux. AFAIK it was not pushed
upstream, but I think it's in OFED; I could be wrong.
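
For reference, a minimal sketch of the exponential backoff Fab
describes, capped at the 2 minute interval he mentions (the constants
and names are assumptions, not the actual code):

#include <stdint.h>

#define PR_RETRY_INITIAL_MS 250               /* assumed first interval */
#define PR_RETRY_MAX_MS     (2 * 60 * 1000)   /* 2 minute cap           */

/* Double the retry interval on each PR query timeout, up to the cap. */
static uint32_t next_pr_retry_ms(uint32_t cur_ms)
{
    if (cur_ms == 0)
        return PR_RETRY_INITIAL_MS;
    if (cur_ms >= PR_RETRY_MAX_MS / 2)
        return PR_RETRY_MAX_MS;
    return cur_ms * 2;
}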

> So I'd rather generate path records in IPoIB, where stale information is less likely (since the system will resend an ARP if a sufficient time interval has gone by).
>
> To create a path record, IPoIB needs the following values (in addition to the ones it has access to for the AV creation):
> SLID: Can be stored in IPoIB port object (__endpt_mgr_add_local gets it)
> Reversible: Hard code to 1

Reversible is needed for MADs but not required for IPoIB (UD or RC or ...).

> NumbPath: Hard code to 1

NumbPath is only needed for a GetTable request, not a Get.

> PKey: Same as IPoIB port object
> MTU: broadcast group
> Rate: broadcast group
> Packet Life: broadcast group

Similar issue with Packet Life as with rate: the broadcast value might
not be the same as the unicast (PR) one.
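
To make the Phase 2 construction concrete, a rough sketch of filling a
path record locally from the same cached data; field names follow the
Linux kernel's struct ib_sa_path_rec for illustration, and the port and
bcast structures are hypothetical caches, not the actual IPoIB driver's
objects:

#include <linux/string.h>
#include <rdma/ib_sa.h>

struct port_cache  { u16 slid; u16 pkey; };                  /* hypothetical */
struct bcast_cache { u8 sl; u8 mtu; u8 rate; u8 pkt_life; }; /* hypothetical */

static void build_local_path_rec(struct ib_sa_path_rec *rec,
                                 const struct port_cache *port,
                                 const struct bcast_cache *bcast,
                                 union ib_gid sgid, union ib_gid dgid,
                                 u16 dlid)
{
    memset(rec, 0, sizeof(*rec));
    rec->sgid = sgid;                     /* subnet-local, from port GUID */
    rec->dgid = dgid;                     /* from the ARP request         */
    rec->slid = cpu_to_be16(port->slid);  /* cached at port bring-up      */
    rec->dlid = cpu_to_be16(dlid);        /* from the work completion     */
    rec->reversible = 1;                  /* per above: only MADs need it */
    rec->numb_path  = 1;                  /* only meaningful for GetTable */
    rec->pkey = cpu_to_be16(port->pkey);
    rec->sl   = bcast->sl;
    rec->mtu_selector = IB_SA_EQ;
    rec->mtu  = bcast->mtu;               /* broadcast vs unicast caveat  */
    rec->rate_selector = IB_SA_EQ;
    rec->rate = bcast->rate;
    rec->packet_life_time_selector = IB_SA_EQ;
    rec->packet_life_time = bcast->pkt_life;  /* same caveat as rate      */
    rec->preference = 0;
}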

-- Hal

> Preference: 0
>
> Again, any glaring issues with doing this?
>
> Thanks,
> -Fab