[ofw] [RFC] Locally generated path records

Fab Tillier ftillier at windows.microsoft.com
Tue Jul 15 12:23:49 PDT 2008


How smart is the SM when it generates path records?  I keep running into issues where the SM can't handle the PR query traffic when running large MPI jobs.  In many cases, it even falls over and dies.  Not a pretty situation.  SM transaction rate (or lack thereof) is what causes IBAT to fail, and sometimes causes IPoIB to fail.  The failure sequence for IPoIB goes something like this:

1. IPoIB gets an ARP request, reports it to Windows.
2. Windows sends an ARP response to IPoIB.
3. IPoIB needs to create an AV to send the ARP response.  It queries the SM for a PR so that it can fill the AV attributes.
4. PR query times out, IPoIB tells Windows it's hung.
5. Windows resets IPoIB.
6. IPoIB does a port info query to the SA.
7. Port info query times out.
8. IPoIB gives up, logs an event to the event log, and goes into a 'cable disconnected' state.  It remains in this state until it gets an SM reregister request.

At this point, the node is unusable for IB traffic.  You have to either:
a) restart OpenSM so that every node gets an SM reregister request, or
b) disable/enable IPoIB so that it tries again (hopefully with better luck)

So I've been making some changes to minimize the dependency on the SM/SA at runtime, beyond initial configuration.

Phase 1 of the change is to eliminate path record queries from IPoIB.  Using the information from an ARP request or a work completion, along with information from the broadcast group, I can create address vectors without having to go chat with the SM.  This has shown good results so far.

Here's where I get the various AV attribute parameters:
Service Level: broadcast group
DLID: Work Completion
GRH Hop Limit: broadcast group
GRH Flow Label: broadcast group
GRH traffic class: broadcast group
GRH destination GID: endpoint (from ARP request)
GRH source GID: create subnet local GID using port GUID
Static Rate: broadcast group

The source and destination GIDs are the only fields that are populated the same way as in the current path-record-based mechanism.
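To make the mapping above concrete, here is a rough sketch of how the AV attributes could be filled in.  All struct and function names are hypothetical (they are not the real IBAL or IPoIB types), and I'm assuming that "subnet local GID" means the default link-local prefix FE80::/64 followed by the port GUID, and that the DLID is taken from the source LID reported in the receive work completion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical, for illustration only. */
struct sk_gid { uint8_t raw[16]; };      /* 128-bit GID, big-endian bytes */

struct sk_bcast_group {                  /* values learned from the broadcast group */
    uint8_t  sl;
    uint8_t  hop_limit;
    uint32_t flow_label;
    uint8_t  traffic_class;
    uint8_t  static_rate;
};

struct sk_av_attr {
    uint8_t       sl;
    uint16_t      dlid;
    uint8_t       hop_limit;
    uint32_t      flow_label;
    uint8_t       traffic_class;
    struct sk_gid dgid;
    struct sk_gid sgid;
    uint8_t       static_rate;
};

/* Assumption: the source GID uses the default link-local prefix. */
#define SK_LINK_LOCAL_PREFIX 0xFE80000000000000ULL

/* Build a GID from a 64-bit prefix and a 64-bit GUID (host-order inputs). */
void sk_make_gid(uint64_t prefix, uint64_t guid, struct sk_gid *gid)
{
    assert(gid != NULL);
    for (int i = 0; i < 8; i++) {
        gid->raw[i]     = (uint8_t)(prefix >> (56 - 8 * i));
        gid->raw[8 + i] = (uint8_t)(guid   >> (56 - 8 * i));
    }
}

/* Fill an AV without querying the SM.
 * wc_slid:    source LID from the receive work completion (becomes the DLID)
 * remote_gid: endpoint GID carried in the ARP request */
void sk_build_av(const struct sk_bcast_group *bc, uint16_t wc_slid,
                 const struct sk_gid *remote_gid, uint64_t local_port_guid,
                 struct sk_av_attr *av)
{
    assert(bc != NULL && remote_gid != NULL && av != NULL);
    memset(av, 0, sizeof *av);
    av->sl            = bc->sl;
    av->dlid          = wc_slid;
    av->hop_limit     = bc->hop_limit;
    av->flow_label    = bc->flow_label;
    av->traffic_class = bc->traffic_class;
    av->dgid          = *remote_gid;
    sk_make_gid(SK_LINK_LOCAL_PREFIX, local_port_guid, &av->sgid);
    av->static_rate   = bc->static_rate;
}
```

The only per-endpoint inputs are the LID and the remote GID; everything else is copied from the broadcast group once it has been joined.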

Can anyone find any pitfalls in doing this?  I understand that bandwidth will be limited to the speed of the broadcast group, and I think I'm OK with that because generally the fabric is homogeneous, and MC group rate problems are indicative of some configuration or fabric issue.  What happens to MC rate when you have IB routers in the mix?

Now on to Phase 2: making IPoIB generate path records for IBAT clients rather than going to the SM.  The problem here is that if you have enough clients all going to the SM at the same time, they all end up in a situation where their queries time out and they retry.  I've gone to an exponential backoff for retries, and even with a maximum retry interval of 2 minutes, things never got past querying for path records.  A local PR cache would help here, and that's another option, but then you have issues with stale entries, etc.  So I'd rather generate path records in IPoIB, where stale information is less likely (since the system will resend an ARP if a sufficient time interval has gone by).
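For reference, the kind of capped exponential backoff described above can be sketched as follows.  This is not the actual retry code; the function name is made up, and it assumes the 2-minute figure is a cap on the delay between retries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: delay before retry number `attempt` (0-based).
 * The delay starts at base_ms, doubles on every retry, and is clamped
 * to max_ms. */
uint32_t sk_retry_delay_ms(uint32_t base_ms, uint32_t max_ms, unsigned attempt)
{
    assert(base_ms <= max_ms);

    uint64_t delay = base_ms;      /* 64-bit so doubling cannot overflow */
    while (attempt-- > 0 && delay < max_ms)
        delay *= 2;

    return (uint32_t)(delay < max_ms ? delay : max_ms);
}
```

With a 1-second base and a 120-second cap, the delays run 1, 2, 4, ... 64 seconds and then stay at 120.  Backoff only spreads the load out; once enough clients are retrying at the cap, the SM still sees more queries than it can answer.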

To create a path record, IPoIB needs the following values (in addition to the ones it has access to for the AV creation):
SLID: Can be stored in IPoIB port object (__endpt_mgr_add_local gets it)
Reversible: Hard code to 1
NumbPath: Hard code to 1
PKey: Same as IPoIB port object
MTU: broadcast group
Rate: broadcast group
Packet Life: broadcast group
Preference: 0
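Put together with the AV inputs, generating a path record locally would look roughly like the sketch below.  The struct layouts and names are hypothetical (they do not match the wire format or the IBAL path record type); the point is only to show where each field comes from.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical, for illustration only. */
struct pr_gid { uint8_t raw[16]; };

struct pr_bcast_info {          /* values taken from the broadcast group */
    uint8_t  sl, hop_limit, traffic_class;
    uint8_t  mtu, rate, pkt_life;
    uint32_t flow_label;
};

struct pr_port_info {           /* values kept in the IPoIB port object */
    uint16_t      lid;
    uint16_t      pkey;
    struct pr_gid gid;
};

struct pr_path_rec {
    struct pr_gid dgid, sgid;
    uint16_t      dlid, slid;
    uint32_t      flow_label;
    uint8_t       hop_limit, traffic_class;
    uint8_t       reversible, num_path;
    uint16_t      pkey;
    uint8_t       sl, mtu, rate, pkt_life, preference;
};

/* Build a path record to a remote endpoint without an SA query.
 * remote_gid and remote_lid are what IPoIB already tracks per endpoint. */
void pr_build(const struct pr_port_info *port, const struct pr_bcast_info *bc,
              const struct pr_gid *remote_gid, uint16_t remote_lid,
              struct pr_path_rec *pr)
{
    assert(port != NULL && bc != NULL && remote_gid != NULL && pr != NULL);
    memset(pr, 0, sizeof *pr);

    pr->dgid          = *remote_gid;
    pr->sgid          = port->gid;
    pr->dlid          = remote_lid;
    pr->slid          = port->lid;       /* stored in the port object */
    pr->flow_label    = bc->flow_label;
    pr->hop_limit     = bc->hop_limit;
    pr->traffic_class = bc->traffic_class;
    pr->reversible    = 1;               /* hard coded */
    pr->num_path      = 1;               /* hard coded */
    pr->pkey          = port->pkey;
    pr->sl            = bc->sl;
    pr->mtu           = bc->mtu;
    pr->rate          = bc->rate;
    pr->pkt_life      = bc->pkt_life;
    pr->preference    = 0;
}
```

Nothing here depends on the SM at query time, so an IBAT client's request can be answered from state IPoIB already holds.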

Again, any glaring issues with doing this?

Thanks,
-Fab




