Re: [ofw] [IPoIB] Problem with "Avoid the SM" patch

Fab Tillier ftillier at windows.microsoft.com
Thu Sep 18 09:03:59 PDT 2008


>> I understood that Fab checked this issue (by 10 retries of 1 second TO)
>> and found that it didn't help there. Yet another try can be enlarging
>> the TO to be 5 sec and sending less retries
>
> I think some exponential backoff strategy with some randomization
> might be better.

The problem with this is that the layers above IPoIB (namely the network stack generating ARP requests and expecting ARP responses) don't have visibility into this backoff strategy, and will give up on an ARP request if the response doesn't come back in time.  The response could be delayed for a long time if the SM isn't responding to queries in a timely manner, since IPoIB needs to resolve the path in order to send the unicast response.  I don't know the timeout for an ARP response, but I'd be surprised if it were 10 seconds, let alone whatever you would get with exponential backoff.
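
To put rough numbers on the point above, here is a minimal Python sketch that compares the worst-case elapsed time of a fixed retry schedule with an exponential one.  The 10 retries and 1-second timeout come from this thread; the doubling factor, the function names and the 3-second ARP deadline are assumptions picked for illustration only, since the real ARP timeout isn't known here.

```python
def fixed_schedule(retries, timeout_s):
    """Elapsed time at the moment each successive attempt times out."""
    return [timeout_s * (i + 1) for i in range(retries)]


def backoff_schedule(retries, base_s, factor=2.0, max_jitter=0.0):
    """Worst-case elapsed time at each timeout with exponential backoff.

    max_jitter is the largest fraction of each timeout that randomization
    may add; the worst case assumes all of it is added every time.
    """
    elapsed = 0.0
    schedule = []
    for attempt in range(retries):
        elapsed += base_s * factor ** attempt * (1.0 + max_jitter)
        schedule.append(elapsed)
    return schedule


def attempts_within(schedule, deadline_s):
    """Number of attempts that have fully timed out by the deadline."""
    return sum(1 for t in schedule if t <= deadline_s)


if __name__ == "__main__":
    ARP_DEADLINE_S = 3.0  # hypothetical value, not taken from any stack

    fixed = fixed_schedule(retries=10, timeout_s=1.0)
    backoff = backoff_schedule(retries=10, base_s=1.0)

    print("fixed worst case:  ", fixed[-1], "s")
    print("backoff worst case:", backoff[-1], "s")
    print("fixed attempts before deadline:  ",
          attempts_within(fixed, ARP_DEADLINE_S))
    print("backoff attempts before deadline:",
          attempts_within(backoff, ARP_DEADLINE_S))
```

Under these assumptions, backoff gets fewer path queries out before the upper layer gives up, and its worst-case total wait is far longer than the fixed schedule's 10 seconds.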

I initially tried exponential backoff to resolve the problem I was seeing with these MPI apps, and it didn't work because of this.  That's when I set out on a path to take the SM out of the equation as much as possible.

-Fab


