[ofw] RE: ipoib connection timeout

Fab Tillier ftillier at windows.microsoft.com
Wed Sep 17 13:28:26 PDT 2008


>Hi,
>
>We recently found that on several systems, different os with different
>hca's ipoib is not able to establish connection due to some timeout.
>Once the hca was disabled and enabled (in device manager) the problem
>was gone. We have a very busy infiniband network: many nodes connected
>and tests running 24x7, but this is nothing compared to client's
>network.
>I think this situation requires better handling, message in system log
>(see below) is not enough. Maybe something repetitive that sends this
>query every few seconds as long as connection is not established when it
>should be. Any thoughts?

IPoIB allows 10 seconds (1 second timeouts, 10 retries) by default to hear back from the SM.  Even if you get past this issue, you will likely run into the same timeouts when querying for paths to respond to ARP requests.  While you may be able to do something internally in IPoIB or IBAL to back off exponentially for these queries, the OS will not give you more time to get a response from the SM, and the ARP resolution will time out.
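
To make the arithmetic concrete, here is a rough sketch in C of the fixed schedule (1 second x 10 attempts) next to a doubling back-off that stays inside the same overall budget.  All names and the 250 ms starting value are made up for illustration; this is not IPoIB or IBAL code, and it treats "10 retries" as 10 attempts in total.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants mirroring the defaults quoted above. */
#define QUERY_TIMEOUT_MS 1000u
#define QUERY_ATTEMPTS   10u

/* Total wait with a fixed per-attempt timeout. */
static unsigned fixed_budget_ms(unsigned timeout_ms, unsigned attempts)
{
    return timeout_ms * attempts;
}

/*
 * Fill `out` with per-attempt timeouts that double each time, clipping the
 * last one so the sum never exceeds budget_ms.  Returns the number of
 * attempts scheduled (at most max_out).
 */
static unsigned backoff_schedule(unsigned first_ms, unsigned budget_ms,
                                 unsigned *out, unsigned max_out)
{
    unsigned n = 0, used = 0, t = first_ms;

    assert(out != NULL);
    while (n < max_out && used < budget_ms) {
        if (t > budget_ms - used)
            t = budget_ms - used;   /* clip to what is left of the budget */
        out[n++] = t;
        used += t;
        t *= 2;                     /* exponential back-off */
    }
    return n;
}
```

Starting at 250 ms with a 10000 ms budget, the schedule is 250, 500, 1000, 2000, 4000 and a clipped 2250: six attempts, still 10 seconds in total.  Backing off changes how the budget is spent, not how large it is, which is why it does not help with the ARP case.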

In my experience, this issue is related to the SA not being in sync with the topology recently discovered by the SM.  What happens is that IPoIB issues the port info query as soon as the IB port is up (that is, once the SM has moved the port to the active state), but the SA doesn't have a record for the port yet.  The SM should update the SA's topology before making the ports active for things to work properly.

The reason disable/enable solves the issue is that by the time IPoIB is enabled again, the SA's topology matches the SM's (there is more of a delay before IPoIB is reported, and the SM brings the HCA's port up in the meantime).  You can get the same result just by disabling/enabling IPoIB.

You could add a delay when the port first comes up and likely see things work properly.  Any such delay should really be implemented in IBAL or in the HCA driver, though ideally the SM would synchronize with the SA earlier.
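
As a rough illustration of such a delay, the check below holds off the first SA query until the port has been active for some settle time.  The structure, the function names and any settle value you pick are hypothetical; none of this is taken from IBAL or an HCA driver, and a real driver would use its own timer facilities.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-port bookkeeping. */
typedef struct {
    bool     port_active;      /* SM has moved the port to the active state */
    uint64_t active_since_ms;  /* timestamp of that transition */
} port_state_t;

/* Milliseconds the port has been active; 0 if the clock reads earlier. */
static uint64_t active_for_ms(const port_state_t *p, uint64_t now_ms)
{
    return now_ms > p->active_since_ms ? now_ms - p->active_since_ms : 0;
}

/* True once the port is active and the settle delay has elapsed. */
static bool may_query_sa(const port_state_t *p, uint64_t now_ms,
                         uint64_t settle_ms)
{
    return p->port_active && active_for_ms(p, now_ms) >= settle_ms;
}

/* How much longer to wait before the first query (0 if none). */
static uint64_t ms_until_query(const port_state_t *p, uint64_t now_ms,
                               uint64_t settle_ms)
{
    uint64_t elapsed = active_for_ms(p, now_ms);
    return elapsed >= settle_ms ? 0 : settle_ms - elapsed;
}
```

With a port that went active at t=1000 ms and an assumed 2000 ms settle time, a query at t=1500 ms would be deferred by another 1500 ms, and one at t=3000 ms would go out immediately.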

-Fab