[ofw] RE: ipoib connection timeout

Hal Rosenstock hal.rosenstock at gmail.com
Thu Sep 18 03:26:03 PDT 2008


Hi Fab,

On Wed, Sep 17, 2008 at 4:28 PM, Fab Tillier
<ftillier at windows.microsoft.com> wrote:
>>Hi,
>>
>>We recently found that on several systems, with different OSes and
>>different HCAs, IPoIB is not able to establish a connection due to a timeout.
>>Once the hca was disabled and enabled (in device manager) the problem
>>was gone. We have a very busy infiniband network: many nodes connected
>>and tests running 24x7, but this is nothing compared to client's
>>network.
>>I think this situation requires better handling; a message in the system
>>log (see below) is not enough. Maybe something repetitive that resends the
>>query every few seconds for as long as the connection is not established
>>when it should be. Any thoughts?
>
> IPoIB allows 10 seconds (1 second timeouts, 10 retries) by default to hear back from the SM.  Even if you get past this issue, you will likely run into the same timeouts when querying for paths to respond to ARP requests.  While you may be able to do something internally to IPoIB or IBAL to exponentially back off for these queries, the OS will not give you more time to get a response from the SM, and the ARP resolution will time out.
>
> In my experience, this issue is related to the SA not being in sync with the topology recently discovered by the SM.

How did you determine this?

Which SM? If OpenSM, which version? Is it recent?

At least in terms of OpenSM, I'm not sure what you mean by in sync
with the recently discovered topology, as the SA and SM share the same data.

>  What happens is that IPoIB will issue the port info query as soon as the IB port is up (the SM has moved the port to the active state), but the SA doesn't have a record for the port yet.  The SM should update the SA's topology before moving the ports to the active state for things to work properly.

When you say the port is up, do you mean PhysicalPortState or PortState?

At least for OpenSM, once the port is discovered by the SM, it would
be reported in an SA Get or GetTable PortInfoRecord. There is a window
between when the PhysicalPortState is LinkUp and when the SM discovers it.

-- Hal

> The reason disable/enable solves the issue is that by the time IPoIB is enabled again, the SA's topology matches the SM's (there's more of a delay with IPoIB being reported and the SM simultaneously bringing the HCA's port up).  You can get the same result just by disabling/enabling IPoIB.
>
> You could add a delay when the port first comes up and likely see things work properly.  Any such delay should really be implemented in IBAL or in the HCA driver, though ideally the SM would synchronize with the SA earlier.
>
> -Fab
> _______________________________________________
> ofw mailing list
> ofw at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>
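
As an illustration of the retry budget discussed above, here is a minimal sketch (not IPoIB or IBAL code) comparing the default schedule of ten 1-second timeouts with an exponential backoff that is capped at the same 10-second total. The function names, the 0.25-second starting timeout and the doubling factor are assumptions chosen for the example; only the 1 second / 10 retries / 10 seconds figures come from the thread.

```python
def fixed_schedule(timeout_s=1.0, retries=10):
    """Default-style schedule: the same timeout for every attempt."""
    return [timeout_s] * retries


def backoff_schedule(budget_s=10.0, first_timeout_s=0.25, factor=2.0):
    """Exponentially growing timeouts, truncated so they sum to budget_s.

    first_timeout_s and factor are hypothetical values for illustration.
    """
    schedule = []
    remaining = budget_s
    timeout = first_timeout_s
    while remaining > 0:
        # Never wait past the overall budget: clip the last attempt.
        step = min(timeout, remaining)
        schedule.append(step)
        remaining -= step
        timeout *= factor
    return schedule


if __name__ == "__main__":
    print("fixed:  ", fixed_schedule(), "total", sum(fixed_schedule()))
    print("backoff:", backoff_schedule(), "total", sum(backoff_schedule()))
```

Both schedules spend the same total time waiting; backoff only changes how the attempts are spaced inside that window. That matches the point made in the thread: backing off inside IPoIB or IBAL does not buy more time if the overall limit is imposed from outside.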


