[ofa-general] RE: Question on rdma_resolve_route and retries

Wed Jul 8 11:14:03 PDT 2009

>This is encouraging.  I did try testing with 10,000 ms timeouts and still got
>the failure with only 800 different processes, so I jumped to the conclusion
>that the queries were being dropped.  Do you have a guess as to a timeout value
>that would always succeed?

We ended up around a 60 second timeout based on the number of connections and
how quickly our SM node could process queries.  This was done a while ago, and
there have been a lot of improvements to opensm since then.  I don't know of an
easy way to test the performance of the SM.  It's also possible that our test
staggered the queries just enough that the SM could keep up receiving them.

>Maybe I should have come up with a better name.  By fabric-specific, I meant a
>specific implementation of the fabric, including the capability of the subnet
>manager node.  How does somebody writing rdma_cm code come up with a number?
>That particular program might not put much of a load on the SA, but could run
>concurrently with other jobs that do (or don't).  It would be nice to have a
>way to set up the retry mechanism so that it would work on any system it ran
>on.

Maybe the SA service could track the SA response time and adjust the timeout
accordingly.  E.g. guess = 0.2 * (last response) + 0.8 * (last guess).  Users
could then indicate that this default timeout should be used.
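
To make the running-average idea concrete, here is a minimal sketch in C.
Nothing like this exists in the SA service today as far as this thread says:
the struct, the function names, the 2x safety margin and the min/max clamp
are all assumptions made for illustration.  Only the .2/.8 weighting comes
from the formula above.

```c
#include <assert.h>

/* Hypothetical estimator state; not part of any existing SA or rdma_cm API. */
struct sa_timeout_est {
    double guess_ms;  /* smoothed estimate of the SA response time */
    double min_ms;    /* assumed lower bound on the returned timeout */
    double max_ms;    /* assumed upper bound on the returned timeout */
};

static void sa_est_init(struct sa_timeout_est *e, double initial_ms,
                        double min_ms, double max_ms)
{
    assert(min_ms > 0.0 && min_ms <= max_ms);
    e->guess_ms = initial_ms;
    e->min_ms = min_ms;
    e->max_ms = max_ms;
}

/* guess = .2(last response) + .8(last guess) */
static void sa_est_update(struct sa_timeout_est *e, double response_ms)
{
    e->guess_ms = 0.2 * response_ms + 0.8 * e->guess_ms;
}

/*
 * Timeout for the next query: the estimate times an assumed 2x safety
 * margin, clamped to [min_ms, max_ms] and rounded to whole milliseconds.
 */
static int sa_est_timeout_ms(const struct sa_timeout_est *e)
{
    double t = 2.0 * e->guess_ms;

    if (t < e->min_ms)
        t = e->min_ms;
    if (t > e->max_ms)
        t = e->max_ms;
    return (int)(t + 0.5);
}
```

A caller would feed each measured response time into sa_est_update() and pass
the result of sa_est_timeout_ms() wherever it currently passes a fixed
timeout.  A query that times out gives no response time at all, so how to
count those is left open here.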

Apps could also help by staggering their start times to avoid hitting the SA
with hundreds of thousands of queries at once.
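
One simple way to stagger, sketched in C below, is to spread the first query
of each process evenly over a fixed window based on its rank in the job.
The helper name, the notion of a rank, and the window length are assumptions
for illustration; an app without ranks could use a random delay instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how long process `rank` (0-based, out of `nprocs`)
 * should wait before issuing its first query, so that all first queries
 * are spread evenly across `window_ms` milliseconds.
 */
static uint32_t stagger_delay_ms(uint32_t rank, uint32_t nprocs,
                                 uint32_t window_ms)
{
    assert(nprocs > 0 && rank < nprocs);
    /* 64-bit intermediate avoids overflow for large ranks and windows. */
    return (uint32_t)(((uint64_t)rank * window_ms) / nprocs);
}
```

With 800 processes and a 10,000 ms window, rank 400 would wait 5000 ms.  The
app would sleep for the returned delay before resolving its route.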

- Sean