[ofa-general] Question on rdma_resolve_route and retries

Wed Jul 8 00:18:01 PDT 2009

We are trying to use OpenMPI 1.3.2 with rdma_cm support on an InfiniBand
fabric using OFED 1.4.1.  When the MPI jobs get large enough, the event
returned for rdma_resolve_route is RDMA_CM_EVENT_ROUTE_ERROR with a
status of ETIMEDOUT.

It seems pretty clear that the SA path record requests are being
synchronized and bunching together, and in the end exhausting the resources
of the subnet manager node so only the first N are actually received.

The sequence seems to be:

call librdmacm-1.0.8/src/cma.c's rdma_resolve_route

which translates directly into a kernel call into infiniband/core/cma.c's
rdma_resolve_route

which, on an IB fabric, becomes a call into cma_resolve_ib_route

which leads to a call to cma_query_ib_route

which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get with
the callback pointing to cma_query_handler

When cma_query_handler gets a callback with a bad status, it sets the
returned event to RDMA_CM_EVENT_ROUTE_ERROR

Nowhere in there do I see any retry attempts.  If the SA path record query
packet, or its response packet, gets lost, then the timeout eventually
happens and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.

First question: Did I miss a retry buried somewhere in all of that?

Second question: How does somebody come up with a timeout value that makes
sense?  Assuming retries are the responsibility of the rdma_resolve_route
caller, you would like to have a value that is long enough to avoid false
timeouts when a response is eventually going to make it, but not any
longer.  This value seems like it would be dependent on the fabric and the
capabilities of the node running the subnet manager, and should be a
fabric-specific parameter instead of something chosen at random by each
caller of rdma_resolve_route.

There is probably some interesting discussion to have around the amount of
time that the rdma_resolve_route caller should wait after the failure before
retrying; that wait could simply be folded into the base timeout to simplify
the processing.  This duration might also be different for each node and
each iteration of the retry, in an attempt to avoid wave after wave of
requestors overwhelming the subnet manager.
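
One way to make the wait differ per node and per retry is a randomized,
exponentially growing delay.  The following is only a sketch under stated
assumptions: backoff_seed and retry_delay_ms are hypothetical helper names,
base_ms and cap_ms are made-up tunables, and the node identifier could be an
MPI rank or a LID.  None of this exists in librdmacm or OpenMPI.

```c
#include <stdint.h>

/* Hypothetical helper: derive a nonzero per-node PRNG seed from some
 * node identifier (for example an MPI rank or a LID). */
uint32_t backoff_seed(uint32_t node_id)
{
    return (node_id * 2654435761u) | 1u;   /* forcing bit 0 keeps it nonzero */
}

/* xorshift32 step; the state must be nonzero. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: delay in milliseconds before retry number
 * `attempt` (0-based).  The upper bound doubles with every attempt up to
 * cap_ms, and the actual delay is drawn from [0, bound], so nodes that
 * failed together are unlikely to retry together.  The modulo introduces
 * a slight bias, which does not matter for this purpose. */
uint32_t retry_delay_ms(uint32_t base_ms, uint32_t cap_ms,
                        unsigned attempt, uint32_t *rng_state)
{
    uint64_t bound = (uint64_t)base_ms << (attempt < 16 ? attempt : 16);

    if (bound > cap_ms)
        bound = cap_ms;
    if (bound == 0)
        return 0;
    return (uint32_t)(next_rand(rng_state) % (bound + 1));
}
```

Each node would seed once with backoff_seed and then call retry_delay_ms
with an increasing attempt count after every ROUTE_ERROR/ETIMEDOUT event.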

There is also the question of how many times this needs to be repeated
before the rdma_resolve_route caller declares complete failure.  Perhaps
this is also a fabric-specific parameter?
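
Assuming retries really are left to the caller, a bounded retry loop might
look roughly like the sketch below.  Everything here is hypothetical:
resolve_fn stands in for "call rdma_resolve_route, then wait for
RDMA_CM_EVENT_ROUTE_RESOLVED or RDMA_CM_EVENT_ROUTE_ERROR", the negative-errno
return convention is an assumption of this sketch, and fake_sa/fake_resolve
only imitate a subnet manager that drops the first few queries.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: attempt one route resolution with the given
 * timeout.  Returns 0 on success or a negative errno such as -ETIMEDOUT. */
typedef int (*resolve_fn)(void *ctx, int timeout_ms);

/* Hypothetical callback: wait before the next attempt (may be NULL). */
typedef void (*pause_fn)(void *ctx, unsigned attempt);

/* Retry only on timeout, at most max_attempts times in total.  Any other
 * result (success or a different error) is returned immediately. */
int resolve_with_retries(resolve_fn resolve, pause_fn pause, void *ctx,
                         int timeout_ms, unsigned max_attempts)
{
    int rc = -EINVAL;   /* returned if max_attempts is 0 */

    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        rc = resolve(ctx, timeout_ms);
        if (rc != -ETIMEDOUT)
            return rc;
        if (pause != NULL && attempt + 1 < max_attempts)
            pause(ctx, attempt);
    }
    return rc;
}

/* Stand-in for an overloaded subnet manager: the first drops_left
 * queries time out, later ones succeed. */
struct fake_sa {
    int drops_left;
    int calls;
};

int fake_resolve(void *ctx, int timeout_ms)
{
    struct fake_sa *sa = ctx;

    (void)timeout_ms;   /* the stand-in ignores the timeout */
    sa->calls++;
    if (sa->drops_left > 0) {
        sa->drops_left--;
        return -ETIMEDOUT;
    }
    return 0;
}
```

With a fake_sa that drops two queries, resolve_with_retries succeeds on the
third call; with more drops than max_attempts it gives up and returns
-ETIMEDOUT.  Whether max_attempts belongs to the caller or to the fabric is
exactly the open question above.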

Finally, is there some way to tune the subnet manager node so that the
number of requests that can be captured and processed is maximized?

Thanks for any help or ideas.

Dave McMillen