[ofa-general] RE: Question on rdma_resolve_route and retries

Wed Jul 8 00:37:26 PDT 2009

>We are trying to use OpenMPI 1.3.2 with rdma_cm support on an Infiniband fabric
>using OFED 1.4.1.  When the MPI jobs get large enough, the event response to
>rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a status of
>ETIMEDOUT.

Yep - you pretty much need to connect out of band with all large MPI jobs using
made-up path data, or enable some sort of path record (PR) caching.

>It seems pretty clear that the SA path record requests are being synchronized
>and bunching together, and in the end exhausting the resources of the subnet
>manager node so only the first N are actually received.

In our testing, we discovered that the SA almost never dropped any queries.  The
problem was that the backlog grew so large that all requests had timed out
before they could be acted on.  There's probably something that could be done
here to avoid storing received MADs for extended periods of time.
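
To make the backlog effect concrete, here is a rough toy model in C.  It is
only a sketch under stated assumptions, not a description of any real SA:
every client sends one query at the same instant, resends a brand-new query
each time its timeout expires, and the SA serves a single FIFO queue at a
fixed cost per query.  The name simulate_sa, all of its parameters, and the
drop_stale switch (which assumes the SA somehow knows each query's deadline)
are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued query in the toy model. */
struct request { int client; long arrival_ms; long deadline_ms; };

/* Returns how many clients get an answer before their own timeout expires,
 * or -1 if memory allocation fails.  If drop_stale is nonzero, the SA
 * discards queries whose deadline has already passed at no cost (assumption). */
static int simulate_sa(int clients, long service_ms, long timeout_ms,
                       int rounds, int drop_stale)
{
    assert(clients > 0 && rounds > 0 && service_ms > 0 && timeout_ms > 0);

    struct request *q = malloc(sizeof(*q) * (size_t)clients * (size_t)rounds);
    char *resolved = calloc((size_t)clients, 1);
    size_t head = 0, tail = 0;
    long now = 0;
    int answered = 0;

    if (!q || !resolved) {
        free(q);
        free(resolved);
        return -1;
    }

    for (int r = 0; r < rounds; r++) {
        long start = r * timeout_ms, end = start + timeout_ms;

        /* Every client still waiting issues a new query for this round. */
        for (int c = 0; c < clients; c++)
            if (!resolved[c])
                q[tail++] = (struct request){ c, start, end };

        /* Serve the FIFO until this round's time window is used up. */
        while (head < tail) {
            struct request *rq = &q[head];

            if (now < rq->arrival_ms)
                now = rq->arrival_ms;
            if (drop_stale && now >= rq->deadline_ms) {
                head++;             /* discard without spending service time */
                continue;
            }
            if (now + service_ms > end)
                break;              /* would finish in a later round */
            now += service_ms;
            head++;
            if (now <= rq->deadline_ms && !resolved[rq->client]) {
                resolved[rq->client] = 1;
                answered++;
            }
        }
    }
    free(q);
    free(resolved);
    return answered;
}
```

With 1000 clients, 1 ms per query and a 100 ms timeout, this model answers
only the first 100 clients when stale queries stay queued, because everything
served afterwards has already expired.  With stale queries discarded, each
round answers another 100 clients.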

>The sequence seems to be:
>
>call librdmacm-1.0.8/src/cma.c's rdma_resolve_route
>
>which translates directly into a kernel call into infiniband/core/cma.c's
>rdma_resolve_route
>
>with an IB fabric becomes a call into cma_resolve_ib_route
>
>which leads to a call to cma_query_ib_route
>
>which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get with the
>callback pointing to cma_query_handler
>
>When cma_query_handler gets a callback with a bad status, it sets the returned
>event to RDMA_CM_EVENT_ROUTE_ERROR
>
>Nowhere in there do I see any retry attempts.  If the SA path record query
>packet, or its response packet, gets lost, then the timeout eventually happens
>and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.

The kernel sa_query module does not issue retries.  All retries are the
responsibility of the caller.  This gives greater flexibility to how timeouts
are handled, but has the drawback that all 'retries' are really new
transactions.
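
A caller-side retry loop could look roughly like the C sketch below.  It is a
minimal illustration, not code from librdmacm or Open MPI.  The callback type
resolve_attempt_fn, the helpers backoff_timeout_ms and resolve_with_retries,
and the negative-errno return convention are all assumptions.  In a real
client the callback would call rdma_resolve_route() with the given timeout and
wait for RDMA_CM_EVENT_ROUTE_RESOLVED or RDMA_CM_EVENT_ROUTE_ERROR.  The
fake_sa stub exists only so the sketch can be exercised without a fabric.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback: one complete route-resolution attempt using the
 * given timeout.  Returns 0 on success or a negative errno-style value
 * (assumed convention for this sketch). */
typedef int (*resolve_attempt_fn)(void *ctx, int timeout_ms);

/* Timeout for a 0-based attempt number: doubles per attempt, capped at max_ms. */
static int backoff_timeout_ms(int base_ms, int max_ms, int attempt)
{
    long t = base_ms;

    for (int i = 0; i < attempt && t < max_ms; i++)
        t *= 2;
    return t > max_ms ? max_ms : (int)t;
}

/* Retry only on -ETIMEDOUT; any other result is returned immediately.
 * Each retry is a brand-new transaction from the SA's point of view. */
static int resolve_with_retries(resolve_attempt_fn attempt_fn, void *ctx,
                                int base_ms, int max_ms, int max_attempts)
{
    int ret = -ETIMEDOUT;

    assert(attempt_fn && base_ms > 0 && max_ms >= base_ms && max_attempts > 0);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        ret = attempt_fn(ctx, backoff_timeout_ms(base_ms, max_ms, attempt));
        if (ret != -ETIMEDOUT)
            return ret;
    }
    return ret;
}

/* Demonstration stub only: times out a fixed number of times, then succeeds. */
struct fake_sa { int timeouts_left; int calls; int last_timeout_ms; };

static int fake_attempt(void *ctx, int timeout_ms)
{
    struct fake_sa *sa = ctx;

    sa->calls++;
    sa->last_timeout_ms = timeout_ms;
    if (sa->timeouts_left > 0) {
        sa->timeouts_left--;
        return -ETIMEDOUT;
    }
    return 0;
}
```

Growing the timeout on each attempt is one way to avoid piling fresh queries
on top of an SA that is already behind.  A caller would probably also want a
random delay before each retry so that the ranks do not resend in lockstep.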

>First question: Did I miss a retry buried somewhere in all of that?

I don't believe so.

>Second question: How does somebody come up with a timeout value that makes
>sense?  Assuming retries are the responsibility of the rdma_resolve_route
>caller, you would like to have a value that is long enough to avoid false
>timeouts when a response is eventually going to make it, but not any longer.
>This value seems like it would be dependent on the fabric and the capabilities
>of the node running the subnet manager, and should be a fabric-specific
>parameter instead of something chosen at random by each caller of
>rdma_resolve_route.

The timeout is also dependent on the load hitting the SA.  I don't know that a
fabric-specific parameter can work.
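
One alternative to a fixed parameter would be for each caller to adapt its
timeout to the response times it actually observes.  The C sketch below
borrows the smoothed round-trip estimator that TCP uses for its retransmission
timer (RFC 6298).  Nothing like this exists in the rdma_cm code discussed
here, and whether it would behave well against a loaded SA is untested.  The
struct timeout_estimator and both functions are hypothetical names.

```c
#include <assert.h>

/* Hypothetical per-caller state: smoothed response time and its variation. */
struct timeout_estimator {
    double srtt_ms;     /* smoothed response time */
    double rttvar_ms;   /* smoothed deviation from srtt_ms */
    int have_sample;    /* 0 until the first measurement arrives */
};

/* Fold in one measured query response time, using the RFC 6298 gains
 * (1/8 for the mean, 1/4 for the deviation). */
static void estimator_update(struct timeout_estimator *e, double sample_ms)
{
    assert(e && sample_ms >= 0.0);

    if (!e->have_sample) {
        e->srtt_ms = sample_ms;
        e->rttvar_ms = sample_ms / 2.0;
        e->have_sample = 1;
        return;
    }

    double err = e->srtt_ms - sample_ms;
    if (err < 0.0)
        err = -err;
    e->rttvar_ms = 0.75 * e->rttvar_ms + 0.25 * err;
    e->srtt_ms = 0.875 * e->srtt_ms + 0.125 * sample_ms;
}

/* Suggested timeout: mean plus four deviations, rounded to the nearest
 * millisecond and capped at max_ms.  Before any sample exists, fall back
 * to initial_ms. */
static int estimator_timeout_ms(const struct timeout_estimator *e,
                                int initial_ms, int max_ms)
{
    assert(e && initial_ms > 0 && max_ms > 0);

    if (!e->have_sample)
        return initial_ms;

    double t = e->srtt_ms + 4.0 * e->rttvar_ms;
    if (t > (double)max_ms)
        t = (double)max_ms;
    return (int)(t + 0.5);
}
```

The resulting value could be passed as the timeout_ms argument of
rdma_resolve_route().  Only queries that actually get answered produce
samples, so a caller would still need some backoff when every query times out.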

- Sean