[ofa-general] Re: Question on rdma_resolve_route and retries

David McMillen davem at systemfabricworks.com
Wed Jul 8 00:57:42 PDT 2009


Thanks for the information -- I have some follow-on inline below.

On Wed, Jul 8, 2009 at 2:37 AM, Sean Hefty <sean.hefty at intel.com> wrote:

> >We are trying to use OpenMPI 1.3.2 with rdma_cm support on an Infiniband
> fabric
> >using OFED 1.4.1.  When the MPI jobs get large enough, the event response
> to
> >rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a status of
> >ETIMEDOUT.
>
> Yep - you pretty much need to connect out of band with all large MPI jobs
> using
> made up path data, or enable some sort of PR caching.


I should have mentioned that the fabric is a large torus using LASH routing,
and we need to get the live SL value to make deadlock-free connections.  We
are definitely thinking about PR caching, but that raises issues about how
to manage the life of the cache entries.
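To make the lifetime question concrete, here is a minimal sketch of what we
have in mind: each cached record carries its own expiry, and lookups treat
stale entries as misses.  The entry fields are a simplified stand-in for a
full SA path record, and all names here are hypothetical, not from any
existing cache implementation.

```c
/* Sketch of a path-record cache with per-entry lifetime.
 * The fields are a simplified stand-in for a full SA path record
 * (we really only need the live SL for LASH); names are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define CACHE_SLOTS 256

struct pr_cache_entry {
    uint16_t dlid;     /* destination LID, used as the lookup key */
    uint8_t  sl;       /* service level from the live path record */
    time_t   expires;  /* absolute expiry time; 0 means empty slot */
};

static struct pr_cache_entry pr_cache[CACHE_SLOTS];

/* Insert (or refresh) an entry, valid for ttl_sec seconds from 'now'. */
void pr_cache_put(uint16_t dlid, uint8_t sl, time_t now, time_t ttl_sec)
{
    struct pr_cache_entry *e = &pr_cache[dlid % CACHE_SLOTS];
    e->dlid = dlid;
    e->sl = sl;
    e->expires = now + ttl_sec;
}

/* Return the cached entry, or NULL if absent or expired. */
struct pr_cache_entry *pr_cache_get(uint16_t dlid, time_t now)
{
    struct pr_cache_entry *e = &pr_cache[dlid % CACHE_SLOTS];
    if (e->expires == 0 || e->dlid != dlid || now >= e->expires)
        return NULL;
    return e;
}
```

The open question is what ttl_sec should be: too short and we hammer the SA
again; too long and we keep stale routes after a re-sweep.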


>
>
> >It seems pretty clear that the SA path record requests are being
> synchronized
> >and bunching together, and in the end exhausting the resources of the
> subnet
> >manager node so only the first N are actually received.
>
> In our testing, we discovered that the SA almost never dropped any queries.
>  The
> problem was that the backlog grew so huge, that all requests had timed out
> before they could be acted on.  There's probably something that could be
> done
> here to avoid storing received MADs for extended periods of time.


This is encouraging.  I did try testing with 10,000 ms timeouts and still
got the failure with only 800 different processes, so I jumped to the
conclusion that the queries were being dropped.  Do you have a guess as to a
timeout value that would always succeed?

Also, your testing suggests that the receive queue almost never gets
exhausted.  At least as I understand things, if the queue ends up empty then
the HCA can dump packets at great speed.  How does the system cope with a
potential stream of requests arriving less than half a microsecond apart?
(I should have mentioned that the fabric is QDR.)  I guess this is another
way of asking my question about how to maximize the ability of the subnet
manager node to accept requests.


>
>
> >The sequence seems to be:
> >
> >call librdmacm-1.0.8/src/cma.c's rdma_resolve_route
> >
> >which translates directly into a kernel call into infiniband/core/cma.c's
> >rdma_resolve_route
> >
> >with an IB fabric becomes a call into cma_resolve_ib_route
> >
> >which leads to a call to cma_query_ib_route
> >
> >which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get with
> the
> >callback pointing to cma_query_handler
> >
> >When cma_query_handler gets a callback with a bad status, it sets the
> returned
> >event to RDMA_CM_EVENT_ROUTE_ERROR
> >
> >Nowhere in there do I see any retry attempts.  If the SA path record query
> >packet, or its response packet, gets lost, then the timeout eventually
> happens
> >and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.
>
> The kernel sa_query module does not issue retries.  All retries are the
> responsibility of the caller.  This gives greater flexibility to how
> timeouts
> are handled, but has the drawback that all 'retries' are really new
> transactions.
>
> >First question: Did I miss a retry buried somewhere in all of that?
>
> I don't believe so.


Thanks for the confirmation.  Several people have told me that it is in
there, but I couldn't find it.
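So the retry loop has to live in the application.  A sketch of what that
looks like is below; resolve_fn stands in for a wrapper that calls
rdma_resolve_route() and waits for the ROUTE_RESOLVED or ROUTE_ERROR event,
and the demo stub is purely illustrative (there is no such helper in
librdmacm).

```c
/* Caller-side retry sketch: the kernel sa_query module issues no
 * retries, so the application re-runs the whole resolve step itself.
 * Each "retry" is a brand-new transaction, as Sean describes. */
#include <errno.h>

int resolve_with_retries(int (*resolve_fn)(void *ctx, int timeout_ms),
                         void *ctx, int timeout_ms, int max_attempts)
{
    int rc = -ETIMEDOUT;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        rc = resolve_fn(ctx, timeout_ms);
        if (rc != -ETIMEDOUT)   /* success, or a non-timeout error */
            break;
    }
    return rc;
}

/* Demonstration stub only: times out twice, then succeeds. */
static int demo_calls;
static int demo_resolve(void *ctx, int timeout_ms)
{
    (void)ctx; (void)timeout_ms;
    return ++demo_calls < 3 ? -ETIMEDOUT : 0;
}
```

Of course this just pushes the problem down a level: every retry is another
query landing on the SA's backlog.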


>
>
> >Second question: How does somebody come up with a timeout value that makes
> >sense?  Assuming retries are the responsibility of the rdma_resolve_route
> >caller, you would like to have a value that is long enough to avoid false
> >timeouts when a response is eventually going to make it, but not any
> longer.
> >This value seems like it would be dependent on the fabric and the
> capabilities
> >of the node running the subnet manager, and should be a fabric-specific
> >parameter instead of something chosen at random by each caller of
> >rdma_resolve_route.
>
> The timeout is also dependent on the load hitting the SA.  I don't know
> that a
> fabric-specific parameter can work.


Maybe I should have come up with a better name.  By fabric-specific, I meant
a specific implementation of the fabric, including the capability of the
subnet manager node.  How does somebody writing rdma_cm code come up with a
number?  That particular program might not put much of a load on the SA, but
could run concurrently with other jobs that do (or don't).  It would be nice
to have a way to set up the retry mechanism so that it would work on any
system it ran on.
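The closest thing I can see to a portable answer is to not pick one number
at all: start with a small timeout and grow it geometrically on each retry,
up to a cap.  A lightly loaded SA answers on the first cheap attempt, and a
query storm eventually gets a timeout long enough to outlast the backlog.
The constants below are illustrative, not tuned values.

```c
/* Sketch: per-attempt timeout that doubles from base_ms up to cap_ms,
 * so the same code behaves sanely on any fabric/SA combination. */
#include <stdint.h>

/* Timeout (ms) to use for retry attempt 'attempt' (0-based). */
uint32_t route_timeout_ms(unsigned attempt, uint32_t base_ms, uint32_t cap_ms)
{
    uint64_t t = base_ms;   /* 64-bit to avoid overflow while doubling */
    while (attempt-- > 0) {
        t *= 2;
        if (t >= cap_ms)
            return cap_ms;
    }
    return (uint32_t)t;
}
```

For example, with base_ms = 250 and cap_ms = 16000 the schedule is 250, 500,
1000, ... capped at 16 seconds, which at least bounds the worst case without
requiring every caller to guess the fabric size up front.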


>
>
> - Sean
>