Thanks for the information -- I have some follow-up questions inline below.<br><br><div class="gmail_quote">On Wed, Jul 8, 2009 at 2:37 AM, Sean Hefty <span dir="ltr">&lt;<a href="mailto:sean.hefty@intel.com">sean.hefty@intel.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">>We are trying to use OpenMPI 1.3.2 with rdma_cm support on an Infiniband fabric<br>


>using OFED 1.4.1.  When the MPI jobs get large enough, the event response to<br>

>rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a status of<br>

>ETIMEDOUT.<br>

<br>

</div>Yep - you pretty much need to connect out of band with all large MPI jobs using<br>

made up path data, or enable some sort of PR caching.</blockquote><div><br>I should have mentioned that the fabric is a large torus using LASH routing, and we need to get the live SL value to make deadlock-free connections.  We are definitely thinking about PR caching, but that raises the issue of how to manage the lifetime of the cache entries.<br>

 <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>

<div class="im"><br>

>It seems pretty clear that the SA path record requests are being synchronized<br>

>and bunching together, and in the end exhausting the resources of the subnet<br>

>manager node so only the first N are actually received.<br>

<br>

</div>In our testing, we discovered that the SA almost never dropped any queries.  The<br>

problem was that the backlog grew so huge that all requests had timed out<br>

before they could be acted on.  There's probably something that could be done<br>

here to avoid storing received MADs for extended periods of time.</blockquote><div><br>This is encouraging.  I did try testing with 10,000 ms timeouts and still got the failure with only 800 different processes, so I jumped to the conclusion that the queries were being dropped.  Do you have a guess as to a timeout value that would always succeed?<br>

<br>Also, your testing suggests that the receive queue almost never gets exhausted.  At least as I understand things, if the queue ends up empty then the HCA can drop packets at great speed.  How does the system cope with a potential stream of requests arriving less than half a microsecond apart?  (I should have mentioned that the fabric is QDR.)  I guess this is another way of asking my question about how to maximize the ability of the subnet manager node to accept requests.<br>

 <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>

<div class="im"><br>

>The sequence seems to be:<br>

><br>

>call librdmacm-1.0.8/src/cma.c's rdma_resolve_route<br>

><br>

>which translates directly into a kernel call into infiniband/core/cma.c's<br>

>rdma_resolve_route<br>

><br>

>which, with an IB fabric, becomes a call into cma_resolve_ib_route<br>

><br>

>which leads to a call to cma_query_ib_route<br>

><br>

>which gets to calling infiniband/core/sa_query.c's ib_sa_path_rec_get with the<br>

>callback pointing to cma_query_handler<br>

><br>

>When cma_query_handler gets a callback with a bad status, it sets the returned<br>

>event to RDMA_CM_EVENT_ROUTE_ERROR<br>

><br>

>Nowhere in there do I see any retry attempts.  If the SA path record query<br>

>packet, or its response packet, gets lost, then the timeout eventually happens<br>

>and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.<br>

<br>

</div>The kernel sa_query module does not issue retries.  All retries are the<br>

responsibility of the caller.  This gives greater flexibility to how timeouts<br>

are handled, but has the drawback that all 'retries' are really new<br>

transactions.<br>

<div class="im"><br>

>First question: Did I miss a retry buried somewhere in all of that?<br>

<br>

</div>I don't believe so.</blockquote><div><br>Thanks for the confirmation.  Several people have told me that it is in there, but I couldn't find it.<br> <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

<div class="im"><br>

>Second question: How does somebody come up with a timeout value that makes<br>

>sense?  Assuming retries are the responsibility of the rdma_resolve_route<br>

>caller, you would like to have a value that is long enough to avoid false<br>

>timeouts when a response is eventually going to make it, but not any longer.<br>

>This value seems like it would be dependent on the fabric and the capabilities<br>

>of the node running the subnet manager, and should be a fabric-specific<br>

>parameter instead of something chosen at random by each caller of<br>

>rdma_resolve_route.<br>

<br>

</div>The timeout is also dependent on the load hitting the SA.  I don't know that a<br>

fabric-specific parameter can work.</blockquote><div><br>Maybe I should have come up with a better name.  By fabric-specific, I meant a specific implementation of the fabric, including the capability of the subnet manager node.  How does somebody writing rdma_cm code come up with a number?  That particular program might not put much of a load on the SA, but could run concurrently with other jobs that do (or don't).  It would be nice to have a way to set up the retry mechanism so that it would work on any system it ran on.<br>
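<br>To make the retry question concrete, here is a minimal sketch of one possible caller-side retry loop. It assumes a growing timeout with a cap, plus a random delay before each retry so that many processes do not resubmit their queries at the same instant. Everything in it is hypothetical: resolve_fn stands in for a wrapper around rdma_resolve_route plus waiting for the resulting event, fail_first_n is only a stand-in for a busy SA, and the base, cap and attempt-count values are placeholders, not recommendations.<br>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: perform ONE resolve attempt with the given timeout.
 * Returns 0 on success, nonzero (e.g. ETIMEDOUT) on failure. In a real
 * rdma_cm program this would wrap rdma_resolve_route() and wait for
 * RDMA_CM_EVENT_ROUTE_RESOLVED or RDMA_CM_EVENT_ROUTE_ERROR. */
typedef int (*resolve_fn)(int timeout_ms, void *ctx);

/* Timeout for attempt n (0-based): base_ms doubled per attempt, capped. */
static int backoff_timeout_ms(int base_ms, int cap_ms, int attempt)
{
    long t = base_ms;
    for (int i = 0; i < attempt && t < cap_ms; i++)
        t *= 2;
    return t > cap_ms ? cap_ms : (int)t;
}

/* Small linear congruential generator so the sketch has no dependencies. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/* Retry attempt_fn up to max_attempts times. Before every retry, wait a
 * random time in [0, timeout] via sleep_ms (may be NULL to skip waiting)
 * to spread out queries that would otherwise arrive in a burst.
 * Returns 0 on success, otherwise the status of the last attempt. */
static int resolve_with_retry(resolve_fn attempt_fn, void *ctx,
                              int base_ms, int cap_ms, int max_attempts,
                              uint32_t *rng, void (*sleep_ms)(int))
{
    int rc = -1;

    assert(attempt_fn != NULL && base_ms > 0 && cap_ms >= base_ms);
    for (int n = 0; n < max_attempts; n++) {
        int timeout = backoff_timeout_ms(base_ms, cap_ms, n);

        if (n > 0 && sleep_ms != NULL)
            sleep_ms((int)(next_rand(rng) % (uint32_t)(timeout + 1)));
        rc = attempt_fn(timeout, ctx);
        if (rc == 0)
            return 0;
    }
    return rc;
}

/* Stand-in for an overloaded SA: fails the first *ctx attempts. */
static int fail_first_n(int timeout_ms, void *ctx)
{
    int *remaining = ctx;

    (void)timeout_ms;
    if (*remaining > 0) {
        (*remaining)--;
        return ETIMEDOUT;
    }
    return 0;
}
```

Because the kernel sa_query module does not retry, each call to the hook is a new transaction, which is why the sketch treats every attempt independently. It still does not answer what the base timeout should be on a given fabric.<br>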

 <br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>

<font color="#888888"><br>

- Sean<br>

</font></blockquote></div><br>