<br>We are trying to use Open MPI 1.3.2 with rdma_cm support on an InfiniBand fabric using OFED 1.4.1. When the MPI jobs get large enough, the event returned in response to rdma_resolve_route becomes RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.<br>
<br>It seems pretty clear that the SA path record requests are being synchronized and bunched together, eventually exhausting the resources of the subnet manager node so that only the first N are actually received.<br><br>
The sequence seems to be:<br><br>call librdmacm-1.0.8/src/cma.c's rdma_resolve_route<br><br>which translates directly into a kernel call into infiniband/core/cma.c's rdma_resolve_route<br><br>which, with an IB fabric, becomes a call into cma_resolve_ib_route<br>
<br>which leads to a call to cma_query_ib_route<br><br>which ends up calling infiniband/core/sa_query.c's ib_sa_path_rec_get with the callback pointing to cma_query_handler<br><br>When cma_query_handler gets a callback with a bad status, it sets the returned event to RDMA_CM_EVENT_ROUTE_ERROR<br>
<br>Nowhere in there do I see any retry attempts. If the SA path record query packet, or its response packet, gets lost, then the timeout eventually expires and we see RDMA_CM_EVENT_ROUTE_ERROR with a status of ETIMEDOUT.<br>
<br>First question: Did I miss a retry buried somewhere in all of that?<br><br>Second question: How does somebody come up with a timeout value that makes sense? Assuming retries are the responsibility of the rdma_resolve_route caller, you would like to have a value that is long enough to avoid false timeouts when a response is eventually going to make it, but not any longer. This value seems like it would be dependent on the fabric and the capabilities of the node running the subnet manager, and should be a fabric-specific parameter instead of something chosen at random by each caller of rdma_resolve_route.<br>
<br>There is probably some interesting discussion to have around the amount of time that the rdma_resolve_route caller should wait after the failure before retrying, so that time could be added to the base timeout and simplify the processing. This duration might also be different for each node and iteration of the retry in an attempt to avoid wave after wave of multiple requestors overwhelming the subnet manager.<br>
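<br>To make the idea of a per-node, per-iteration delay concrete, here is a minimal sketch of capped exponential backoff with random jitter. It only computes how long a caller might wait before calling rdma_resolve_route again; it does not touch librdmacm. The helper names (backoff_window_ms, retry_delay_ms), the xorshift generator, and the base/cap values are illustrative assumptions, not anything taken from OFED or Open MPI.<br>

```c
#include <stdint.h>

/* Hypothetical helper: upper bound of the wait window for a given retry.
 * The window starts at base_ms and doubles per attempt, capped at max_ms. */
static uint32_t backoff_window_ms(unsigned int attempt,
                                  uint32_t base_ms, uint32_t max_ms)
{
    uint32_t window = base_ms < max_ms ? base_ms : max_ms;

    for (unsigned int i = 0; i < attempt && window < max_ms; i++) {
        /* Doubling would exceed the cap (or overflow): clamp instead. */
        window = (window > max_ms / 2) ? max_ms : window * 2;
    }
    return window;
}

/* Small xorshift32 generator so each node can carry its own state.
 * Assumption: the caller seeds *state with something node-specific
 * (for example a value derived from the local LID or the process id). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state ? *state : 0x9E3779B9u; /* state must be nonzero */

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: delay before retry number `attempt` (0-based),
 * drawn from [0, window] so that nodes which failed at the same moment
 * do not all retry at the same moment. */
static uint32_t retry_delay_ms(unsigned int attempt, uint32_t base_ms,
                               uint32_t max_ms, uint32_t *state)
{
    uint32_t window = backoff_window_ms(attempt, base_ms, max_ms);
    uint32_t r = xorshift32(state);

    if (window == UINT32_MAX)
        return r;               /* avoid overflow in window + 1 */
    return r % (window + 1);
}
```

A caller would sleep for retry_delay_ms(attempt, base, cap, &seed) milliseconds after each RDMA_CM_EVENT_ROUTE_ERROR and then call rdma_resolve_route again, giving up after some maximum number of attempts. Sensible values for the base, the cap, and the attempt limit are exactly the fabric-specific parameters in question here.<br>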
<br>There is also the question of how many times this needs to be repeated before the rdma_resolve_route caller declares complete failure. Perhaps this is also a fabric-specific parameter?<br><br>Finally, is there some way to tune the subnet manager node so that the number of requests that can be captured and processed is maximized?<br>
<br>Thanks for any help or ideas.<br><br>Dave McMillen<br><br><br><br>