<br>I am pleased to see the discussion this has raised about possible changes to the underlying kernel services. However, this user-mode patch addresses a real problem in the stack as it stands, and I'd like to see agreement that it should be applied. The latest version of the patch (V3) makes the default action behave fairly well, and adds command-line switches that allow complete control over the timeouts and retries.<br>
<br>I don't know enough about the kernel-level code to comment on the best way to improve it, but I would like to make an observation from the user level. User-level code has no way to make an informed decision about the proper value for timeouts or retry counts. If you are writing some code (MPI comes to mind) that you know will impose a difficult load and therefore requires special tuning, that load actually impacts all other user-level code running on the fabric, yet nothing allows that special tuning to be known or shared. I think those kinds of settings really need to be taken over by system-wide parameters in the kernel, so much so that perhaps some consideration should be given to ignoring the timeout parameters on rdma_resolve_addr and rdma_resolve_route in favor of kernel parameters.<br>
<br>Dave<br><br>