[openib-general] ucma into kernel.org

Or Gerlitz ogerlitz at voltaire.com
Thu Jul 6 01:05:43 PDT 2006


Michael S. Tsirkin wrote:
> It's a problem, I agree, but hard-coding timeouts still does not make sense
> to me - I honestly don't see how will an application know which value to
> use here, since the roundtrip really depends on the topology.

> Any ideas on how this can be handled correctly? Does CMA at least back off
> exponentially on timeout?

From our experience on a cluster on the order of 1K nodes, we did not
have issues with CM traffic. However, the CM traffic was not NxN but
rather NxM, where N was (say) 1K and M was (say) 16. The app was a cluster
file system: Lustre over VIBNAL, the Lustre IB layer for the Voltaire
gen1 stack.

As for NxN CM/CMA consumers, I recall it being mentioned on this list
that CM timeouts/retries had to be changed to get (say) N=128 nodes
(ranks?) operating fine with Intel MPI over uDAPL.
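
To make the timeout/retries trade-off a bit more concrete, here is a rough
sketch of how the worst-case wait for a single request grows with fixed
versus exponentially backed-off timeouts. This is not CM or CMA code; the
function and parameter names (retry_schedule, base_timeout_ms, max_retries,
backoff, cap_ms) are hypothetical and the numbers are illustrative only.

```python
def retry_schedule(base_timeout_ms, max_retries, backoff=2.0, cap_ms=None):
    """Return the per-attempt timeouts (in ms) for one request.

    The first entry is the initial send; the remaining max_retries
    entries are retries. All names here are hypothetical and do not
    correspond to actual CM/CMA fields.
    """
    schedule = []
    timeout = float(base_timeout_ms)
    for _ in range(max_retries + 1):
        if cap_ms is not None:
            # Optionally bound the timeout so backoff cannot grow forever.
            timeout = min(timeout, cap_ms)
        schedule.append(timeout)
        timeout *= backoff  # backoff=1.0 gives a fixed timeout
    return schedule


def worst_case_wait_ms(schedule):
    """Total time spent if every attempt in the schedule times out."""
    return sum(schedule)


if __name__ == "__main__":
    fixed = retry_schedule(1000, 3, backoff=1.0)
    expo = retry_schedule(1000, 3, backoff=2.0)
    print("fixed:", fixed, "total:", worst_case_wait_ms(fixed))
    print("expo: ", expo, "total:", worst_case_wait_ms(expo))
```

With a fixed timeout, every retry is sent into the same congestion window;
with backoff, later retries give a loaded peer progressively more time to
answer, at the cost of a longer worst-case failure detection time.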

Sean - have you been in the loop on the analysis/debugging at that site?

Can you confirm that **this** was the issue that broke the setup, and
that it started working once you enlarged/changed things (what exactly,
and from which value to which value)?

Without a relevant use case (one that actually fails), I don't think
there's a need to spend energy on code that generates the correct
timeouts/retries for this or that setting.

Or.







