[openib-general] ucma into kernel.org
Or Gerlitz
ogerlitz at voltaire.com
Thu Jul 6 01:05:43 PDT 2006
Michael S. Tsirkin wrote:
> It's a problem, I agree, but hard-coding timeouts still does not make sense
> to me - I honestly don't see how will an application know which value to
> use here, since the roundtrip really depends on the topology.
> Any ideas on how this can be handled correctly? Does CMA at least back off
> exponentially on timeout?
In our experience on clusters on the order of 1K nodes, we did not have
issues with CM traffic. However, the CM traffic was not NxN but rather
NxM, where N was (say) 1K and M was (say) 16. The application was a
cluster file system: Lustre over VIBNAL, the Lustre IB layer for the
Voltaire gen1 stack.
As for NxN CM/CMA consumers, I recall it being mentioned on this list
that CM timeouts/retries had to be changed to get (say) N=128 nodes
(ranks?) operating properly with Intel MPI using uDAPL.
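
To put rough numbers on the timeouts/retries mentioned above, here is a
minimal sketch. It assumes the IB CM convention of encoding a timeout as
an exponent (4.096 usec * 2^exponent), and it simplifies by ignoring
packet lifetime and by assuming every retry waits the same interval. The
function names and the example exponent/retry values are illustrative
only, not taken from any particular stack.

```python
# Sketch: how long a CM request may wait before the attempt is given up.
# Assumption: timeouts are encoded as an exponent, 4.096 usec * 2**exponent.
# Assumption: each retry waits the same interval (no backoff, no packet lifetime).

def cm_timeout_seconds(exponent):
    """Convert an encoded timeout exponent to seconds."""
    return 4.096e-6 * (2 ** exponent)


def worst_case_wait_seconds(exponent, retries):
    """Initial send plus `retries` resends, each waiting the full timeout."""
    return cm_timeout_seconds(exponent) * (retries + 1)


if __name__ == "__main__":
    # Hypothetical (exponent, retries) pairs, chosen only to show the scaling.
    for exponent, retries in [(18, 3), (20, 3), (20, 15)]:
        total = worst_case_wait_seconds(exponent, retries)
        print(f"exponent={exponent} retries={retries}: {total:.1f} s")
```

Each step of the exponent doubles the per-attempt wait, so a small change
in the encoded value moves the total a lot more than one extra retry does.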
Sean - have you been in the loop on the analysis/debugging at this site?
Can you confirm that **this** was the issue that broke the setup, and
that it started working once you enlarged/changed things (what did you
change, and from which value to which value)?
Without a relevant use case (or failing case), I don't think there's a
need to spend energy on code to generate the correct timeouts/retries
for this or that setting.
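
For reference, the exponential backoff asked about in the quoted text
could look like the sketch below. This is not CMA code and says nothing
about what the CMA actually does; the function name, base timeout, and
cap are hypothetical.

```python
# Sketch: per-attempt timeouts that double after every timeout, up to a cap.
# All names and values are hypothetical, for illustration only.

def backoff_schedule(base_timeout, retries, cap):
    """Timeouts for the initial attempt and each retry, doubling up to `cap`."""
    return [min(base_timeout * (2 ** attempt), cap)
            for attempt in range(retries + 1)]


if __name__ == "__main__":
    schedule = backoff_schedule(base_timeout=1.0, retries=3, cap=5.0)
    print(schedule, "total:", sum(schedule))
```

With this scheme the application would only pick a small starting value
and a cap, instead of guessing the roundtrip for a given topology.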
Or.