[openib-general] ucma into kernel.org

Michael S. Tsirkin mst at mellanox.co.il
Thu Jul 6 01:19:42 PDT 2006


Quoting r. Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: ucma into kernel.org
> 
> Michael S. Tsirkin wrote:
> > It's a problem, I agree, but hard-coding timeouts still does not make sense
> > to me - I honestly don't see how an application will know which value to
> > use here, since the roundtrip really depends on the topology.
> 
> > Any ideas on how this can be handled correctly? Does CMA at least back off
> > exponentially on timeout?
> 
> From our experience on a cluster on the order of 1K nodes, we did not have
> issues with CM traffic, but: the CM traffic was not NxN but rather NxM where
> N was (say) 1K and M was (say) 16, the app being a cluster file system -
> Lustre/VIBNAL, which is the Lustre IB layer for the Voltaire gen1 stack.

Not sure what you mean by "did not have issues with CM traffic".  Were no
packets lost? Did you run any other traffic on the same fabric concurrently?  I
also don't really see how gen1 tests have any bearing on gen2 CMA.

> As for NxN CM/CMA consumers, I recall it has been mentioned on this list
> that CM timeouts/retries had to be changed to have (say) N=128 nodes
> (ranks?) operating fine with Intel MPI using uDAPL.
> 
> Sean - have you been in the loop of analyzing/debugging at this site?
> 
> Can you confirm **this** was the issue which made the setup broken and 
> working when you enlarged/changed things (what? and from which value to 
> which value?)

What I am saying is that giving the application control over the timeouts
seems more like a workaround than a solution.

> Without any relevant (non) use case I don't think there's a need to
> spend energy on code to generate the correct timeouts/retries for this
> or that setting.

I think apps already have control over retry count - witness TCP_SYNCNT.  As for
the timeouts - I think you are right, and that's why we need something adaptive:
users won't have the energy to tune these per network/application.

-- 
MST



