[openib-general] Local SA caching - why we need it

Woodruff, Robert J robert.j.woodruff at intel.com
Thu Nov 30 14:21:29 PST 2006


Or Gerlitz wrote,
On 11/28/06, Woodruff, Robert J <robert.j.woodruff at intel.com> wrote:
> I suppose we should start a new thread for this discussion.
> I have added Arlin and Sean, who have more details about the problems
> that we have seen on a 256 node cluster with connection scaling
> with OFED 1.1 and how the local SA cache helps solve the problem.
> There was already one thread on this issue on the list, but I
> suppose we should have the discussion again.

> I will let Sean and Arlin provide the details of why an SA cache is
> needed to allow connection establishment to scale to very large
> clusters, since the SA can only handle a limited number of queries
> per second and quickly becomes the bottleneck when trying to
> establish all-to-all communications for MPI or other applications
> that need all-to-all communications. Intel MPI already sees this
> problem on a 256 node cluster.

>As I mentioned to you at the devcon, I think that the discussion
>is not yet at a stage where Arlin has to step in and explain how uDAPL
>works, nor Sean has to step in and state how the rdma_cm is
>implemented with/without the local cache.

>What needs to be done now is to share/review the ***Intel MPI*** conn
>establishment design. The developers have to come and say ***how***
>they execute this all-to-all connection establishment at job start,
>and why they think this is the optimal way to go (first, why not a
>connection establishment on demand model; second, why the
>establishment pattern they use is optimal). The same goes for the Open
>MPI developers (Galen, Jeff), who also mentioned at the devcon that
>they are considering using the RDMA CM for connection establishment
>and think there are issues with SA/CM scalability.

>Once these two points and others that might pop up during the
>discussion are done, we can define the problem (requirements) and
>seek solutions. These solutions might have the local SA as one
>building block, or they might not.

>Or.

This really is not an issue with the Intel MPI connection establishment
design; rather, any application (or set of applications) that needs to
establish lots of connections will have the same issue. As it turns
out, Intel MPI is simply the first application to see it. I suspect
that MVAPICH or HP-MPI running over uDAPL/rdma_cm would see similar
problems when trying to scale to large node counts, but I do not know
that for sure.


I cannot discuss the internal design of Intel MPI, a proprietary
software product, but we have observed that this connection scaling
issue exists and have debugged it enough to know that the bottleneck
is the number of SA queries that can be handled per second.

Again, it is not really about MPI; it is about how many connections per
second the applications running on the nodes of a cluster can
establish. For example, SDP establishes a connection for every socket
that is opened, so running applications that open lots of connections
(such as web servers) over SDP would also hit this problem once the
cluster grew large enough.
What we have observed is that today a host-based SM/SA running on the
fastest server I have can handle around 15,000 queries per second.
That means the entire cluster, no matter what size it is, can never
establish connections faster than that.
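To put rough numbers on it, here is a back-of-the-envelope sketch. It
assumes one path record query per directed node pair with nothing
cached or shared, which is a simplification; the 256-node and 15,000
queries/second figures are the ones above.

/* Hedged, illustrative estimate only: full-mesh setup on an N-node
 * cluster needs roughly N*(N-1) path record lookups if nothing is
 * cached or shared, and every lookup funnels through the one SA. */
#include <stdio.h>

int main(void)
{
    const long nodes = 256;              /* cluster size from above    */
    const double sa_rate = 15000.0;      /* observed SA queries/second */
    long queries = nodes * (nodes - 1);  /* one per directed pair      */

    printf("path record queries: %ld\n", queries);   /* 65280 */
    printf("time in the SA alone: %.1f s\n", queries / sa_rate);
    return 0;
}

With multiple ranks per node, or counting the CM MADs on top of the
path record queries, the time only grows.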

A year ago we knew that, since the SA is a centralized service, it
would become a bottleneck as the number of connections per second
scaled up, and that is why the local_sa cache was developed.

What we have today for SA caching may not be the best long-term
solution, but it is something we can start with and enhance, or replace
with a better solution later, without changing any applications.
Further, the current local_sa caching can be enabled or disabled at
module load time, so people who do not want to use it do not have to.
It does not hurt anything, since you can turn it off if you do not want
to use it, and it does provide benefits to existing applications, e.g.,
Intel MPI, and probably Open MPI when they move to use the RDMA CM.
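For what it's worth, here is a minimal sketch of the route resolution
step as an rdma_cm application (uDAPL today, Open MPI if it moves to
the RDMA CM) performs it, assuming librdmacm. The point is that
rdma_resolve_route() is where the path record lookup happens; whether
the kernel answers it from the local SA cache or with a query to the
SA is invisible at this level, which is why no application changes are
needed.

/* Sketch only: resolve a route to 'dst' via the rdma_cm.  The same
 * code runs with the local SA cache enabled or disabled. */
#include <rdma/rdma_cma.h>

static int resolve_route_to(struct sockaddr *dst)
{
    struct rdma_event_channel *ch;
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;
    int ret = -1;

    ch = rdma_create_event_channel();
    if (!ch)
        return -1;
    if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP))
        goto out_channel;

    /* Bind to a local RDMA device and resolve the destination address. */
    if (rdma_resolve_addr(id, NULL, dst, 2000) ||
        rdma_get_cm_event(ch, &ev))
        goto out_id;
    rdma_ack_cm_event(ev);       /* expect RDMA_CM_EVENT_ADDR_RESOLVED  */

    /* The path record lookup happens here; the kernel either answers
     * it from the local SA cache or sends a MAD query to the SA.      */
    if (rdma_resolve_route(id, 2000) ||
        rdma_get_cm_event(ch, &ev))
        goto out_id;
    rdma_ack_cm_event(ev);       /* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
    ret = 0;

out_id:
    rdma_destroy_id(id);
out_channel:
    rdma_destroy_event_channel(ch);
    return ret;
}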

woody



