[openib-general] Local SA caching - why we need it

Todd Rimmer todd.rimmer at qlogic.com
Tue Nov 28 10:13:18 PST 2006


> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Woodruff, Robert J
> Sent: Tuesday, November 28, 2006 10:52 AM
> To: Michael S. Tsirkin; Davis, Arlin R; Hefty, Sean
> Cc: openib
> Subject: [openib-general] Local SA caching - why we need it
> 
> I suppose we should start a new thread for this discussion.
> I have added Arlin and Sean, who have more details about the problems
> that we have seen on a 256 node cluster with connection scaling
> with OFED 1.1 and how the local SA cache helps solve the problem.
> There was already one thread on this issue on the list, but I
> suppose we should have the discussion again.
> 
> I will let Sean and Arlin provide the details of why an SA cache
> is needed to allow connection establishment to scale to very large
> clusters. The SA can only handle a limited number of queries per
> second and quickly becomes the bottleneck when trying to establish
> all-to-all communications for MPI or other applications that need
> all-to-all communications. Intel MPI already sees this problem on a
> 256 node cluster.
> Other MPIs would see the same problem, but are using a bad
> technique of exchanging QP information over sockets and
> hard-coding connections, which has serious problems of its own.
> 
> Arlin and Sean can provide the details.
> 
> woody
> 

I agree with Woody.  SA caching (actually SA replication) is a very
important feature for scalability.  SilverStorm (now QLogic) has been
shipping a stack with SA replication for over 4 years now and has found
it to be a key feature for scalability and rapid job startup.

Many of the MPIs presently available for the SilverStorm stack use the
CM for connection establishment (directly or indirectly via uDAPL),
including Intel MPI among others.  As Woody mentions, proper use of the
CM and SA is critical for MPI (and other applications) to properly use
advanced features such as QOS, Partitioning, multiple LIDs per port,
proper timeouts for multi-tiered CLOS fabrics, etc.
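
To make that concrete, everything the CM needs for those features
travels in the path record itself.  The sketch below shows an
illustrative subset of the fields (the names loosely follow the Linux
kernel's struct ib_sa_path_rec, but this is a simplified example, not
the exact definition):

    #include <stdint.h>

    /* Illustrative subset of an IB path record; field names loosely
     * follow the kernel's struct ib_sa_path_rec but are simplified
     * here for example purposes. */
    struct example_path_rec {
            uint8_t  dgid[16];          /* destination GID */
            uint8_t  sgid[16];          /* source GID */
            uint16_t dlid;              /* destination LID (one of possibly
                                           several per port) */
            uint16_t slid;              /* source LID */
            uint16_t pkey;              /* partition key -> Partitioning */
            uint8_t  sl;                /* service level -> QOS */
            uint8_t  mtu;               /* path MTU */
            uint8_t  rate;              /* static link rate */
            uint8_t  packet_life_time;  /* basis for proper CM timeouts on
                                           multi-tiered CLOS fabrics */
    };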

If you look at MPI job startup for a 1000 node cluster with 4 processes
per node and a fully connected MPI, you have 4000 processes each asking
for 3999 path records.  That's roughly 16,000,000 SA queries to be
processed (or 4000 queries each with >= 4000 path records per
response).  If the SA can do 10,000 queries per second, that's still
1,600 seconds (over 26 minutes) for the job to start.  Clearly way too
slow.
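
Spelled out as a back-of-the-envelope calculation (the 10,000 queries
per second SA rate is the assumption from above):

    #include <stdio.h>

    /* Back-of-the-envelope SA load for all-to-all MPI job startup.
     * Numbers match the example above; the SA rate is an assumption. */
    int main(void)
    {
            long nodes = 1000, procs_per_node = 4;
            long procs = nodes * procs_per_node;  /* 4000 processes */
            long queries = procs * (procs - 1);   /* ~16,000,000 path
                                                     record queries */
            long sa_rate = 10000;                 /* assumed queries/sec
                                                     the SA can sustain */

            printf("queries=%ld startup=%ld seconds (~%ld minutes)\n",
                   queries, queries / sa_rate, queries / sa_rate / 60);
            return 0;
    }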

In comparison, if each node maintains its own SA cache/replica of the
information relevant to it, the SA is not involved in the job startup
at all, and the job startup time can be a few seconds rather than
minutes or hours.
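
As a hypothetical sketch of what the node-local lookup could look like
(reusing the example_path_rec from the earlier sketch; none of these
names are a real API), connection setup consults a local table first
and only falls back to a real SA query on a miss:

    #include <string.h>

    /* Hypothetical node-local path record replica; collision handling
     * and locking omitted for brevity. */
    struct cache_entry {
            uint8_t                 dgid[16];
            struct example_path_rec rec;
            int                     valid;
    };

    #define CACHE_SLOTS 4096
    static struct cache_entry cache[CACHE_SLOTS];

    static unsigned hash_gid(const uint8_t *gid)
    {
            unsigned h = 0;
            for (int i = 0; i < 16; i++)
                    h = h * 31 + gid[i];
            return h % CACHE_SLOTS;
    }

    /* Returns 0 and fills *rec on a hit; -1 means the caller must
     * issue a real SA query. */
    int local_path_rec_lookup(const uint8_t *dgid,
                              struct example_path_rec *rec)
    {
            struct cache_entry *e = &cache[hash_gid(dgid)];

            if (e->valid && memcmp(e->dgid, dgid, 16) == 0) {
                    *rec = e->rec;
                    return 0;   /* served locally, no SA round trip */
            }
            return -1;          /* miss: fall back to the SA */
    }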

A well-designed SA cache/replica can use the assorted InformInfo notices
from the SM to detect when GIDs come and go and hence properly update
the relevant subset of its replica.
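
Extending the hypothetical sketch above, such a handler might key off
SM traps 64 (GID now in service) and 65 (GID out of service), which
are the notices an InformInfo subscription delivers:

    /* Hypothetical notice handler; assumes the node subscribed via
     * InformInfo for traps 64 and 65.  Invalidating an entry forces a
     * fresh SA query the next time that destination is used. */
    #define TRAP_GID_IN_SERVICE     64
    #define TRAP_GID_OUT_OF_SERVICE 65

    void handle_sm_notice(uint16_t trap_num, const uint8_t *gid)
    {
            struct cache_entry *e = &cache[hash_gid(gid)];

            switch (trap_num) {
            case TRAP_GID_OUT_OF_SERVICE:
                    if (memcmp(e->dgid, gid, 16) == 0)
                            e->valid = 0;   /* drop the stale path record */
                    break;
            case TRAP_GID_IN_SERVICE:
                    /* Endpoint (re)appeared: any cached entry for it may
                     * be stale, so re-query the SA on next use. */
                    if (memcmp(e->dgid, gid, 16) == 0)
                            e->valid = 0;
                    break;
            }
    }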

Todd Rimmer



