[openib-general] SA cache design

Rimmer, Todd trimmer at silverstorm.com
Fri Jan 6 05:50:33 PST 2006


> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> Hi Sean, Todd,
> 
> Although I like the "replica" idea for its "query" 
> performance boost - I suspect it will not actually scale 
> for very large
> networks: each node having to query the entire database 
> would cause N^2 load on the SA.
> After any change (which happens with higher probability on 
> large networks) the SA will need to send each Report to N targets.
> 
> We already have some bad experience with SA query issues 
> on large clusters, like the one reported by Roland
> "searching for SRP targets using PortInfo capability mask".
> 
Our experience has been the exact opposite.
There is an initial load on the SA to populate the replica, which we have reduced with various techniques such as backing off when the SA reports Busy and starting the query at a random time offset.  The boost occurs when a new application starts, such as an MPI job that uses the SA/CM to establish connections as per the IBTA spec.  In a 1000-process MPI job, each process would make 999 queries to the SA at job startup time.  This causes a burst of 999,000 sets of SA queries (most will involve both Node Record and Path Record queries, so it will really be 2x this amount) BEFORE the MPI job can actually start.
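
The two load-reduction techniques mentioned above (random start offset, backing off on Busy) can be sketched roughly as follows. This is only an illustration and not our actual implementation: `query_sa` is a hypothetical callable standing in for whatever issues the SA query, and the delay values are made-up defaults.

```python
import random
import time


def populate_replica(query_sa, max_start_delay=5.0, base_backoff=0.5,
                     max_backoff=30.0, max_attempts=8,
                     rng=random.random, sleep=time.sleep):
    """Fetch the replica contents without hammering the SA.

    query_sa is a hypothetical callable returning (busy, records);
    it is an assumption of this sketch, not a real OpenIB API.
    """
    # Random start offset so that all nodes do not query at the same instant.
    sleep(rng() * max_start_delay)

    backoff = base_backoff
    for _ in range(max_attempts):
        busy, records = query_sa()
        if not busy:
            return records
        # SA reported Busy: wait, then retry with a doubled (capped) delay.
        sleep(backoff)
        backoff = min(backoff * 2, max_backoff)
    raise RuntimeError("SA still busy after %d attempts" % max_attempts)
```

The `rng` and `sleep` parameters exist only so the delays can be replaced when testing; a real node would use the defaults.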

As OpenIB moves forward to implement QoS and other features, MPI will have to use the SA to get its path records.  If you study MVAPICH at present, it merely exchanges LIDs between nodes and either hardcodes all the other QoS parameters or takes them from environment variables, using the same value for all processes.  In a true QoS and congestion management environment it will instead have to use the CM/SA.

We have been using this replica technique quite successfully for 2-3 years now.  Our MPI has used the SA/CM for connection establishment for just as long.

As was pointed out, most fabrics will be quite stable.  Hence having a replica and paying the cost of the SA queries once will be much more efficient than paying that cost on every application startup.
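
To put rough numbers on the per-startup cost, here is a small sketch of the arithmetic. The default of 2 queries per peer reflects the Node Record plus Path Record case described earlier; the figure of ten job launches is an arbitrary assumption for illustration, and the one-time cost of populating the replica is not modeled.

```python
def startup_queries(processes, queries_per_peer=2):
    """SA queries issued when every process resolves every other process."""
    return processes * (processes - 1) * queries_per_peer


def total_without_replica(job_sizes, queries_per_peer=2):
    """Cumulative SA queries over a sequence of job startups."""
    return sum(startup_queries(p, queries_per_peer) for p in job_sizes)


print(startup_queries(1000, queries_per_peer=1))  # 999000
print(startup_queries(1000))                      # 1998000
print(total_without_replica([1000] * 10))         # 19980000
```

Without a replica this cost recurs on every job launch, whereas the replica population load is paid once and repeated only when the fabric changes.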

Todd Rimmer


