[openib-general] SA cache design

Fri Jan 6 12:11:34 PST 2006

Hi Todd,

So you agree we will need to design "replica" build-up scalability features into the solution (to avoid the bring-up load on the SA)?

Why would a caching system, instead of replicating the data, not work here?

The caching concept keeps the SA in the loop: it can invalidate cache entries, or entries can expire under a cache-entry lifetime policy.
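
As a minimal sketch of the caching idea above (not an existing OpenIB API), a per-node cache could combine a lifetime policy with explicit invalidation. The names `PathRecordCache`, `query_sa`, and the `(source LID, destination LID)` key are assumptions made for illustration only.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (source LID, destination LID).
PathKey = Tuple[int, int]


class PathRecordCache:
    """Local cache of path records with a lifetime and explicit invalidation."""

    def __init__(self, query_sa: Callable[[PathKey], dict], lifetime_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._query_sa = query_sa      # assumed callable that asks the SA
        self._lifetime_s = lifetime_s  # cache-entry lifetime policy
        self._clock = clock
        self._entries: Dict[PathKey, Tuple[float, dict]] = {}

    def get(self, key: PathKey) -> dict:
        """Return a cached record, querying the SA on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._lifetime_s:
            return hit[1]
        record = self._query_sa(key)
        self._entries[key] = (now, record)
        return record

    def invalidate(self, key: PathKey) -> None:
        """Drop one entry, e.g. when the SA reports a change for that path."""
        self._entries.pop(key, None)

    def invalidate_all(self) -> None:
        """Drop everything, e.g. after a fabric-wide change such as new SL2VL maps."""
        self._entries.clear()
```

With this shape, only the first lookup of a path reaches the SA; later lookups are served locally until the entry expires or is invalidated.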

The reason I think a total replica (distribution of the SA) would eventually be problematic is that, as we approach QoS solutions,
some need for path record use and retirement is going to show up. What if the SM decides to change SL2VL maps due to a new QoS requirement?
We will need a more complicated "synchronization" or invalidation technique to push that kind of data into the "replica" SAs.

Eitan

Rimmer, Todd wrote:
>>From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
>>Hi Sean, Todd,
>>
>>Although I like the "replica" idea for its "query" 
>>performance boost, I suspect it will not actually scale 
>>for very large
>>networks: having each node query for the entire database 
>>would cause N^2 load on the SA.
>>After any change (which does happen with higher probability on 
>>large networks) the SA will need to send each Report to N targets.
>>
>>We already have some bad experience with SA query issues 
>>on large clusters, like the one reported by Roland:
>>"searching for SRP targets using PortInfo capability mask".
>>
> 
> Our experience has been the exact opposite.
> There is an initial load on the SA to populate the replica, which we have reduced with various techniques (such as backing off when the SA reports Busy, having a random time offset for the start of the query, etc.).  The boost occurs when a new application starts, such as an MPI using the SA/CM to establish connections as per the IBTA spec.  A 1000-process MPI job would have each process make 999 queries to the SA at job startup time.  This causes a burst of 999,000 sets of SA queries (most will involve both Node Record and Path Record queries, so it will really be 2x this amount) BEFORE the MPI job can actually start.
> 
> As Open IB moves forward to implement QOS and other features, MPI will have to use the SA to get its path records.  If you study MVAPICH at present, it merely exchanges LIDs between nodes and hardcodes all the other QOS parameters (or, via environment variables, uses the same value for all processes).  In a true QOS and congestion management environment it will instead have to use the CM/SA.
> 
> We have been using this replica technique quite successfully for 2-3 years now.  Our MPI has used the SA/CM for connection establishment for just as long.
> 
> As was pointed out, most fabrics will be quite stable.  Hence having a replica and paying the cost of the SA queries once will be much more efficient than paying that cost on every application startup.
> 
> Todd Rimmer
>
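
The startup arithmetic and the load-reduction techniques Todd mentions (backing off when the SA reports Busy, a random start offset) can be sketched as follows. This is an illustration under stated assumptions, not the implementation described in the thread: `SABusy`, `query_with_backoff`, and the exponential back-off policy are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SABusy(Exception):
    """Hypothetical signal that the SA answered a query with a Busy status."""


def startup_query_sets(processes: int) -> int:
    """Query sets issued at job start if every process queries every peer."""
    return processes * (processes - 1)


def query_with_backoff(query: Callable[[], T], max_start_delay_s: float = 1.0,
                       base_delay_s: float = 0.1, max_attempts: int = 8,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Callable[[], float] = random.random) -> T:
    """Run one SA query with a random start offset and back-off on Busy."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Random start offset so that N nodes do not all hit the SA at once.
    sleep(rng() * max_start_delay_s)
    attempt = 0
    while True:
        try:
            return query()
        except SABusy:
            attempt += 1
            if attempt >= max_attempts:
                raise
            # Exponential back-off with jitter (an assumed policy).
            sleep(base_delay_s * 2 ** (attempt - 1) * (0.5 + rng()))


print(startup_query_sets(1000))      # 999000 query sets
print(2 * startup_query_sets(1000))  # 1998000 if each set is two queries
```

The two printed values correspond to the 1000-process example: 1000 x 999 query sets, doubled when each set involves both a Node Record and a Path Record query.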