[openib-general] SA cache design
Rimmer, Todd
trimmer at silverstorm.com
Thu Jan 12 10:46:26 PST 2006
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> I'm not sure what the speed-up of any cache will be. The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries. This need doesn't go away. The SA
> itself is a perfect candidate to be implemented using a DBMS. (And if one
> had been implemented over a DBMS, I'm not even sure that we'd be talking
> about scalability issues for only a few thousand nodes. Is the perceived
> lack of scalability of the SA a result of the architecture or the existing
> implementations?)
The scalability problem occurs during things like MPI job startup.
At startup you will have N processes, each of which needs N-1 path
records to establish connections, and satisfying those requests involves
both Node Record and Path Record queries.
This means that at job startup the SA must process O(N^2) SA queries.
If the lookup algorithm in the SA is O(log M) (M = number of SA records,
which is itself O(N^2)), then the SA will have
O(N^2 log(N^2)) operations to perform and O(N^2) packets to send and receive.
For a 4000-CPU cluster (1000 nodes with 2 dual-core CPUs each),
that is over 16 million SA queries at job startup against a 1-million-entry
SA database.
It would take quite a good SA database implementation to handle that
in a timely manner.
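For concreteness, a rough back-of-envelope sketch of that arithmetic (the
cluster shape and the O(log M) lookup cost are the assumptions stated above;
the figures are illustrative, not measured):

    import math

    # Assumed cluster shape from the example above.
    nodes = 1000
    cpus_per_node = 4                     # 2 dual-core CPUs per node
    procs = nodes * cpus_per_node         # 4000 MPI processes

    sa_queries = procs * (procs - 1)      # each process asks about every other
    sa_db_entries = nodes * nodes         # path records grow as O(N^2) in nodes

    lookup_steps = math.log2(sa_db_entries)      # assumed O(log M) lookup
    total_lookup_ops = sa_queries * lookup_steps

    print(f"SA queries at startup: {sa_queries:,}")       # ~16 million
    print(f"SA database entries  : {sa_db_entries:,}")    # 1 million
    print(f"Total lookup steps   : {total_lookup_ops:,.0f}")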
In contrast, the replica on each node only needs to hold O(N) entries,
and its lookup time could be O(log N).
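The same kind of sketch for a node-local replica, again assuming a 1000-node
fabric and a logarithmic lookup:

    import math

    nodes = 1000
    replica_entries = nodes - 1                    # roughly one entry per peer node
    steps_per_lookup = math.log2(replica_entries)  # O(log N) local lookup

    print(f"Entries per node-local replica: {replica_entries:,}")    # 999
    print(f"Steps per local lookup        : {steps_per_lookup:.1f}") # ~10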
You'll note I spoke of processes, not nodes. On multi-CPU nodes,
each process will need similar information. This is one area where a
replica can greatly help: why ask the SA the same question multiple times
in a row?
If only a cache is considered, then startup still requires O(N^2) SA queries;
it's just that the count drops by a factor of the number of CPUs per node.
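To illustrate that difference, a rough count under the same assumed cluster
shape:

    nodes = 1000
    cpus_per_node = 4
    procs = nodes * cpus_per_node

    # No cache: every process queries the SA about every other process.
    queries_no_cache = procs * (procs - 1)

    # Per-node cache: the first process on a node populates the cache and the
    # remaining processes on that node are answered locally, so the SA sees
    # roughly 1/cpus_per_node as many queries -- still O(N^2) in processes.
    queries_with_cache = queries_no_cache // cpus_per_node

    print(f"SA queries, no cache  : {queries_no_cache:,}")    # ~16 million
    print(f"SA queries, with cache: {queries_with_cache:,}")  # ~4 million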
Todd Rimmer