[openib-general] SA cache design
Rimmer, Todd
trimmer at silverstorm.com
Thu Jan 12 10:46:26 PST 2006
> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> I'm not sure what the speed-up of any cache will be. The SA maintains a
> database of various related records - node records, path records, service
> records, etc. and responds to queries. This need doesn't go away. The SA
> itself is a perfect candidate to be implemented using a DBMS. (And if one
> had been implemented over a DBMS, I'm not even sure that we'd be talking
> about scalability issues for only a few thousand nodes. Is the perceived
> lack of scalability of the SA a result of the architecture or the existing
> implementations?)
The scalability problem occurs during things like MPI job startup.
At startup you will have N processes, each of which needs N-1 path
records to establish connections, and satisfying those requests involves
both Node Record and Path Record queries.
This means that at job startup the SA must process O(N^2) SA queries.
If the lookup algorithm in the SA is O(log M) (M = number of SA records,
which is itself O(N^2)), then the SA will have
O(N^2 log(N^2)) operations to perform and O(N^2) packets to send and receive.
For a 4000-CPU cluster (1000 nodes with 2 dual-core CPUs each),
that is over 16 million SA queries at job startup against a 1-million-entry
SA database.
It would take quite a good SA database implementation to handle that
in a timely manner.
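For concreteness, a rough back-of-envelope sketch of that arithmetic (the
cluster shape and the O(log M) lookup cost are the assumptions stated above;
the figures are illustrative, not measured):

    import math

    # Assumed cluster shape from the example above.
    nodes = 1000
    cpus_per_node = 4                     # 2 dual-core CPUs per node
    procs = nodes * cpus_per_node         # 4000 MPI processes

    sa_queries = procs * (procs - 1)      # each process asks about every other
    sa_db_entries = nodes * nodes         # path records grow as O(N^2) in nodes

    lookup_steps = math.log2(sa_db_entries)      # assumed O(log M) lookup
    total_lookup_ops = sa_queries * lookup_steps

    print(f"SA queries at startup: {sa_queries:,}")       # ~16 million
    print(f"SA database entries  : {sa_db_entries:,}")    # 1 million
    print(f"Total lookup steps   : {total_lookup_ops:,.0f}")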
In contrast, the replica on each node only needs to hold O(N) entries,
and its lookup time could be O(log N).
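The same kind of sketch for a node-local replica, again assuming a 1000-node
fabric and a logarithmic lookup:

    import math

    nodes = 1000
    replica_entries = nodes - 1                    # roughly one entry per peer node
    steps_per_lookup = math.log2(replica_entries)  # O(log N) local lookup

    print(f"Entries per node-local replica: {replica_entries:,}")    # 999
    print(f"Steps per local lookup        : {steps_per_lookup:.1f}") # ~10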
You'll note I spoke of processes, not nodes. On multi-CPU nodes,
each process will need similar information. This is one area where a
replica can greatly help: why ask the SA the same question multiple times
in a row?
If only a cache is considered, then startup still requires O(N^2) SA queries;
it's just that the count drops by a factor of the number of CPUs per node.
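To illustrate that difference, a rough count under the same assumed cluster
shape:

    nodes = 1000
    cpus_per_node = 4
    procs = nodes * cpus_per_node

    # No cache: every process queries the SA about every other process.
    queries_no_cache = procs * (procs - 1)

    # Per-node cache: the first process on a node populates the cache and the
    # remaining processes on that node are answered locally, so the SA sees
    # roughly 1/cpus_per_node as many queries -- still O(N^2) in processes.
    queries_with_cache = queries_no_cache // cpus_per_node

    print(f"SA queries, no cache  : {queries_no_cache:,}")    # ~16 million
    print(f"SA queries, with cache: {queries_with_cache:,}")  # ~4 million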
Todd Rimmer