[openib-general] [PATCH 0/4] SA path record caching
Or Gerlitz
ogerlitz at voltaire.com
Wed Feb 1 07:29:04 PST 2006
Sean Hefty wrote:
> The cache
> is updated using an SA GET_TABLE request, which is more efficient than
> sending separate SA GET requests for each path record.
> Your assumption is correct. The implementation will contain copies of
> all path records whose SGID is a local node GID. (Currently it contains
> only a single path record per SGID/DGID, but that will be expanded.)
Taking into account the invalidation window of 15 minutes that you mentioned
in one of the emails, and doing some math, I have come to the following:
For a 1k node/port fabric the SM/SA needs to transmit a table of 1k paths to
each local SA. With 3 paths per MAD, that takes at least 350 MADs (roughly
330 RMPP segments + 20 ACKs). Since we have 1k nodes, there are 350K MADs to
transmit, and if we assume transmission is uniform over a 1k-second window
(rounding the invalidation window up to 1000 seconds = 16 minutes & 40
seconds), we require the SM to transmit at a constant rate of 350K/1K = 350
MADs/sec, forever. And this is RMPP, so depending on the RMPP implementation
it may run into retransmission of segments or of the whole payload. Each such
table also takes about 90KB (350 * 256 bytes) of RAM, so the SM needs to
allow for up to 90MB of RAM to hold all those tables.
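To keep the arithmetic above easy to check, here is a minimal Python sketch
of the same back-of-envelope estimate; the 3-paths-per-MAD packing, the
20-ACK overhead, the 256-byte MAD size and the 1000-second window are the
assumptions stated above, not measured values:

# Back-of-envelope SA load estimate for the 1k-node case above.
NODES = 1000            # ports / local SAs in the fabric
PATHS_PER_NODE = 1000   # path records in each GET_TABLE response
PATHS_PER_MAD = 3       # assumed packing of path records per RMPP segment
ACK_OVERHEAD = 20       # rough number of RMPP ACKs per table transfer
MAD_SIZE = 256          # bytes per MAD
WINDOW_SEC = 1000       # invalidation window, rounded from 15 minutes

segments = -(-PATHS_PER_NODE // PATHS_PER_MAD)   # ceil(): ~334 RMPP segments
mads_per_table = segments + ACK_OVERHEAD         # ~354 MADs per node
total_mads = mads_per_table * NODES              # ~350K MADs per window
rate = total_mads / WINDOW_SEC                   # ~350 MADs/sec, sustained
table_ram = segments * MAD_SIZE                  # ~85KB per table
total_ram = table_ram * NODES / 1e6              # ~85MB, the ~90MB ballpark above

print(total_mads, "MADs per window,", round(rate), "MADs/sec,",
      round(total_ram), "MB of cached tables at the SM")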
Aren't we creating a monster here??? If this is an SA replica that should
scale from day one, let's call it that and look at how to get there.
> I view MPI as one of the primary reasons for having a cache. Waiting
> for a
> failed lookup to create the initial cache would delay the startup time
> for apps wanting all-to-all connection establishment. In this case, we
> also get the side effect that the SA receives GET_TABLE requests from
> every node at roughly the same time.
Talking MPI, here are a few points that seem to me somewhat unaddressed in
the all-to-all cache design:
+ neither MVAPICH nor OpenMPI uses path queries
+ OpenMPI opens its connections on demand, that is, only if rank I attempts
to send a message to rank J does I connect to J
+ even MPIs that connect all-to-all in an N-rank job would do only N(N-1)/2
path queries, so the aggregated load on the SA is about half of what the
all-to-all caching scheme generates (see the sketch below)
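To illustrate the last bullet, here is a minimal sketch (counting path
records the SA has to serve, not wire MADs), assuming an on-demand MPI needs
exactly one path record per rank pair while the cache has every node pull a
full table of all paths with its GID as SGID; the ratio approaches 1/2,
which is the "half the load" point above:

# Path records the SA serves: on-demand pairwise GETs vs per-node GET_TABLE.
def on_demand_records(n_ranks):
    # one SA GET per rank pair: N(N-1)/2 path records in total
    return n_ranks * (n_ranks - 1) // 2

def cache_records(n_nodes):
    # every node pulls a table of all n_nodes paths whose SGID is its GID
    return n_nodes * n_nodes

for n in (256, 1000):
    ratio = on_demand_records(n) / cache_records(n)
    print(n, on_demand_records(n), cache_records(n), round(ratio, 3))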
Or.