[openib-general] [PATCH 0/4] SA path record caching

Or Gerlitz ogerlitz at voltaire.com
Wed Feb 1 07:29:04 PST 2006


Sean Hefty wrote:
> The cache 
> is updated using an SA GET_TABLE request, which is more efficient than 
> sending separate SA GET requests for each path record.  
> Your assumption is correct.  The implementation will contain copies of 
> all path records whose SGID is a local node GID.  (Currently it contains 
> only a single path record per SGID/DGID, but that will be expanded.)

Taking into account the invalidation window of 15 minutes which you have 
mentioned in one of the emails, and doing some math, I have come to the 
following:

For a 1k node/port fabric the SM/SA needs to xmit a table of 1k paths to 
each local SA. You can embed 3 paths in a MAD, so each table takes at 
least ~350 MADs (about 330 RMPP segments + 20 ACKs). Since we have 1k 
nodes there are 350K MADs to xmit, and if we assume the xmit is spread 
uniformly over the 1000 seconds (1000 seconds = 16 minutes & 40 seconds 
invalidation window), we require the SM to xmit at a constant rate of 
350K/1K = 350 MADs/sec, forever. And this is RMPP, so depending on the 
RMPP implementation it can run into re-transmission of segments or of 
the whole payload. Each such table also takes ~90KB (350*256) of RAM, so 
the SM needs to allow for up to 90MB of RAM to hold all those tables.
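
Just to make the arithmetic explicit, here is a rough back-of-the-envelope
sketch in C. The constants (3 path records per 256-byte MAD, ~20 RMPP ACKs
per table, a 1000-second refresh window) are my assumptions from the
paragraph above, not anything taken from the patches themselves:

/* Rough estimate of the SA transmit load and SM memory footprint
 * implied by refreshing a full path table on every port.
 * All constants are assumptions, not values from the patches. */
#include <stdio.h>

int main(void)
{
	const int ports          = 1000;  /* nodes/ports in the fabric      */
	const int paths_per_port = 1000;  /* one record per remote port     */
	const int paths_per_mad  = 3;     /* path records per RMPP segment  */
	const int ack_overhead   = 20;    /* rough RMPP ACK count per table */
	const int mad_size       = 256;   /* bytes per MAD                  */
	const int window_sec     = 1000;  /* cache invalidation window      */

	int  mads_per_table = paths_per_port / paths_per_mad + ack_overhead;
	long total_mads     = (long)mads_per_table * ports;
	long mads_per_sec   = total_mads / window_sec;
	long table_bytes    = (long)mads_per_table * mad_size;
	long sm_ram_bytes   = table_bytes * ports;

	printf("MADs per table refresh : %d\n", mads_per_table);         /* ~350 */
	printf("sustained SA tx rate   : %ld MADs/sec\n", mads_per_sec); /* ~350 */
	printf("RAM per table          : %ld KB\n", table_bytes / 1000); /* ~90  */
	printf("SM RAM, all tables     : %ld MB\n", sm_ram_bytes / 1000000); /* ~90 */
	return 0;
}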

Aren't we creating a monster here? If this is an SA replica which should 
work at scale from day one, let's call it that and see how to get there.

> I view MPI as one of the primary reasons for having a cache. Waiting for a
> failed lookup to create the initial cache would delay the startup time
> for apps wanting all-to-all connection establishment. In this case, we
> also get the side effect that the SA receives GET_TABLE requests from
> every node at roughly the same time.

Talking MPI, here are a few points that seem to me somewhat unaddressed 
in the all-to-all cache design:

+ neither MVAPICH nor OpenMPI is using path queries

+ OpenMPI opens its connections on demand, that is, only if rank I 
attempts to send a message to rank J does I connect to J

+ even MPIs that connect all-to-all in an N-rank job would do only 
n(n-1)/2 path queries, so the aggregate load on the SA is half of what 
the all-to-all caching scheme generates (a rough comparison is sketched 
below)
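
For comparison, here is the same kind of back-of-the-envelope sketch for 
the MPI case. The n(n-1)/2 figure is the one from the last point above; 
the assumption that the cache ends up pushing a full table of n records 
to each of the n ports is mine:

/* Compare the aggregate number of path records the SA has to serve
 * for an N-rank all-to-all job: on-demand GETs vs. the cache push.
 * Assumes one rank per port; the constants are assumptions. */
#include <stdio.h>

int main(void)
{
	const long n = 1000;                 /* ranks / ports in the job       */

	long on_demand  = n * (n - 1) / 2;   /* one query per rank pair        */
	long cache_push = n * n;             /* n ports each pulling n records */

	printf("on-demand path queries : %ld records\n", on_demand);  /* ~500K */
	printf("all-to-all cache push  : %ld records\n", cache_push); /* 1M    */
	return 0;
}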

Or.
