[openib-general] SA cache design

Eitan Zahavi eitan at mellanox.co.il
Thu Jan 5 04:27:06 PST 2006


Hi Sean,

This is a great initiative - tackling an important issue.
I am glad you took this on.

Please see below.

Sean Hefty wrote:
> I've been given the task of trying to come up with an implementation for 
> an SA cache.  The intent is to increase the scalability and performance 
> of the openib stack.  My current thoughts on the implementation are 
> below.  Any feedback is welcome.
> 
> To keep the design as flexible as possible, my plan is to implement the 
> cache in userspace.  The interface to the cache would be via MADs.  
> Clients would send their queries to the sa_cache instead of the SA 
> itself.  The format of the MADs would be essentially identical to those 
> used to query the SA itself.  Response MADs would contain any requested 
> information.  If the cache could not satisfy a request, the sa_cache 
> would query the SA, update its cache, then return a reply.
* I think the idea of using MADs to interface with the cache is very good.
* User space implementation:
   This may also be a good tradeoff between coding and debugging effort versus
   the impact on the number of connections per second. I hope the performance impact
   will not be too big. Maybe we can take the path of implementing in user space and,
   if the performance penalty turns out to be too high, port to the kernel.
* Regarding the sentence: "Clients would send their queries to the sa_cache instead of the SA"
   I would propose that an "SA MAD send switch" be implemented in the core: such a switch
   would enable plugging in the SA cache (I would prefer calling it an SA local agent due to
   its extended functionality). Once plugged in, this "SA local agent" would be forwarded all
   outgoing SA queries. Once it handles a MAD, it should be able to inject the response through
   the core "SA MAD send switch" as if it arrived from the wire; a rough sketch of such a hook follows.
> 
> The benefits that I see with this approach are:
> 
> + Clients would only need to send requests to the sa_cache.
> + The sa_cache can be implemented in stages.  Requests that it cannot 
> handle would just be forwarded to the SA.
> + The sa_cache could be implemented on each host, or a select number of 
> hosts.
> + The interface to the sa_cache is similar to that used by the SA.
> + The cache would use virtual memory and could be saved to disk.
> 
> Some drawbacks specific to this method are:
> 
> - The MAD interface will result in additional data copies and userspace 
> to kernel transitions for clients residing on the local system.
> - Clients require a mechanism to locate the sa_cache, or need to make 
> assumptions about its location.
The proposal for an "SA MAD send switch" in the core would resolve this issue.
No client changes would be required, as all MADs are sent through the core, which would
redirect them to the SA local agent ...

Functional requirements:
* It is clear that the first SA query to cache is PathRecord.
   So if a new client wants to connect to another node, a new PathRecord
   query will not need to be sent to the SA. However, recent work on QoS has pointed out
   that under some QoS schemes a PathRecord should not be shared by different clients
   or even connections. There are several ways to make such QoS schemes scale.
   Since this is a different discussion topic, I only bring it up so that
   we take into account that caching might also need to be keyed by a complex key (not just
   SRC/DST ...); see the sketch after this list.
* Forgive me for bringing the following issue to the group over and over:
   Multicast Join/Leave should be reference counted (also covered in the sketch below).
   The "SA local agent" could be the right place for doing this kind of reference counting
   (although if it does that, it probably needs to be located in the kernel - to enable
   cleanup after killed processes).
* Similarly, "Client re-registration" could be made transparent to clients.
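
To make the two points above concrete, here is a minimal sketch of a cache key that goes
beyond SRC/DST and of a reference-counted multicast membership entry. The field choice and
names are my assumptions for illustration, not a settled design:

    #include <rdma/ib_verbs.h>

    /*
     * Sketch of a cache key that is more than SRC/DST, so QoS-aware
     * schemes can still be cached per client or per connection.
     */
    struct sa_cache_path_key {
        union ib_gid sgid;
        union ib_gid dgid;
        __be16       pkey;
        u8           sl;          /* QoS: service level */
        __be64       service_id;  /* QoS: per-service/per-connection paths */
    };

    /*
     * Reference-counted multicast membership, as the SA local agent
     * could track it: Join is sent to the SA only on 0 -> 1 and Leave
     * only on 1 -> 0, collapsing per-process requests per node.
     */
    struct sa_cache_mcast {
        union ib_gid mgid;
        int          refcount;
    };
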

Cache Invalidation:
Several discussions about PathRecord invalidation have been spawned in the past.
IMO, it is enough to be notified about the death of local processes, remote port availability (trap 64/65) and
multicast group availability (trap 66/67) in order to invalidate SA cache information.
So each SA local agent could register to obtain this data. But that solution does not scale nicely,
as the SA needs to send a notification to every node (though it is reliable - the SA can resend until the report is Repressed).
However, the current IBTA definition of InformInfo (the event forwarding mechanism) does not
allow for multicast of Report(Notice). The reason is that registration for event forwarding
is done with Set(InformInfo), which uses the requester's QP and LID as the address for sending
the matching report. A simple way around that limitation could be to enable the SM to "pre-register"
a well-known multicast group as the target for event forwarding. One issue, though, would be that UD multicast
is not reliable and some notifications could get lost. A notification sequence number could be used
to catch these missed notifications eventually; a sketch of this invalidation flow follows below.
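
As an illustration of that flow, here is a small sketch of how the SA local agent might consume
forwarded notices. The trap numbers follow the IBTA spec; struct sa_cache, the invalidation
helpers, and the sequence-number handling are hypothetical names for the scheme described above:

    #include <rdma/ib_verbs.h>

    struct sa_cache;                          /* hypothetical cache object */
    void sa_cache_flush_all(struct sa_cache *cache);
    void sa_cache_invalidate_paths_to(struct sa_cache *cache, union ib_gid *gid);
    void sa_cache_invalidate_mcast(struct sa_cache *cache, union ib_gid *mgid);

    static u16 last_notice_seq;               /* per-agent notification counter */

    static void sa_agent_handle_notice(struct sa_cache *cache, u16 trap_num,
                                       u16 seq, union ib_gid *gid)
    {
        /*
         * A gap in the sequence numbers means we lost a notice over UD
         * multicast; flush everything rather than serve stale data.
         */
        if (seq != (u16)(last_notice_seq + 1))
            sa_cache_flush_all(cache);
        last_notice_seq = seq;

        switch (trap_num) {
        case 64:                              /* GID now in service */
        case 65:                              /* GID out of service */
            sa_cache_invalidate_paths_to(cache, gid);
            break;
        case 66:                              /* MC group created */
        case 67:                              /* MC group deleted */
            sa_cache_invalidate_mcast(cache, gid);
            break;
        }
    }
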

Eitan


