[openib-general] SA cache design
Hal Rosenstock
halr at voltaire.com
Thu Jan 5 07:20:45 PST 2006
Hi Eitan,
On Thu, 2006-01-05 at 07:27, Eitan Zahavi wrote:
> Hi Sean,
>
> This is a great initiative - tackling an important issue.
> I am glad you took this on.
>
> Please see below.
>
> Sean Hefty wrote:
> > I've been given the task of trying to come up with an implementation for
> > an SA cache. The intent is to increase the scalability and performance
> > of the openib stack. My current thoughts on the implementation are
> > below. Any feedback is welcome.
> >
> > To keep the design as flexible as possible, my plan is to implement the
> > cache in userspace. The interface to the cache would be via MADs.
> > Clients would send their queries to the sa_cache instead of the SA
> > itself. The format of the MADs would be essentially identical to those
> > used to query the SA itself. Response MADs would contain any requested
> > information. If the cache could not satisfy a request, the sa_cache
> > would query the SA, update its cache, then return a reply.
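Just so I'm sure I follow the intended flow, here's roughly what I picture the
sa_cache doing with each query it receives. This is only a sketch; sa_cache_lookup(),
sa_cache_insert(), forward_to_sa() and send_reply() are placeholder names, not
existing calls:

/* Sketch of the lookup-or-forward flow described above.  All of the
 * sa_cache_* / forward_to_sa() / send_reply() calls and types are placeholders. */
static void sa_cache_handle_query(struct sa_query *req)
{
        struct sa_record *rec;

        /* Try to satisfy the request from the local cache first. */
        rec = sa_cache_lookup(req->attr_id, req->comp_mask, req->data);
        if (!rec) {
                /* Cache miss: query the real SA and remember the answer. */
                rec = forward_to_sa(req);
                if (rec)
                        sa_cache_insert(req->attr_id, rec);
        }

        /* Reply to the client in the same MAD format the SA would use
         * (rec == NULL would map to an error status in the response). */
        send_reply(req, rec);
}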
> * I think the idea of using MADs to interface with the cache is very good.
> * User space implementation:
> This might also be a good tradeoff between coding and debugging effort versus
> the impact on the number of connections per second. I hope the impact on performance
> will not be too big. Maybe we can take the path of implementing in user space and,
> if the performance penalty turns out to be too high, port it to the kernel.
> * Regarding the sentence: "Clients would send their queries to the sa_cache instead of the SA"
> I would propose that an "SA MAD send switch" be implemented in the core: such a switch
> will enable plugging in the SA cache (I would prefer calling it "SA local agent" due to
> its extended functionality). Once plugged in, this "SA local agent" should be forwarded all
> outgoing SA queries. Once it handles a MAD, it should be able to inject the response through
> the core "SA MAD send switch" as if it arrived from the wire.
> >
> > The benefits that I see with this approach are:
> >
> > + Clients would only need to send requests to the sa_cache.
> > + The sa_cache can be implemented in stages. Requests that it cannot
> > handle would just be forwarded to the SA.
> > + The sa_cache could be implemented on each host, or a select number of
> > hosts.
> > + The interface to the sa_cache is similar to that used by the SA.
> > + The cache would use virtual memory and could be saved to disk.
> >
> > Some drawbacks specific to this method are:
> >
> > - The MAD interface will result in additional data copies and userspace
> > to kernel transitions for clients residing on the local system.
> > - Clients require a mechanism to locate the sa_cache, or need to make
> > assumptions about its location.
> The proposal for an "SA MAD send switch" in the core would resolve this issue.
> No client change would be required, as all MADs are sent through the core, which would
> redirect them to the SA agent ...
I see this as more granular than a complete switch for the entire class.
More like on a per-attribute basis.
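Purely to illustrate the granularity I have in mind, a per-attribute hook into
such a core switch might look something like this (none of these names exist in
the current core; the handler signature is made up):

/* Hypothetical per-attribute hook into a core "SA MAD send switch". */
enum sa_agent_disposition {
        SA_AGENT_HANDLED,       /* agent will inject the response itself   */
        SA_AGENT_PASS           /* forward the MAD to the real SA as usual */
};

typedef enum sa_agent_disposition (*sa_agent_handler_t)(struct ib_mad *mad,
                                                        void *context);

/* Register a local agent for a single SA attribute (e.g. PathRecord);
 * queries for other attributes keep going out to the SA untouched. */
int ib_sa_register_agent(uint16_t attr_id, sa_agent_handler_t handler,
                         void *context);

That way the cache could start with PathRecord only and grow attribute by attribute.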
> Functional requirements:
> * It is clear that the first SA query to cache is PathRecord.
> So if a new client wants to connect to another node, a new PathRecord
> query will not need to be sent to the SA. However, recent work on QoS has pointed out
> that under some QoS schemes a PathRecord should not be shared by different clients
> or even connections. There are several ways to make such a QoS scheme scale.
> Since this is a different discussion topic - I only bring this up so that
> we take into account that caching might also need to be done by a complex key (not just
> SRC/DST ...)
Per the QoS direction, this complex key is indeed part of the enhanced
PathRecord, right?
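If so, the cache key would need to grow with it - something along these lines,
where everything beyond SGID/DGID/PKey is only a guess at what the QoS work may add:

/* Rough guess at a PathRecord cache key once QoS is factored in; the
 * fields beyond sgid/dgid/pkey are speculative. */
struct sa_pr_cache_key {
        union ib_gid    sgid;
        union ib_gid    dgid;
        uint16_t        pkey;
        uint8_t         sl;             /* or a QoS class */
        uint64_t        service_id;     /* if paths become per-service/per-client */
};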
> * Forgive me for bringing the following issue up over and over to the group:
> Multicast Join/Leave should be reference counted. The "SA local agent" could be
> the right place for doing this kind of reference counting (actually, if it does that,
> it probably needs to be located in the kernel - to enable cleanup after killed processes).
The cache itself may need another level of reference counting (even if
invalidation is broadcast).
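By that I mean something like the following in the agent, on top of whatever
the SA itself does (mcast_send_join()/mcast_send_leave() just stand in for the
real MCMemberRecord Set/Delete MADs):

/* Illustrative reference counting of multicast membership in the agent. */
struct mcast_entry {
        union ib_gid    mgid;
        int             refcount;
};

static int agent_join(struct mcast_entry *e)
{
        if (e->refcount++ == 0)
                return mcast_send_join(&e->mgid);    /* first joiner: real join */
        return 0;                                    /* already a member */
}

static void agent_leave(struct mcast_entry *e)
{
        if (--e->refcount == 0)
                mcast_send_leave(&e->mgid);          /* last leaver: real leave */
}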
> * Similarly - "Client re-registration" could be made transparent to clients.
>
> Cache Invalidation:
> Several discussions about PathRecord invalidation have been spawned in the past.
> IMO, it is enough to be notified about the death of local processes, remote port availability (trap 64/65), and
> multicast group availability (trap 66/67) in order to invalidate SA cache information.
I think that it's more complicated than this. As an example, how does
the SA cache know whether a cached path record needs to be changed based
on traps 64/65? It seems to me that this needs to be tightly tied to the
SM/SA.
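Even under the simple interpretation, about all the cache could safely do with
a trap 64/65 is throw away every path touching the affected GID - roughly:

/* Sketch: on trap 64/65 (GID out of / in service), drop every cached
 * PathRecord using the affected GID as either endpoint.  Whether this
 * is sufficient (e.g. for paths rerouted with no trap at all) is
 * exactly the question above.  The cache structures are placeholders. */
static void invalidate_paths_for_gid(struct sa_cache *cache,
                                     const union ib_gid *gid)
{
        struct cache_entry *e, *next;

        for (e = cache->head; e; e = next) {
                next = e->next;
                if (!memcmp(&e->key.sgid, gid, sizeof(*gid)) ||
                    !memcmp(&e->key.dgid, gid, sizeof(*gid)))
                        cache_remove(cache, e);
        }
}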
> So each SA agent could register to obtain this data. But that solution does not scale nicely,
> as the SA needs to send notifications to all nodes (although it is reliable - it could resend until Repressed).
> However, the current IBTA definition of InformInfo (the event forwarding mechanism) does not
> allow for multicast of Report(Notice). The reason is that registration for event forwarding
> is done with Set(InformInfo), which uses the requester's QP and LID as the address for sending
> the matching report. A simple way around that limitation could be to enable the SM to "pre-register"
> a well-known multicast group as the target for event forwarding. One issue, though, would be that UD multicast
> is not reliable and some notifications could get lost. A notification sequence number could be used
> to catch these missed notifications eventually.
A multicast group could be defined for SA caching. The reliability aspects
are another matter, although the Repress messages could be unicast back to the
cache.
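For the lost-report problem, the sequence number check could be as simple as
the following; sa_cache_resync() is a placeholder for whatever recovery (e.g. a
full re-query of the SA) we settle on:

/* Sketch: detect gaps in a notification sequence number carried in each
 * multicast Report(Notice) and fall back to a resync with the SA. */
static uint32_t expected_seq;

static void handle_notice(uint32_t seq, struct ib_mad_notice_attr *notice)
{
        if (seq != expected_seq)
                sa_cache_resync();       /* one or more reports were lost */

        process_notice(notice);          /* invalidate / update the cache */
        expected_seq = seq + 1;
}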
-- Hal
> Eitan