[openib-general] SA cache design
Hal Rosenstock
halr at voltaire.com
Thu Jan 5 07:20:45 PST 2006
Hi Eitan,
On Thu, 2006-01-05 at 07:27, Eitan Zahavi wrote:
> Hi Sean,
>
> This is a great initiative - tackling an important issue.
> I am glad you took this on.
>
> Please see below.
>
> Sean Hefty wrote:
> > I've been given the task of trying to come up with an implementation for
> > an SA cache. The intent is to increase the scalability and performance
> > of the openib stack. My current thoughts on the implementation are
> > below. Any feedback is welcome.
> >
> > To keep the design as flexible as possible, my plan is to implement the
> > cache in userspace. The interface to the cache would be via MADs.
> > Clients would send their queries to the sa_cache instead of the SA
> > itself. The format of the MADs would be essentially identical to those
> > used to query the SA itself. Response MADs would contain any requested
> > information. If the cache could not satisfy a request, the sa_cache
> > would query the SA, update its cache, then return a reply.
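Just so I'm sure I follow the intended flow, here's roughly what I picture the
sa_cache doing with each query it receives. This is only a sketch; sa_cache_lookup(),
sa_cache_insert(), forward_to_sa() and send_reply() are placeholder names, not
existing calls:

/* Sketch of the lookup-or-forward flow described above.  All of the
 * sa_cache_* / forward_to_sa() / send_reply() calls and types are placeholders. */
static void sa_cache_handle_query(struct sa_query *req)
{
        struct sa_record *rec;

        /* Try to satisfy the request from the local cache first. */
        rec = sa_cache_lookup(req->attr_id, req->comp_mask, req->data);
        if (!rec) {
                /* Cache miss: query the real SA and remember the answer. */
                rec = forward_to_sa(req);
                if (rec)
                        sa_cache_insert(req->attr_id, rec);
        }

        /* Reply to the client in the same MAD format the SA would use
         * (rec == NULL would map to an error status in the response). */
        send_reply(req, rec);
}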
> * I think the idea of using MADs to interface with the cache is very good.
> * User space implementation:
> This might also be a good tradeoff between coding and debugging effort versus
> the impact on the number of connections per second. I hope the impact on performance
> will not be too big. Maybe we can take the path of implementing in user space and,
> if the performance penalty turns out to be too high, port it to the kernel.
> * Regarding the sentence: "Clients would send their queries to the sa_cache instead of the SA"
> I would propose that an "SA MAD send switch" be implemented in the core: such a switch
> will enable plugging in the SA cache (I would prefer calling it "SA local agent" due to
> its extended functionality). Once plugged in, this "SA local agent" should be forwarded all
> outgoing SA queries. Once it handles a MAD, it should be able to inject the response through
> the core "SA MAD send switch" as if it arrived from the wire.
> >
> > The benefits that I see with this approach are:
> >
> > + Clients would only need to send requests to the sa_cache.
> > + The sa_cache can be implemented in stages. Requests that it cannot
> > handle would just be forwarded to the SA.
> > + The sa_cache could be implemented on each host, or a select number of
> > hosts.
> > + The interface to the sa_cache is similar to that used by the SA.
> > + The cache would use virtual memory and could be saved to disk.
> >
> > Some drawbacks specific to this method are:
> >
> > - The MAD interface will result in additional data copies and userspace
> > to kernel transitions for clients residing on the local system.
> > - Clients require a mechanism to locate the sa_cache, or need to make
> > assumptions about its location.
> The proposal for an "SA MAD send switch" in the core would resolve this issue.
> No client change would be required, as all MADs are sent through the core, which would
> redirect them to the SA agent ...
I see this as more granular than a complete switch for the entire class.
More like on a per-attribute basis.
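Purely to illustrate the granularity I have in mind, a per-attribute hook into
such a core switch might look something like this (none of these names exist in
the current core; the handler signature is made up):

/* Hypothetical per-attribute hook into a core "SA MAD send switch". */
enum sa_agent_disposition {
        SA_AGENT_HANDLED,       /* agent will inject the response itself   */
        SA_AGENT_PASS           /* forward the MAD to the real SA as usual */
};

typedef enum sa_agent_disposition (*sa_agent_handler_t)(struct ib_mad *mad,
                                                        void *context);

/* Register a local agent for a single SA attribute (e.g. PathRecord);
 * queries for other attributes keep going out to the SA untouched. */
int ib_sa_register_agent(uint16_t attr_id, sa_agent_handler_t handler,
                         void *context);

That way the cache could start with PathRecord only and grow attribute by attribute.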
> Functional requirements:
> * It is clear that the first SA query to cache is PathRecord.
> So if a new client wants to connect to another node, a new PathRecord
> query will not need to be sent to the SA. However, recent work on QoS has pointed out
> that under some QoS schemes a PathRecord should not be shared by different clients
> or even connections. There are several ways to make such a QoS scheme scale.
> Since this is a different discussion topic - I only bring this up so that
> we take into account that caching might also need to be done by a complex key (not just
> SRC/DST ...)
Per the QoS direction, this complex key is indeed part of the enhanced
PathRecord, right?
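If so, the cache key would need to grow with it - something along these lines,
where everything beyond SGID/DGID/PKey is only a guess at what the QoS work may add:

/* Rough guess at a PathRecord cache key once QoS is factored in; the
 * fields beyond sgid/dgid/pkey are speculative. */
struct sa_pr_cache_key {
        union ib_gid    sgid;
        union ib_gid    dgid;
        uint16_t        pkey;
        uint8_t         sl;             /* or a QoS class */
        uint64_t        service_id;     /* if paths become per-service/per-client */
};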
> * Forgive me for bringing the following issue up over and over to the group:
> Multicast Join/Leave should be reference counted. The "SA local agent" could be
> the right place for doing this kind of reference counting (actually, if it does that,
> it probably needs to be located in the kernel - to enable cleanup after killed processes).
The cache itself may need another level of reference counting (even if
invalidation is broadcast).
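By that I mean something like the following in the agent, on top of whatever
the SA itself does (mcast_send_join()/mcast_send_leave() just stand in for the
real MCMemberRecord Set/Delete MADs):

/* Illustrative reference counting of multicast membership in the agent. */
struct mcast_entry {
        union ib_gid    mgid;
        int             refcount;
};

static int agent_join(struct mcast_entry *e)
{
        if (e->refcount++ == 0)
                return mcast_send_join(&e->mgid);    /* first joiner: real join */
        return 0;                                    /* already a member */
}

static void agent_leave(struct mcast_entry *e)
{
        if (--e->refcount == 0)
                mcast_send_leave(&e->mgid);          /* last leaver: real leave */
}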
> * Similarly - "Client re-registration" could be made transparent to clients.
>
> Cache Invalidation:
> Several discussions about PathRecord invalidation have been spawned in the past.
> IMO, it is enough to be notified about the death of local processes, remote port availability (trap 64/65), and
> multicast group availability (trap 66/67) in order to invalidate SA cache information.
I think that it's more complicated than this. As an example, how does
the SA cache know whether a cached path record needs to be changed based
on traps 64/65? It seems to me that this needs to be tightly tied to the
SM/SA.
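Even under the simple interpretation, about all the cache could safely do with
a trap 64/65 is throw away every path touching the affected GID - roughly:

/* Sketch: on trap 64/65 (GID out of / in service), drop every cached
 * PathRecord using the affected GID as either endpoint.  Whether this
 * is sufficient (e.g. for paths rerouted with no trap at all) is
 * exactly the question above.  The cache structures are placeholders. */
static void invalidate_paths_for_gid(struct sa_cache *cache,
                                     const union ib_gid *gid)
{
        struct cache_entry *e, *next;

        for (e = cache->head; e; e = next) {
                next = e->next;
                if (!memcmp(&e->key.sgid, gid, sizeof(*gid)) ||
                    !memcmp(&e->key.dgid, gid, sizeof(*gid)))
                        cache_remove(cache, e);
        }
}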
> So each SA agent could register to obtain this data. But that solution does not scale nicely,
> as the SA needs to send notifications to all nodes (although it is reliable - it could resend until Repressed).
> However, the current IBTA definition of InformInfo (the event forwarding mechanism) does not
> allow for multicast of Report(Notice). The reason is that registration for event forwarding
> is done with Set(InformInfo), which uses the requester's QP and LID as the address for sending
> the matching report. A simple way around that limitation could be to enable the SM to "pre-register"
> a well-known multicast group as the target for event forwarding. One issue, though, would be that UD multicast
> is not reliable and some notifications could get lost. A notification sequence number could be used
> to catch these missed notifications eventually.
A multicast group could be defined for SA caching. The reliability aspects
are another matter, although the Repress messages could be unicast back to the
cache.
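For the lost-report problem, the sequence number check could be as simple as
the following; sa_cache_resync() is a placeholder for whatever recovery (e.g. a
full re-query of the SA) we settle on:

/* Sketch: detect gaps in a notification sequence number carried in each
 * multicast Report(Notice) and fall back to a resync with the SA. */
static uint32_t expected_seq;

static void handle_notice(uint32_t seq, struct ib_mad_notice_attr *notice)
{
        if (seq != expected_seq)
                sa_cache_resync();       /* one or more reports were lost */

        process_notice(notice);          /* invalidate / update the cache */
        expected_seq = seq + 1;
}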
-- Hal
> Eitan