[openib-general] SA cache design

Thu Jan 5 13:41:02 PST 2006

> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> 
> I've been given the task of trying to come up with an 
> implementation for an SA 
> cache.  The intent is to increase the scalability and 
> performance of the openib 
> stack.  My current thoughts on the implementation are below.  
> Any feedback is 
> welcome.

Sean, this is great.  This feature is near and dear to me and is very important to large fabric scalability.  If you look in contrib in the infinicon area, you will see a version of an SA replica which we implemented in the linux_discovery tree.  The version in SVN is a little dated, but has the major features and capabilities.  If you find it useful, I could provide a more up-to-date version of that component for your reference.

Some features of it (which you should consider or possibly use as reference code):
- It maintains a full replica of:
	- All Node Records
	- Path Records relevant to this Node (where this node is Source)
	- Device Management Agent records for IOUs, IOCs and Service Records
	- even for a large cluster, the footprint of the above will be < 1MB

- It is implemented in kernel mode
	- while user mode may help during initial debug, it will be important for
		kernel mode ULPs such as SRP, IPoIB and SDP to also make use of these records

- It is in fact a replica, not a cache.  It maintains an up-to-date replica using
	the following techniques:
	- registers for SA GID in/out of service notices
		- when received, such a notice triggers a query for information about that node only
	- schedules a periodic full SA query
		- if notices are successfully registered for, the query runs at a slow pace (once every 10 minutes by default, but it's configurable)
		- if notices are not successfully registered for, the query runs at a faster pace (once a minute, but it's configurable)
		- since notices are unreliable, the periodic sweep is needed to cover for lost notices; however, the SA should resend notices which are not responded to

- In addition, for CAs, it performs IOU, IOC and Service record queries and replicates the results
	- this allows for very fast access to IOU/IOC/Service record info by drivers like SRP
	- hence allowing for faster reconnection and failure recovery handling

- It can handle SA outages and still respond to queries while the SA is down, while the SA is slow, or while the synchronization process is being performed (e.g. it directs all its queries into a temporary replica and then updates the main replica; hence if the queries fail or take a long time, the main replica is still available and reasonably accurate).

- I like the idea of using the same API for SA queries and allowing an SA mux to choose whether to query the replica or the actual SA.  Hence if later versions choose to extend what is maintained in the replica, the change would be transparent to applications.
	- The API could allow for a flag to force a query against the replica or against the actual SA, with the default being to allow the "SA mux" to select which to use
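
To make the replica and "SA mux" points above a bit more concrete, here is a rough sketch in Python.  It is only an illustration, not the code in our tree: the names Replica, SaMux, Source and sweep_interval are made up for this example, and fetch_all / query_sa are hypothetical callables standing in for the real SA queries.  The sweep periods are the defaults mentioned above.

```python
import enum

# Sweep periods from the defaults described above, in seconds.
SLOW_SWEEP_S = 600   # notices registered: once every 10 minutes
FAST_SWEEP_S = 60    # notices not registered: once a minute


def sweep_interval(notices_registered):
    """Pick the periodic full-query interval from the notice registration state."""
    return SLOW_SWEEP_S if notices_registered else FAST_SWEEP_S


class Source(enum.Enum):
    AUTO = 0      # let the mux decide (default)
    REPLICA = 1   # force the query against the local replica
    SA = 2        # force the query against the actual SA


class Replica:
    def __init__(self):
        self.records = {}   # main replica, stays readable during a sweep

    def sync(self, fetch_all):
        """Fill a temporary copy first; replace the main replica only on success."""
        try:
            staged = dict(fetch_all())   # hypothetical full SA sweep, may fail
        except Exception:
            return False                 # keep serving the previous records
        self.records = staged
        return True

    def lookup(self, key):
        return self.records.get(key)


class SaMux:
    def __init__(self, replica, query_sa):
        self.replica = replica
        self.query_sa = query_sa   # hypothetical callable: key -> record

    def query(self, key, source=Source.AUTO):
        if source is Source.SA:
            return self.query_sa(key)
        hit = self.replica.lookup(key)
        if hit is not None or source is Source.REPLICA:
            return hit
        return self.query_sa(key)   # AUTO: fall back to the SA on a miss
```

In this sketch a failed sweep leaves the old records in place, so callers keep getting reasonably accurate answers while the SA is down, and the caller only sees one query() entry point regardless of where the answer comes from.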

> 
> To keep the design as flexible as possible, my plan is to 
> implement the cache in 
> userspace.  The interface to the cache would be via MADs.  
> Clients would send 
> their queries to the sa_cache instead of the SA itself.  The 
> format of the MADs 
> would be essentially identical to those used to query the SA 
> itself.  Response 
> MADs would contain any requested information.  If the cache 
> could not satisfy a 
> request, the sa_cache would query the SA, update its cache, 
> then return a reply.

- In our stack we had a separate, more advanced SA query API (referred to as the Subnet Driver API).  This has evolved significantly since the old Intel IbAccess days, but still has similarities.  It handled all the details of the query, including retries (as specified by the caller), timeouts and even multi-level queries (get path records based on Node GUIDs, etc.).  It also handled the RMPP aspects and hid the intermediate RMPP headers and control protocol.  You may want to consider defining and using such an API instead of MADs, lest the user of the SA replica also need to implement RMPP itself.  Given such an API, the implementation could choose to query the actual SA or the replica and hide the RMPP details in the SA query case.
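
As a rough idea of the shape such an API could take, here is a small Python sketch.  It is not the Subnet Driver API itself: sa_query, paths_for_node_guid and the request tuples are invented for illustration, and send is a hypothetical transport callable that stands in for the MAD/RMPP layer (it returns the already reassembled list of records, or raises TimeoutError when no response arrives in time).

```python
def sa_query(send, request, retries=3, timeout_s=1.0):
    """Issue one query, retrying on timeout as specified by the caller.

    `send` is a hypothetical callable (request, timeout_s) -> list of
    records.  The caller never sees MADs or RMPP segments.
    """
    last_error = None
    for _attempt in range(1 + retries):
        try:
            return send(request, timeout_s)
        except TimeoutError as err:
            last_error = err        # remember the failure and try again
    raise last_error                # all attempts timed out


def paths_for_node_guid(send, node_guid, **opts):
    """Multi-level query: node GUID -> port GIDs -> path records."""
    paths = []
    # The request tuples below are an assumed encoding for this sketch only.
    for gid in sa_query(send, ("ports_of_node", node_guid), **opts):
        paths.extend(sa_query(send, ("paths_to_port", gid), **opts))
    return paths
```

Because the caller only hands over a request and a retry policy, the same call could be served from the replica or from the actual SA without the caller changing.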

Todd Rimmer