[openib-general] [RFC] Notice/InformInfo event reporting

Rimmer, Todd trimmer at silverstorm.com
Mon Oct 16 15:02:34 PDT 2006


> From: Sean Hefty [mailto:mshefty at ichips.intel.com]
> Sent: Monday, October 16, 2006 5:33 PM
> To: Rimmer, Todd; Matt Leininger
> Cc: openib
> Subject: Re: [openib-general] [RFC] Notice/InformInfo event reporting
> 
> Rimmer, Todd wrote:
> > My recommendation is option 2.
> 
> Thanks for the response.
> 
> > In large fabrics the SA can be a bottleneck.  It is best for an end
> > node to register with the SA only for the events which are of actual
> > interest to the end node.
> 
> Which part of the SA is the bottleneck?  Is it the sending of MADs, or
> the processing of events to determine which end nodes are interested
> in the event?
Both can be bottlenecks in a big fabric.  Since the SA must always
determine which end nodes are registered for a given event, the fewer
registrations there are, the better.  Even if all the hosts are
registered, there will be other nodes (switches, TCAs, etc.) which are
not, so the SA will always need to check its list of who to send to.
Since the notice is not a broadcast, the SA must send a separate packet
to each end node.

Each notice will then get a response from each end node, which must be
correlated with the outstanding notices so the SA can determine which
notices need to be resent versus which were acknowledged.
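
To make that bookkeeping concrete, here is a minimal sketch in C of
what dispatching a single notice costs the SA: a match check per
subscriber, a unicast Report() per match, and a pending entry per
Report() awaiting its ReportResp().  All type and helper names here are
illustrative assumptions, not code from OpenSM or any shipping SA.

#include <stddef.h>
#include <stdint.h>

struct ib_notice;                       /* the Notice to be reported */
struct subscription {
    uint16_t lid;                       /* subscriber's LID */
    /* ... other matching criteria ... */
};

/* Hypothetical helpers assumed to exist in the SA. */
int subscription_matches(const struct subscription *s,
                         const struct ib_notice *n);
uint64_t send_report_mad(uint16_t dest_lid, const struct ib_notice *n);
void queue_pending_report(uint64_t tid, uint16_t dest_lid);

/* For each event the SA walks its whole subscription list, unicasts a
 * Report(Notice) to every match, and records the transaction ID so the
 * eventual ReportResp() can be matched and unacknowledged reports
 * resent. */
void sa_dispatch_notice(const struct subscription *subs, size_t n_subs,
                        const struct ib_notice *notice)
{
    for (size_t i = 0; i < n_subs; i++) {
        if (!subscription_matches(&subs[i], notice))
            continue;                   /* per-event filtering cost */
        uint64_t tid = send_report_mad(subs[i].lid, notice);
        queue_pending_report(tid, subs[i].lid); /* await ReportResp() */
    }
}

Both the per-event walk of the subscription list and the pending-report
table grow with fabric size, which is the scaling concern above.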

If you consider a large fabric (say 2000+ nodes) and all the events the
SA can generate (at least four: GID in/out of service and multicast
group in/out of service), that can be a big, bursty load on the SA.
For a rough sense of scale: if all 2000 end nodes subscribe to GID in
service and all 2000 ports come up at boot, the SA must send on the
order of 2000 x 2000 = 4,000,000 Report MADs and correlate as many
responses.  Factor in the nodes reacting to those notices (for example,
GID in service may trigger path record queries), and even more work
lands on the SA.

Most HCAs don't optimize the GSI datapath, so packet rates for SA MADs
are lower than what might be observed on ordinary UD or RC QPs.

> 
> My thinking was that if events are rare, then having the SA simply
> forward the events to the end nodes saves processing time on the SA.
> So, we can trade off SA processing by sending more MADs.  I'm not sure
> which is worse.
In a functioning fabric, events will be rare.  However, it's when you
first boot the fabric, reboot the SM, or perform similar "start up"
actions that things get really busy.

> 
> > With regards to "duplicating dispatching code on every node", rather
> > than duplication, think of this as "distributing event dispatching
> > code among the interested nodes".  Thinking of it in these terms
> > makes option 2 stand out as more scalable.
> 
> To provide the highest level of filtering at the SA, we need an
> interface based on InformInfo.  Trying to reference count at that
> level would be difficult.  (E.g. client 1 wants events for LIDs 2-25,
> client 2 LIDs 3-4, client 3 LIDs 2-25, client 4 LIDs 15-30, etc.)  I'm
> not sure we need an interface this complex.  It increases the
> processing requirements of the SA, and may increase the number of MADs
> that it needs to send to a given node.  (Unless we start trying to be
> really clever with the registration.)
> 
> I was thinking of letting clients register for a particular "class" of
> event, then dispatching the events among the registered clients.  But
> I'm still uncertain about how to define event classes.
> 
> Some expected usage models would be helpful.

In my experience, few clients will filter by LID.  For example, a
client interested in GID in service would want to know about all LIDs.
A client such as IPoIB would be interested in all multicast groups.  So
perhaps the registration with the SA should be for "all LIDs", letting
the client filter by LID as needed.

So my interpretation of option 2 is that the end node registers once
with the SA for "all LIDs" for each event its clients are interested
in.  The end node can then filter appropriately (filtering at the
client may be best).
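
For reference, IBA's InformInfo already supports this style of
registration: if memory serves, a LIDRangeBegin of 0xFFFF subscribes
for events on all LIDs.  Below is a rough C sketch; the field layout is
paraphrased from the spec, and the struct and function names are made
up for illustration, not taken from any real stack's headers.

#include <stdint.h>
#include <string.h>

/* Paraphrased subset of the IBA InformInfo attribute. */
struct inform_info {
    uint8_t  gid[16];          /* zero: subscribe by LID range instead */
    uint16_t lid_range_begin;  /* 0xFFFF = all LIDs */
    uint16_t lid_range_end;    /* ignored when begin is 0xFFFF */
    uint8_t  is_generic;       /* 1 = generic (SM) traps */
    uint8_t  subscribe;        /* 1 = subscribe, 0 = unsubscribe */
    uint16_t trap_type;        /* 0xFFFF = any type */
    uint16_t trap_number;      /* e.g. 64/65 = GID in/out of service,
                                  66/67 = mcast group created/deleted */
};

/* Build one "all LIDs" subscription for a given trap; the end node
 * then does any LID filtering itself when a Report() arrives. */
void build_all_lid_subscription(struct inform_info *ii,
                                uint16_t trap_number)
{
    memset(ii, 0, sizeof(*ii));
    ii->lid_range_begin = 0xFFFF;  /* one registration covers everything */
    ii->is_generic      = 1;
    ii->subscribe       = 1;
    ii->trap_type       = 0xFFFF;
    ii->trap_number     = trap_number;
}

With this shape, the end node needs exactly one SA registration per
trap number regardless of how many local clients are listening.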

In general I have found that only a few clients use such events: IPoIB,
to manage multicast subscriptions (joining as send-only for new
groups), and SA caches/replicas, to keep their cache/replica
synchronized.

In the SilverStorm stack we created an API for a client to subscribe to
a notice.  It let the client specify the trap number, the local HCA
port the subscription applied to (in case multi-port HCAs sit on
different fabrics), and information for a callback to the client (a
client context void* and a function).  The callback provided the client
context void*, the actual NOTICE from the SA, and the HCA port it
arrived on.

The API in the stack dealt with all the issues of remaining subscribed
(SA reregistration, port disconnect/reconnect, etc.), so the client
merely subscribed, got notice callbacks, and later unsubscribed.  In
this style of API, any LID-based filtering would be done in the client
itself.
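
As a rough illustration of the shape of that API, here is a C sketch;
these names and signatures are reconstructed from the description
above, not the actual SilverStorm headers.

#include <stdint.h>

struct ib_notice;      /* the NOTICE exactly as received from the SA */
struct hca_port;       /* identifies one local HCA port / fabric */
struct notice_handle;  /* opaque handle returned by subscribe */

/* Callback: client context, the notice, and the port it arrived on. */
typedef void (*notice_callback_t)(void *client_context,
                                  const struct ib_notice *notice,
                                  const struct hca_port *port);

/* Subscribe for one trap number on one local port; the stack keeps the
 * SA subscription alive (reregistration, port bounce) until
 * notice_unsubscribe() is called. */
int notice_subscribe(uint16_t trap_number,
                     struct hca_port *port,
                     notice_callback_t cb,
                     void *client_context,
                     struct notice_handle **handle_out);

int notice_unsubscribe(struct notice_handle *handle);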

Todd Rimmer



