[openib-general] Re: ib_mad: Scenarios for returning posted send MADs

Mon Oct 4 12:52:08 PDT 2004

On Mon, 04 Oct 2004 15:34:51 -0400
Hal Rosenstock <halr at voltaire.com> wrote:

> I am pretty sure there is a window here as follows:
> First, deregistration cancels the MAD removing it from the agent send
> list.
> ib_mad_complete_send_wr is invoked some time later and never checks for
> the send WR still being on the agent send list. It just assumes it is.
> It potentially makes a send callback.

Deregistration removes the mad_send_wr from the agent send list only if its reference count is zero.  A reference is held on the mad_send_wr from the time the work request is posted to the port until its completion is reported.  So you should never get a callback for a mad_send_wr unless its reference count is at least one.
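
To make the reference counting rule concrete, here is a minimal single-threaded sketch.  It is only a model, not the ib_mad code: the struct, its fields and the model_* function names are all made up for illustration, and the locking that a real implementation would need is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a send work request; not the real mad_send_wr. */
struct model_send_wr {
    int refcount;       /* references held while posted to the port */
    bool on_send_list;  /* still tracked on the agent send list */
    int callbacks;      /* number of send callbacks delivered */
};

/* Posting to the port takes a reference that lasts until completion. */
static void model_post_send(struct model_send_wr *wr)
{
    wr->on_send_list = true;
    wr->refcount++;
}

/* Deregistration may remove the request from the list only when nothing
 * references it.  Returns true if the request was removed. */
static bool model_dereg_cancel(struct model_send_wr *wr)
{
    if (wr->refcount != 0)
        return false;   /* still in flight: deregistration has to wait */
    wr->on_send_list = false;
    return true;
}

/* Completion: the posted reference guarantees the request is still on
 * the list, so the callback is safe.  The reference is dropped after. */
static void model_complete_send(struct model_send_wr *wr)
{
    assert(wr->refcount >= 1);
    assert(wr->on_send_list);
    wr->callbacks++;    /* stands in for the client's send callback */
    wr->refcount--;
}
```

In this model a cancel attempted between model_post_send() and model_complete_send() is refused, which is the property that should close the window described above.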

> Aren't some errors fine grained and pertain only to the WR supplied
> whereas other errors are coarser (like fatal and general) and might
> apply to something larger (perhaps the port but maybe the QP) ? I wonder
> whether there is any assistance in the Mellanox documentation as to
> which errors should be treated how.

I was referring to errors that apply to a single work request only.  For fatal errors that we cannot recover from, we may need a way to tell the user that their mad_agent is no longer operational.

> > It would help in this case for the port layer code 
> > just return completions for all queued work requests to the MAD 
> > agents, and let the MAD agent code deal with the issue.
> 
> True for most errors. Not sure about fatal and general errors yet.

I think it would depend on the error code that was reported in the send_mad_wc.  If the error code is flushed, the mad_agent could just repost the send.  If it is a fatal error, the mad_agent should complete the MAD to the client.
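
Roughly, the decision could look like the sketch below.  The enum values and function names are hypothetical stand-ins, not the actual completion status codes, and treating every other status like a fatal error (complete to the client) is my assumption for the sketch.

```c
#include <assert.h>

/* Hypothetical status values; the real codes are not reproduced here. */
enum model_wc_status {
    MODEL_WC_SUCCESS,
    MODEL_WC_FLUSHED,
    MODEL_WC_FATAL
};

enum model_action {
    MODEL_COMPLETE_TO_CLIENT,
    MODEL_REPOST_SEND
};

/* Decide what the MAD agent code does with a send completion. */
static enum model_action model_handle_send_wc(enum model_wc_status status)
{
    switch (status) {
    case MODEL_WC_FLUSHED:
        /* The request never went out, so it can be posted again. */
        return MODEL_REPOST_SEND;
    case MODEL_WC_FATAL:
    case MODEL_WC_SUCCESS:
    default:
        /* Assumption: anything else is reported to the client. */
        return MODEL_COMPLETE_TO_CLIENT;
    }
}

/* Small usage check for the mapping above. */
static void model_check_dispatch(void)
{
    assert(model_handle_send_wc(MODEL_WC_FLUSHED) == MODEL_REPOST_SEND);
    assert(model_handle_send_wc(MODEL_WC_FATAL) == MODEL_COMPLETE_TO_CLIENT);
}
```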

> > > 3. The final scenario is board (not currently possible) or module
> > > removal. My concern here is about potential send callbacks (indicating
> > > FLUSHED) to a potentially stale MAD agent. When the module is removed
> > > non forceably, the clients (upper layer modules) would need to be
> > > removed first, which should cause the proper deregistration (and these
> > > MADs would be cancelled so there would be none to cleanup). I am not
> > > sure what the rules for proper behavior are on forceable module removal.
> > > Board removal would be similar to this (the forceable module removal
> > > case).
> > 
> > Deregistration is a synchronous process, so will wait until all 
> > send MADs have completed.  If this isn't happening, then the 
> > referencing counting is off somewhere.
> 
> I think deregistration is fine (short of issue 1 which I think is
> readily fixable). I was more asking about the asynchronous scenario here
> (forced module (or board) removal) where that isn't the case.

Unless there's a bug in the code, I don't believe that we can have send callbacks to stale MAD agents.  If you're trying to have the code deregister on behalf of a client, that would be impossible.  Clients should receive some sort of removal notification event and would need to deregister in response to that event.
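
As a sketch of what I have in mind: no such notification interface exists yet, so every name below is hypothetical.  The removal path only notifies each client, and the client itself performs the deregistration from its handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a registered MAD agent. */
struct model_agent {
    bool registered;
};

/* Hypothetical client that owns one agent and a removal handler. */
struct model_client {
    struct model_agent agent;
    void (*on_removal)(struct model_client *client);
};

/* Stand-in for synchronous deregistration by the owning client. */
static void model_deregister(struct model_agent *agent)
{
    assert(agent->registered);
    agent->registered = false;
}

/* Client's response to the removal event: deregister its own agent. */
static void model_client_removal_handler(struct model_client *client)
{
    if (client->agent.registered)
        model_deregister(&client->agent);
}

/* Removal path: notify every client, never deregister on their behalf. */
static void model_notify_removal(struct model_client **clients, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (clients[i]->on_removal)
            clients[i]->on_removal(clients[i]);
    }
}
```

Once model_notify_removal() returns, every client with a handler has deregistered, so no agent is left registered to receive a callback.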