[openib-general] Re: ib_mad: Scenarios for returning posted send MADs
Sean Hefty
mshefty at ichips.intel.com
Mon Oct 4 12:52:08 PDT 2004
On Mon, 04 Oct 2004 15:34:51 -0400
Hal Rosenstock <halr at voltaire.com> wrote:
> I am pretty sure there is a window here as follows:
> First, deregistration cancels the MAD removing it from the agent send
> list.
> ib_mad_complete_send_wr is invoked some time later and never checks for
> the send WR still being on the agent send list. It just assumes it is.
> It potentially makes a send callback.
The deregistration only removes the mad_send_wr from the agent send list if its reference count is zero. A reference is held on the mad_send_wr from the time that a work request is posted to the port, until a completion is reported. So, you should never get a callback for a mad_send_wr, unless its reference count is at least one.
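To make the reference-count rule concrete, here is a minimal user-space sketch. It is not the ib_mad code: struct send_wr_sketch and the three helper functions are hypothetical names invented for illustration, and the sketch does not model what happens to the list entry after the completion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mad_send_wr; not the real ib_mad structure. */
struct send_wr_sketch {
    int refcount;      /* nonzero while the work request is posted to the port */
    bool on_send_list; /* still linked on the agent send list */
    int callbacks;     /* number of send callbacks delivered */
};

/* Posting to the port takes a reference and links the request on the list. */
static void post_send(struct send_wr_sketch *wr)
{
    wr->refcount++;
    wr->on_send_list = true;
}

/*
 * Cancel path used by deregistration: the request is unlinked only when
 * no reference is held. Returns true if the request was removed.
 */
static bool cancel_send(struct send_wr_sketch *wr)
{
    if (wr->refcount > 0)
        return false; /* completion still outstanding, leave it on the list */
    wr->on_send_list = false;
    return true;
}

/* Completion path: a callback is only made while a reference is held. */
static void complete_send(struct send_wr_sketch *wr)
{
    assert(wr->refcount >= 1); /* the invariant described above */
    wr->refcount--;
    wr->callbacks++;
}
```

With this model, a cancel that races ahead of the completion leaves the request on the list, so the later completion still finds it there.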
> Aren't some errors fine grained and pertain only to the WR supplied
> whereas other errors are coarser (like fatal and general) and might
> apply to something larger (perhaps the port but maybe the QP) ? I wonder
> whether there is any assistance in the Mellanox documentation as to
> which errors should be treated how.
I was referring to errors that applied to a single work request only. For fatal errors that we cannot recover from, we may need a way to report such errors to the user to indicate that their mad_agent is no longer operational.
> > It would help in this case for the port layer code
> > to just return completions for all queued work requests to the MAD
> > agents, and let the MAD agent code deal with the issue.
>
> True for most errors. Not sure about fatal and general errors yet.
I think it would depend on the error code that was reported in the send_mad_wc. If the return code is flushed, the mad_agent could just repost the send. If the return code is a fatal error, the mad_agent should complete the MAD to the client.
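Roughly, the dispatch I have in mind looks like the sketch below. The enum values and handle_send_status() are hypothetical names for illustration only, not the actual status codes or functions in the MAD layer, and treating unknown codes as errors is my assumption.

```c
/* Hypothetical status values; the real codes come from the work completion. */
enum send_status_sketch {
    SEND_OK,
    SEND_FLUSHED,
    SEND_FATAL
};

/* What the mad_agent does with a completed send. */
enum agent_action_sketch {
    ACTION_COMPLETE_OK,    /* report success to the client */
    ACTION_REPOST,         /* post the same send again */
    ACTION_COMPLETE_ERROR  /* report the failure to the client */
};

static enum agent_action_sketch handle_send_status(enum send_status_sketch status)
{
    switch (status) {
    case SEND_OK:
        return ACTION_COMPLETE_OK;
    case SEND_FLUSHED:
        /* The request never went out, so it can simply be reposted. */
        return ACTION_REPOST;
    case SEND_FATAL:
    default:
        /* Assumption: anything unrecognized is treated like a fatal error. */
        return ACTION_COMPLETE_ERROR;
    }
}
```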
> > > 3. The final scenario is board (not currently possible) or module
> > > removal. My concern here is about potential send callbacks (indicating
> > > FLUSHED) to a potentially stale MAD agent. When the module is removed
> > > non-forcibly, the clients (upper layer modules) would need to be
> > > removed first, which should cause the proper deregistration (and these
> > > MADs would be cancelled so there would be none to cleanup). I am not
> > > sure what the rules for proper behavior are on forcible module removal.
> > > Board removal would be similar to this (the forcible module removal
> > > case).
> >
> > Deregistration is a synchronous process, so will wait until all
> > send MADs have completed. If this isn't happening, then the
> > reference counting is off somewhere.
>
> I think deregistration is fine (short of issue 1 which I think is
> readily fixable). I was more asking about the asynchronous scenario here
> (forced module (or board) removal) where that isn't the case.
Unless there's a bug in the code, I don't believe that we can have send callbacks to stale MAD agents. If you're trying to have the code deregister on behalf of a client, that would be impossible. Clients should receive some sort of removal notification event and would need to deregister in response to that event.
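As a rough sketch of that flow, assuming a removal event is delivered to each client: every name below (agent_sketch, client_sketch, notify_removal, and so on) is hypothetical, and the synchronous wait for outstanding sends is reduced to clearing a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical MAD agent state; not the real ib_mad_agent. */
struct agent_sketch {
    bool registered;
    int pending_sends;
};

/* Hypothetical client (upper layer module) that owns one agent. */
struct client_sketch {
    struct agent_sketch *agent; /* NULL once the client has deregistered */
};

/*
 * Deregistration is synchronous: by the time it returns, no sends remain
 * outstanding. The real wait is modeled here by clearing the counter.
 */
static void deregister_agent(struct agent_sketch *agent)
{
    agent->pending_sends = 0;
    agent->registered = false;
}

/* The client, not the MAD layer, deregisters in response to the event. */
static void client_on_removal(struct client_sketch *client)
{
    if (client->agent != NULL) {
        deregister_agent(client->agent);
        client->agent = NULL;
    }
}

/* Removal path: notify every client before the device goes away. */
static void notify_removal(struct client_sketch *clients, size_t count)
{
    for (size_t i = 0; i < count; i++)
        client_on_removal(&clients[i]);
}
```

Once every client has handled the event, there is no registered agent left for a late send callback to reach.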