[openib-general] [PATCH] add cq error events

Thu Sep 22 15:32:44 PDT 2005

> -----Original Message-----
> From: Sean Hefty [mailto:mshefty at ichips.intel.com] 
> Sent: Thursday, September 22, 2005 3:12 PM
> To: Caitlin Bestler
> Cc: Michael S. Tsirkin; Roland Dreier; openib-general at openib.org
> Subject: Re: [openib-general] [PATCH] add cq error events
> 
> Caitlin Bestler wrote:
> > If the semantics were defined such that an overrun meant that an
> > event had been lost, but that the CQ was still intact, then the user
> > can definitely adjust and continue.
> 
> My understanding is that a CQ overrun error is fatal.  No 
> additional entries may be added to that CQ.  All QPs 
> associated with the CQ will generate an error the next time 
> that they try to access it.  And outstanding completions on 
> the CQ may not be retrievable.  See IB spec 11.6.3.2, C11-38.
> 

Admittedly I was paying more attention to the iWARP specs
on this, but my reading of that section in the IB verbs 
was as follows:

> C11-38: The CI shall generate a CQ Error when a CQ overrun is detected.

The CI shall generate a CQ Error when it detects that it cannot
place a work completion into a CQ.

> This condition will result in an Affiliated Asynchronous Error for any
> associated Work Queues when they attempt to use that CQ.

While this condition (there not being room in the CQ) persists,
an Affiliated Asynchronous Error must be generated for any QP that
is prevented from placing a work completion in the CQ. (Failure
to place the completion inherently means that the ordering
guarantees for the connection cannot be complied with, so
the connection cannot recover.)

> Completions can no longer be added to the CQ.

You cannot recover, so the connection is broken. Since the CQ
was already full, there is no point in trying to flush the
work requests that would otherwise have been flushed: their
completions have nowhere to go.

> It is not guaranteed that completions present in the CQ at
> the time the error occurred can be retrieved. Possible causes
> include a CQ overrun or a CQ protection error.

The implementation is free to detect overflow *after* it has
overwritten an older work completion. It is not constrained
to guarantee that the CQ is intact other than for the lost
work completion.

But it is not required to *prevent* those other completions
from being retrieved, so a more robust CQ is certainly legal.
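
To make the "detect after overwriting" case concrete, here is a toy
model of it. This is only a sketch: struct toy_cq, toy_cq_add() and
CQ_DEPTH are invented for illustration and do not correspond to any
real HCA or verbs interface. The producer writes the new entry first
and only then notices that the ring was already full, so the oldest
unread completion is gone by the time the error is raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CQ_DEPTH 4  /* hypothetical depth, tiny on purpose */

/* Toy completion ring; not a real hardware or verbs structure. */
struct toy_cq {
    int entries[CQ_DEPTH];
    unsigned int head;   /* next slot the producer writes */
    unsigned int tail;   /* next slot the consumer reads */
    bool overrun;        /* set once an overflow has been noticed */
};

/* Producer side: write first, check for overflow afterwards. */
static void toy_cq_add(struct toy_cq *cq, int wc)
{
    bool was_full;

    assert(cq != NULL);
    was_full = (cq->head - cq->tail) == CQ_DEPTH;

    /* If the ring was full, this clobbers an unread entry. */
    cq->entries[cq->head % CQ_DEPTH] = wc;
    cq->head++;

    if (was_full)
        cq->overrun = true;  /* detected only after the overwrite */
}
```

In this model the fifth toy_cq_add() on an unpolled ring replaces the
first completion and then sets the overrun flag. A more careful
producer could check before writing and leave the old entries intact,
which is the "more robust CQ" case.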

The RDMAC verbs are not much help here:

> The RI is NOT REQUIRED to perform CQ overflow detection or
> protection. Therefore, the CQ overflow error codes in this
> document are OPTIONAL. When an overflow occurs, the results
> are indeterminate. Overflow of a CQ MUST NOT affect QPs which
> do not report Work Completions to that CQ and MUST NOT affect
> other CQs. Consequently, when creating the CQ, the Consumer
> should request enough outstanding Work Requests so that if
> every possible outstanding WR were to complete (such as may
> happen in an error case), there would be room for the CQE on
> the CQ. The RI MUST NOT enforce that every WQE on every Work
> Queue associated with the CQ must have a CQE available for the
> WQE's Work Completion information.

Translation: Only you can prevent CQ overflows.
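
As a rough illustration of that sizing rule, a consumer could add up
the queue depths of every QP that will report to the CQ and request
at least that many entries. This is a sketch under assumptions:
struct qp_depth and min_cq_entries() are names made up here, not part
of any verbs API, and shared receive queues are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one QP that reports to the CQ. */
struct qp_depth {
    unsigned int max_send_wr;  /* send queue depth of the QP */
    unsigned int max_recv_wr;  /* receive queue depth of the QP */
};

/*
 * Worst case: every outstanding WR on every attached QP completes
 * (for example, all are flushed in error) before the consumer polls.
 * The CQ then needs one entry for each of them.
 */
static unsigned int min_cq_entries(const struct qp_depth *qps, size_t nqps)
{
    unsigned int total = 0;
    size_t i;

    assert(qps != NULL || nqps == 0);
    for (i = 0; i < nqps; i++)
        total += qps[i].max_send_wr + qps[i].max_recv_wr;
    return total;
}
```

For two QPs with send/receive depths of 64/128 and 32/32, this asks
for 256 entries. The result would be passed as the requested CQ size
when the CQ is created, before any of the QPs are attached to it.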

The implementation must guarantee that a CQ overflow does not
trash another CQ. Otherwise the Consumer is on their own.
If the CQ is being polite it might tell you that there was an
overflow. If it does, there is no guarantee that you can do anything
with that CQ or any QP that fed it, nor is it guaranteed that there
was any damage.