[ofa-general] Recalculated Queue Sizes caused mthca Catastrophic Errors

Steve Wise swise at opengridcomputing.com
Wed Feb 20 15:36:10 PST 2008


Roland Dreier wrote:
>  > My code creates a CQ of size 2072, and a SQ of size 2056, and a RQ of
>  > size 16.  As you can see, CQ = SQ + RQ.  So, I should never overflow my
>  > CQ.
>  > 
>  > The driver rounds each of these up to the next power of two.  So, we
>  > get a CQ of size 4096, a SQ of size 4096, and an RQ of size 16.
>  > 
>  > As you can see, CQ < SQ + RQ, so it is possible to overflow the CQ.
>  > 
>  > I don't think that this should cause the firmware to generate a
>  > catastrophic error (sounds like a bug in the firmware, if you ask me).
> 
> Yes, as the release notes mention, it appears to be a
> hardware/firmware bug that you get a catastrophic error.  However,
> overflowing a CQ will generate a CQ overrun asynchronous error in the
> best case.
> 
>  > As I said, doubling the queue size solves the problem.  However, it
>  > would be better if the mthca driver did not create the problem in the
>  > first place.  If a QP is being created such that CQ >= SQ + RQ, then
>  > that relationship should be maintained.  Do others agree with me?
> 
> I don't see any problem in rounding up the queue sizes.  Just because
> you got bigger SQ and RQ sizes than you asked for doesn't mean you
> have to use them -- it is the application's responsibility to avoid
> overrunning a CQ.  For the HCA in question, all the queues must be a
> power of 2 in size; the driver can't give you a size smaller than you
> asked for, so there's not really anything better we could do.
> 
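
To make the arithmetic quoted above concrete, here is a minimal C sketch
of the rounding and of the CQ >= SQ + RQ check.  The names round_up_pow2()
and cq_covers_queues() are hypothetical and are not taken from the mthca
driver; the check also assumes that every posted work request produces
exactly one completion.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two, the rounding described in the
 * thread.  Hypothetical helper, not the driver's own code. */
static uint32_t round_up_pow2(uint32_t n)
{
    assert(n >= 1 && n <= 0x80000000u);  /* result must fit in 32 bits */
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Returns 1 if a CQ with cq entries has room for a completion from every
 * slot of a send queue of depth sq and a receive queue of depth rq.
 * Assumption: one completion per posted work request. */
static int cq_covers_queues(uint32_t cq, uint32_t sq, uint32_t rq)
{
    return (uint64_t)cq >= (uint64_t)sq + rq;
}
```

With the requested sizes, cq_covers_queues(2072, 2056, 16) holds.  After
rounding each size independently, the CQ has 4096 entries while the SQ
and RQ together have 4096 + 16 = 4112, so the check fails.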

The driver could enforce the app's requested sizes even though the 
queues are bigger.

But I think the correct answer is that the app should either use the 
sizes it requested and do its flow control against those, or resize the 
CQ after creating the QP and use "posting until it fails" as its flow 
control.  However, for send/recvs, I'm sure the app has to do its own 
flow control anyway.
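
A minimal sketch of the first option, flow controlling on the requested
size rather than on whatever the driver allocated.  The wr_credits
structure and its functions are hypothetical application-side
bookkeeping, not part of the verbs API; the sketch assumes the app takes
a credit before each post and releases one for each completion it polls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-queue credit counter kept by the application. */
struct wr_credits {
    uint32_t limit;        /* depth the app originally requested */
    uint32_t outstanding;  /* posted work requests not yet completed */
};

static void credits_init(struct wr_credits *c, uint32_t requested_depth)
{
    c->limit = requested_depth;
    c->outstanding = 0;
}

/* Call before posting a work request.  Returns 1 and takes a credit if
 * the app is still under its requested depth, 0 if it must wait. */
static int credits_try_take(struct wr_credits *c)
{
    if (c->outstanding >= c->limit)
        return 0;
    c->outstanding++;
    return 1;
}

/* Call once for each completion polled from the CQ. */
static void credits_release(struct wr_credits *c)
{
    assert(c->outstanding > 0);
    c->outstanding--;
}
```

If the SQ counter is initialized with 2056 and the RQ counter with 16,
at most 2072 work requests are outstanding at any time, which matches
the CQ size the app asked for no matter how far the driver rounded the
queues up.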

Steve.



