[ofa-general] Recalculated Queue Sizes caused mthca Catastrophic Errors

Roland Dreier rdreier at cisco.com
Wed Feb 20 15:32:59 PST 2008


 > My code creates a CQ of size 2072, and a SQ of size 2056, and a RQ of
 > size 16.  As you can see, CQ = SQ + RQ.  So, I should never overflow my
 > CQ.
 > 
 > The Driver raises each of these to the next power of two.  So, we get a
 > CQ of size 4096, a SQ of size 4096, and an RQ of size 16.
 > 
 > As you can see, CQ < SQ + RQ, so it is possible to overflow the CQ.
 > 
 > I don't think that this should cause the Firmware to generate a
 > Catastrophic error (sounds like a bug in the firmware, if you ask me).

Yes, as the release notes mention, the catastrophic error appears to
be a hardware/firmware bug.  However, even in the best case,
overflowing a CQ will generate a CQ overrun asynchronous error.

 > As I said, doubling the queue size solves the problem.  However, it
 > would be better if the mthca driver did not create the problem in the
 > first place.  If a QP is being created such that CQ >= SQ + RQ, then
 > that relationship should be maintained.  Do others agree with me?

I don't see any problem in rounding up the queue sizes.  Just because
you got bigger SQ and RQ sizes than you asked for doesn't mean you
have to use them -- it is the application's responsibility to avoid
overrunning a CQ.  For the HCA in question, all the queues must be a
power of 2 in size; the driver can't give you a size smaller than you
asked for, so there's not really anything better we could do.
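
To make the arithmetic concrete, here is a minimal sketch in plain C.
It is not the mthca driver code: round_up_pow2(), cq_cannot_overflow()
and the wr_budget helpers are hypothetical names made up for
illustration.  It assumes the driver simply rounds each size up to the
next power of two, as described above, and shows one way an
application could keep counting against the sizes it asked for rather
than the sizes it got.

```c
#include <assert.h>

/* Hypothetical helper: round n up to the next power of two, which is
 * what this thread says happens to each queue size on this HCA. */
static unsigned int round_up_pow2(unsigned int n)
{
    unsigned int p = 1;

    /* Above 2^31 the shift below would wrap to 0 and never terminate. */
    assert(n <= 0x80000000u);
    while (p < n)
        p <<= 1;
    return p;
}

/* Hypothetical check: nonzero if a CQ of cq_size entries can hold one
 * completion for every slot of a full SQ plus a full RQ. */
static int cq_cannot_overflow(unsigned int cq_size,
                              unsigned int sq_size,
                              unsigned int rq_size)
{
    return cq_size >= sq_size + rq_size;
}

/* Hypothetical application-side accounting: cap outstanding work
 * requests at the size that was REQUESTED, not the rounded-up size. */
struct wr_budget {
    unsigned int limit;        /* requested queue size */
    unsigned int outstanding;  /* posted but not yet completed */
};

/* Returns 1 if another work request may be posted, 0 if the caller
 * must first reap a completion. */
static int wr_budget_take(struct wr_budget *b)
{
    if (b->outstanding >= b->limit)
        return 0;
    b->outstanding++;
    return 1;
}

/* Call once for each completion polled from the CQ. */
static void wr_budget_release(struct wr_budget *b)
{
    if (b->outstanding > 0)
        b->outstanding--;
}
```

With the numbers from this thread, round_up_pow2() turns 2072 and 2056
into 4096 and leaves 16 alone, so cq_cannot_overflow(2072, 2056, 16)
holds for the requested sizes but cq_cannot_overflow(4096, 4096, 16)
does not.  An application that keeps one wr_budget with limit 2056 for
the SQ and one with limit 16 for the RQ never has more than 2072
completions pending, which fits in the CQ whatever the rounding.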

 - R.



