[ofa-general] Recalculated Queue Sizes caused mthca Catastrophic Errors
Roland Dreier
rdreier at cisco.com
Wed Feb 20 15:32:59 PST 2008
> My code creates a CQ of size 2072, an SQ of size 2056, and an RQ of
> size 16. As you can see, CQ = SQ + RQ, so I should never overflow my
> CQ.
>
> The driver rounds each of these up to the next power of two, so we get a
> CQ of size 4096, an SQ of size 4096, and an RQ of size 16.
>
> As you can see, CQ < SQ + RQ, so it is possible to overflow the CQ.
>
> I don't think that this should cause the firmware to generate a
> catastrophic error (sounds like a bug in the firmware, if you ask me).
Yes, as the release notes mention, getting a catastrophic error appears
to be a hardware/firmware bug. However, even in the best case,
overflowing a CQ will generate a CQ overrun asynchronous error.
> As I said, doubling the queue size solves the problem. However, it
> would be better if the mthca driver did not create the problem in the
> first place. If a QP is being created such that CQ >= SQ + RQ, then
> that relationship should be maintained. Do others agree with me?
I don't see any problem in rounding up the queue sizes. Just because
you got bigger SQ and RQ sizes than you asked for doesn't mean you
have to use them; it is the application's responsibility to avoid
overrunning a CQ. For the HCA in question, all the queues must be a
power of 2 in size, and the driver can't give you a smaller size than
you asked for, so there's not really anything better we could do.
- R.