[ofa-general] Recalculated Queue Sizes caused mthca Catastrophic Errors
Roger Spellman
roger at terascala.com
Wed Feb 20 11:50:26 PST 2008
Hello,
I have a Mellanox MT25204, running the latest firmware. A few days ago,
I was getting Catastrophic errors from the firmware. I found the
following in the Release Notes for RHEL-5:
Hardware testing for the Mellanox MT25204 has revealed
that an internal error occurs under certain high-load
conditions. When the ib_mthca driver reports a catastrophic
error on this hardware, it is usually related to an insufficient
completion queue depth relative to the number of outstanding work
requests generated by the user application.
Increasing my CQ size did indeed solve the problem. So, I wanted to
understand why. I think the reason may be a bug in the mthca code that
comes with OFED.
My code creates a CQ of size 2072, an SQ of size 2056, and an RQ of
size 16. As you can see, CQ = SQ + RQ, so I should never overflow my
CQ.
The driver rounds each of these up to the next power of two. So, we get
a CQ of size 4096, an SQ of size 4096, and an RQ of size 16.
Now CQ < SQ + RQ, so it is possible to overflow the CQ.
I don't think this should cause the firmware to generate a catastrophic
error (that sounds like a bug in the firmware, if you ask me).
The CQ's size is increased in the function mthca_create_cq() in the file
mthca_provider.c. The SQ and RQ sizes are increased in the function
mthca_alloc_qp_common() in the file mthca_qp.c if and only if the
function mthca_is_memfree() returns TRUE; this function returns TRUE
when MTHCA_FLAG_MEMFREE is set in dev->mthca_flags, which it is for the
latest firmware release.
As I said, doubling the CQ size solves the problem. However, it would
be better if the mthca driver did not create the problem in the first
place: if a QP is created such that CQ >= SQ + RQ, the driver should
preserve that relationship after rounding. Do others agree?
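Until the driver handles this, an application can work around it by
sizing the CQ against the rounded queue depths rather than the
requested ones. A sketch (safe_cq_size() is my own hypothetical helper,
not a verbs or driver API):

```c
/* Round n up to the next power of two, mimicking the driver's
 * internal rounding of queue sizes. */
static unsigned int next_pow2(unsigned int n)
{
    unsigned int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Request a CQ big enough that CQ >= SQ + RQ still holds after the
 * driver rounds every queue up to a power of two. */
static unsigned int safe_cq_size(unsigned int sq, unsigned int rq)
{
    return next_pow2(sq) + next_pow2(rq);
}

/* For my queues: safe_cq_size(2056, 16) == 4112, which the driver
 * then rounds up to 8192 -- consistent with my observation that
 * doubling the CQ size makes the catastrophic errors go away. */
```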