[ofa-general] Recalculated Queue Sizes caused mthca Catastrophic Errors

Steve Wise swise at opengridcomputing.com
Wed Feb 20 13:37:50 PST 2008


Roger Spellman wrote:
> Hello,
> I have a Mellanox MT25204, running the latest firmware.  A few days ago,
> I was getting Catastrophic errors from the firmware.  I found the
> following in the Release Notes for RHEL-5:
> 
>        Hardware testing for the Mellanox MT25204 has revealed 
>        that an internal error occurs under certain high-load 
>        conditions. When the ib_mthca driver reports a catastrophic 
>        error on this hardware, it is usually related to an insufficient 
>        completion queue depth relative to the number of outstanding work
>        requests generated by the user application.
> 
> Increasing my CQ size did indeed solve the problem.  So, I wanted to
> understand why.  I think the reason may be a bug in the mthca code that
> comes with OFED.  
> 
> My code creates a CQ of size 2072, and a SQ of size 2056, and a RQ of
> size 16.  As you can see, CQ = SQ + RQ.  So, I should never overflow my
> CQ.
> 
> The driver raises each of these to the next power of two.  So, we get a
> CQ of size 4096, an SQ of size 4096, and an RQ of size 16.
> 
> As you can see, CQ < SQ + RQ, so it is possible to overflow the CQ.
> 
> I don't think that this should cause the Firmware to generate a
> Catastrophic error (sounds like a bug in the firmware, if you ask me).
> 
> The CQ's size is increased in the function mthca_create_cq() in the file
> mthca_provider.c.  The SQ and RQ sizes are increased in the function
> mthca_alloc_qp_common() in the file mthca_qp.c if and only if the
> function mthca_is_memfree() returns TRUE; this function returns TRUE
> when MTHCA_FLAG_MEMFREE is set in dev->mthca_flags, which it is for the
> latest firmware release.
> 
> As I said, doubling the queue size solves the problem.  However, it
> would be better if the mthca driver did not create the problem in the
> first place.  If a QP is being created such that CQ >= SQ + RQ, then
> that relationship should be maintained.  Do others agree with me?
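
The arithmetic Roger describes can be checked with a short sketch. The
helper names below (roundup_pow2, qp_worst_case, cq_can_overflow) are
hypothetical and are not taken from the mthca driver; the sketch only
assumes, as the message states, that each requested size is raised to
the next power of two independently.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (n itself if it already is one).
 * Illustrative only; this is not the mthca driver's code. */
static uint32_t roundup_pow2(uint32_t n)
{
    assert(n > 0 && n <= 0x80000000u);
    uint32_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Worst-case number of completions one QP can have outstanding. */
static uint32_t qp_worst_case(uint32_t sq, uint32_t rq)
{
    return sq + rq;
}

/* Nonzero when a CQ with cq_entries slots could be overflowed by the QP. */
static int cq_can_overflow(uint32_t cq_entries, uint32_t sq, uint32_t rq)
{
    return cq_entries < qp_worst_case(sq, rq);
}
```

With the requested sizes (CQ 2072, SQ 2056, RQ 16) cq_can_overflow()
reports no overflow, since 2072 = 2056 + 16.  After rounding each value
separately the sizes become 4096, 4096 and 16, and 4096 < 4096 + 16, so
the same check reports that an overflow is possible.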

The driver cannot really ensure this because the CQ might be used for 
more than one QP.

But this issue still raises the question of how an application
_should_ handle this condition.  I.e., if the app is required to ensure
the CQ is big enough, how does it deal with the case where the driver
allocates a bigger QP?  By resizing the CQ after creating the QP and
discovering via a query that the QP is too big for the CQ?

Steve.
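
One possible application-side answer to the question above, as a rough
sketch only: after creating each QP, read back the depths the driver
actually allocated, sum the worst case over every QP that shares the
CQ, and grow the CQ if it is too small.  The struct and function names
below are hypothetical.  In a libibverbs program the depths would
presumably come from the cap field that ibv_create_qp() updates in the
init attributes (or from ibv_query_qp()), and the resize itself would
be a call to ibv_resize_cq(); whether a given provider supports
resizing is an assumption that would need checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Actual queue depths of one QP, as reported back after creation. */
struct qp_depths {
    uint32_t max_send_wr;
    uint32_t max_recv_wr;
};

/* Sum the worst-case outstanding work requests of every QP on one CQ. */
static uint64_t cq_entries_needed(const struct qp_depths *qps, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)qps[i].max_send_wr + qps[i].max_recv_wr;
    return total;
}

/* Return the size the CQ should be resized to, or 0 if the current
 * size already covers all attached QPs. */
static uint64_t cq_resize_target(uint64_t cq_entries,
                                 const struct qp_depths *qps, size_t n)
{
    assert(qps != NULL || n == 0);
    uint64_t needed = cq_entries_needed(qps, n);
    return needed > cq_entries ? needed : 0;
}
```

For the sizes in this thread, a single QP with depths 4096 and 16 on a
4096-entry CQ gives a resize target of 4112.  Because the sum runs over
all QPs on the CQ, the same check covers the shared-CQ case that the
driver cannot see at QP creation time.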





