[ofw] RE: MT25208 vendor status code translation?

Smith, Stan stan.smith at intel.com
Wed Apr 30 16:36:54 PDT 2008


Leonid Keller wrote:
> See inline
> 
>> -----Original Message-----
>> From: ofw-bounces at lists.openfabrics.org
>> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
>> Sent: Wednesday, April 30, 2008 3:59 AM
>> To: Tzachi Dar
>> Cc: ofw at lists.openfabrics.org
>> Subject: [ofw] RE: MT25208 vendor status code translation?
>> 
>> 
>> Forgot to mention the firmware is 4.08.0200 from vstat.
>> 
>> Smith, Stan wrote:
>>> Hello,
>>>   Can you point me to a document which would translate and describe
>>> a MT25208 vendor status code reported in an IBAL
>>> ib_wc_t.vendor_specific field? The IBAL error reported is
>>> RNR_RETRY_ERR, curious as to what the vendor field value (0x87)
>>> implies. 
> 
> 0x87 is the vendor code for RNR retry exceeded


Thanks for the decode.

Turns out, as suspected, the receiver was not posting receives fast
enough, so the RNR timeout logic kicked in: rnr_retry_cnt was small and
rnr_nak_timeout was short. Increased both values - the problem has gone
away for now.
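
For anyone tuning the same two values, here is a rough sketch in C of
how long a receiver has to post a receive before the sender gives up
with RNR retry exceeded. It assumes rnr_nak_timeout uses the 5-bit RNR
NAK timer encoding from the InfiniBand spec and that an rnr_retry_cnt
of 7 means "retry forever". The table and function names are made up
for illustration and are not IBAL APIs, and the estimate ignores wire
and processing time.

```c
#include <assert.h>
#include <stdint.h>

/* RNR NAK timer encodings (5-bit field), in microseconds.
 * Note that code 0 is the LONGEST delay (655.36 ms), not the shortest. */
static const uint32_t rnr_timer_usec[32] = {
    655360,     10,     20,     30,     40,     60,     80,    120,
       160,    240,    320,    480,    640,    960,   1280,   1920,
      2560,   3840,   5120,   7680,  10240,  15360,  20480,  30720,
     40960,  61440,  81920, 122880, 163840, 245760, 327680, 491520
};

#define RNR_RETRY_INFINITE   7           /* assumed: 7 = retry forever */
#define RNR_WINDOW_UNBOUNDED UINT64_MAX  /* illustrative sentinel */

/* Approximate time (usec) between the first RNR NAK and the final
 * failed retry: one timer interval per allowed retry. */
uint64_t rnr_window_usec(uint8_t rnr_retry_cnt, uint8_t rnr_nak_timeout)
{
    assert(rnr_retry_cnt <= 7 && rnr_nak_timeout < 32);

    if (rnr_retry_cnt == RNR_RETRY_INFINITE)
        return RNR_WINDOW_UNBOUNDED;

    return (uint64_t)rnr_retry_cnt * rnr_timer_usec[rnr_nak_timeout];
}
```

With rnr_retry_cnt = 2 and timeout code 12 (0.64 ms) the receiver gets
only about 1.28 ms; rnr_retry_cnt = 6 with code 0 (655.36 ms) gives
roughly 3.9 s.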


> 
>>> 
>>> The problem we are attempting to understand is that in times of
>>> 'heavy' MPI induced system/node stress, the IBAL work-completion
>>> ib_wc_t.wr_id returns in the CQ callback handler set to zero. It was
>>> set to a valid pointer prior to the send post operation.
> 
> A WQE's wr_id is not sent or received; it is kept in an array
> associated with the WQE's QP.
> An incorrect wr_id may be returned only when mthca_poll_one() fails to
> find the QP related to the CQE in question.
> It prints a "CQ entry for unknown QP %06x" warning in this case.
> The failure to find the QP may occur if the QP has already been
> destroyed. The driver code handles this situation, but maybe there is
> still a bug that shows up only under heavy stress.
> 
> Do you see the above warning when you get wr_id = 0 ?


Found nothing in the system event log.
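
For reference, the lookup Leonid describes can be modelled roughly as
below. This is a toy sketch, not the mthca source: every toy_* name,
the table sizes, and the choice to leave wr_id at zero on a failed
lookup are assumptions made for illustration. Only the warning text is
taken from the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_MAX_QPS 4
#define TOY_WQES    8

struct toy_qp {
    int      in_use;
    uint32_t qpn;
    uint64_t wrid[TOY_WQES];   /* wr_id per WQE, never put on the wire */
};

struct toy_cqe { uint32_t qpn; uint32_t wqe_index; };
struct toy_wc  { uint64_t wr_id; };

static struct toy_qp toy_qp_table[TOY_MAX_QPS];

static struct toy_qp *toy_find_qp(uint32_t qpn)
{
    for (size_t i = 0; i < TOY_MAX_QPS; i++)
        if (toy_qp_table[i].in_use && toy_qp_table[i].qpn == qpn)
            return &toy_qp_table[i];
    return NULL;
}

int toy_create_qp(uint32_t qpn)
{
    for (size_t i = 0; i < TOY_MAX_QPS; i++) {
        if (!toy_qp_table[i].in_use) {
            toy_qp_table[i] = (struct toy_qp){ .in_use = 1, .qpn = qpn };
            return 0;
        }
    }
    return -1;                 /* table full */
}

void toy_destroy_qp(uint32_t qpn)
{
    struct toy_qp *qp = toy_find_qp(qpn);
    if (qp)
        qp->in_use = 0;
}

/* Posting a send records wr_id in the QP's array at the WQE's index. */
int toy_post_send(uint32_t qpn, uint32_t wqe_index, uint64_t wr_id)
{
    struct toy_qp *qp = toy_find_qp(qpn);
    if (!qp || wqe_index >= TOY_WQES)
        return -1;
    qp->wrid[wqe_index] = wr_id;
    return 0;
}

/* Polling maps the CQE back to its QP and reads wr_id from the array.
 * If the QP is gone there is nothing to read it from. */
int toy_poll_one(const struct toy_cqe *cqe, struct toy_wc *wc)
{
    struct toy_qp *qp = toy_find_qp(cqe->qpn);

    wc->wr_id = 0;
    if (!qp) {
        fprintf(stderr, "CQ entry for unknown QP %06x\n",
                (unsigned)cqe->qpn);
        return -1;
    }
    assert(cqe->wqe_index < TOY_WQES);
    wc->wr_id = qp->wrid[cqe->wqe_index];
    return 0;
}
```

In this model a completion that arrives after toy_destroy_qp() comes
back with wr_id = 0 and logs the warning, which is why the warning is
the first thing to look for.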

> 
> 
>> Without the
>>> induced system stress (other MPI/DAPL jobs running) the failing
>>> test runs for days. 
>>> 
>>> Thanks,
>>> 
>>> Stan.
>> 



