[ofw] RE: MT25208 vendor status code translation?

Thu May 1 01:13:42 PDT 2008

See inline 

> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com] 
> Sent: Thursday, May 01, 2008 2:37 AM
> To: Leonid Keller; Tzachi Dar
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw] RE: MT25208 vendor status code translation?
> 
> Leonid Keller wrote:
> > See inline
> > 
> >> -----Original Message-----
> >> From: ofw-bounces at lists.openfabrics.org 
> >> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> >> Sent: Wednesday, April 30, 2008 3:59 AM
> >> To: Tzachi Dar
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: [ofw] RE: MT25208 vendor status code translation?
> >> 
> >> 
> >> Forgot to mention the firmware is 4.08.0200 from vstat.
> >> 
> >> Smith, Stan wrote:
> >>> Hello,
> >>>   Can you point me to a document which would translate and
> >>> describe an MT25208 vendor status code reported in an IBAL
> >>> ib_wc_t.vendor_specific field? The IBAL error reported is
> >>> RNR_RETRY_ERR; I am curious as to what the vendor field value
> >>> (0x87) implies.
> > 
> > 0x87 is the vendor code for RNR retry exceeded
> 
> 
> Thanks for the decode.
> 
> It turns out, as suspected, that the receiver was not posting
> receives fast enough, so the RNR timeout logic kicked in because of
> a small rnr_retry_cnt combined with a short rnr_nak_timeout. I
> increased both values, and the problem has gone away for now.
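
To make the interaction between the two values concrete, here is a rough
sketch of how long a sender keeps retrying before the send completes with
an RNR retry error. The function name is hypothetical and not part of IBAL.
It assumes the timeout has already been decoded to microseconds (it is not
the 5-bit encoded timer value), and it treats the budget as simply
retries times timeout, which ignores wire and processing latency.

```c
#include <stdint.h>

/* In InfiniBand, an RNR retry count of 7 means "retry indefinitely". */
#define RNR_RETRY_INFINITE 7

/*
 * Approximate time, in microseconds, that a sender waits on RNR NAKs
 * before giving up with an RNR retry error.
 * Hypothetical helper for illustration; timeout_us is the decoded
 * RNR NAK timeout, not the encoded field value.
 * Returns UINT64_MAX when the retry count means "infinite".
 */
uint64_t rnr_retry_budget_us(unsigned rnr_retry_cnt, uint64_t timeout_us)
{
    if (rnr_retry_cnt >= RNR_RETRY_INFINITE)
        return UINT64_MAX;
    /* One timeout interval is waited before each retry. */
    return (uint64_t)rnr_retry_cnt * timeout_us;
}
```

With a retry count of 1 and a 10 microsecond timeout, the receiver has only
about 10 microseconds to post a receive, which is easy to miss on a loaded
node. Raising either value widens that window.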
> 
> 
> > 
> >>> 
> >>> The problem we are attempting to understand is that in times of
> >>> 'heavy' MPI-induced system/node stress, the IBAL work-completion
> >>> ib_wc_t.wr_id is returned to the CQ callback handler set to zero.
> >>> It was set to a valid pointer prior to the send post operation.
> > 
> > A WQE's wr_id is not sent or received; it is kept in an array
> > associated with the WQE's QP.
> > An incorrect wr_id may be returned only when mthca_poll_one() fails
> > to find the QP related to the CQE in question.
> > It prints a "CQ entry for unknown QP %06x" warning in this case.
> > The failure to find the QP may occur if the QP has already been
> > destroyed.
> > The driver code handles this situation, but maybe there is still a
> > bug that shows up only under heavy stress.
> > 
> > Do you see the above warning when you get wr_id = 0 ?
> 
> 
> Found nothing in the system event log.

It will not be there; the warning is debugger output, not an event log entry.
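
For reference, the lookup described above can be sketched roughly as
follows. This is a simplified illustration, not the actual mthca code:
all names (sketch_qp, qp_table, find_qp, sketch_poll_one) and the table
sizes are made up, and the real driver uses its own QP table and locking.
Only the warning text is taken from the driver.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_QPS  8   /* hypothetical table size */
#define MAX_WQES 4   /* hypothetical queue depth */

/* Hypothetical per-QP state: wr_id values live here, not in the packet. */
struct sketch_qp {
    uint32_t qpn;
    int      in_use;            /* cleared when the QP is destroyed */
    uint64_t wr_id[MAX_WQES];   /* indexed by WQE slot */
};

static struct sketch_qp qp_table[MAX_QPS];

/* Find a live QP by number; NULL if it is unknown or already destroyed. */
static struct sketch_qp *find_qp(uint32_t qpn)
{
    for (size_t i = 0; i < MAX_QPS; i++)
        if (qp_table[i].in_use && qp_table[i].qpn == qpn)
            return &qp_table[i];
    return NULL;
}

/*
 * Resolve the wr_id for one completion entry.
 * Returns 0 and fills *wr_id on success, -1 otherwise.
 * On failure *wr_id is left untouched, so a caller that ignores the
 * return value would see whatever it was initialised to (e.g. zero).
 */
int sketch_poll_one(uint32_t cqe_qpn, unsigned wqe_index, uint64_t *wr_id)
{
    struct sketch_qp *qp = find_qp(cqe_qpn);

    if (qp == NULL) {
        fprintf(stderr, "CQ entry for unknown QP %06x\n", (unsigned)cqe_qpn);
        return -1;
    }
    if (wqe_index >= MAX_WQES)
        return -1;

    *wr_id = qp->wr_id[wqe_index];
    return 0;
}
```

If a completion arrives for a QP that has just been destroyed, find_qp()
returns NULL and the wr_id cannot be recovered, which is the case the
warning is meant to flag.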
> 
> > 
> > 
> >>> Without the induced system stress (other MPI/DAPL jobs running),
> >>> the failing test runs for days.
> >>> 
> >>> Thanks,
> >>> 
> >>> Stan.
> >> 
> 
>