[ofw] RE: MT25208 vendor status code translation?
Leonid Keller
leonid at mellanox.co.il
Thu May 1 01:13:42 PDT 2008
See inline
> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com]
> Sent: Thursday, May 01, 2008 2:37 AM
> To: Leonid Keller; Tzachi Dar
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw] RE: MT25208 vendor status code translation?
>
> Leonid Keller wrote:
> > See inline
> >
> >> -----Original Message-----
> >> From: ofw-bounces at lists.openfabrics.org
> >> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> >> Sent: Wednesday, April 30, 2008 3:59 AM
> >> To: Tzachi Dar
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: [ofw] RE: MT25208 vendor status code translation?
> >>
> >>
> >> Forgot to mention the firmware is 4.08.0200 from vstat.
> >>
> >> Smith, Stan wrote:
> >>> Hello,
> >>> Can you point me to a document which would translate and describe
> >>> a MT25208 vendor status code reported in an IBAL
> >>> ib_wc_t.vendor_specific field? The IBAL error reported is
> >>> RNR_RETRY_ERR, curious as to what the vendor field value (0x87)
> >>> implies.
> >
> > 0x87 is the vendor code for RNR retry exceeded
>
>
> Thanks for the decode.
>
> Turns out, as suspected, the receiver was not posting
> receives fast enough, hence the RNR timeout logic kicked in due to
> a small rnr_retry_cnt with a short rnr_nak_timeout. Increased
> both values - problem has gone away for now.
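
A rough sketch of why raising both values helps, assuming the sender waits about one RNR NAK timer interval before each retry and reports the error once rnr_retry_cnt retries are used up. The helper name rnr_budget_usec and its microsecond parameter are illustrative only and are not part of IBAL. Per the InfiniBand specification, a retry count of 7 means retry indefinitely.

```c
#include <stdint.h>

/* IB spec: an RNR retry count of 7 means "retry indefinitely". */
#define RNR_RETRY_INFINITE 7u

/*
 * Hypothetical helper (not an IBAL call): approximate time, in
 * microseconds, that the receiver has to post a receive before the
 * sender completes the work request with an RNR retry error.
 *
 * rnr_retry_cnt  - number of RNR retries the sender QP allows
 * rnr_timer_usec - wait requested by the receiver in each RNR NAK,
 *                  already decoded to microseconds
 *
 * Returns -1 when the retry count is infinite (no deadline).
 */
int64_t rnr_budget_usec(unsigned rnr_retry_cnt, int64_t rnr_timer_usec)
{
    if (rnr_retry_cnt >= RNR_RETRY_INFINITE)
        return -1;

    /* One timer wait precedes each retry; wire latency is ignored. */
    return (int64_t)rnr_retry_cnt * rnr_timer_usec;
}
```

With illustrative numbers, rnr_budget_usec(2, 10000) gives 20000 (about 20 ms), while rnr_budget_usec(6, 100000) gives 600000 (about 0.6 s), so a receiver that is briefly starved under load has far more time to catch up.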
>
>
> >
> >>>
> >>> The problem we are attempting to understand is that in times of
> >>> 'heavy' MPI induced system/node stress, the IBAL work-completion
> >>> ib_wc_t.wr_id returns in the CQ callback handler set to zero? It was
> >>> set as a valid pointer prior to the send post operation.
> >
> > WQE's wr_id is not sent/received; it is kept in an array related to
> > the WQE's QP.
> > An incorrect wr_id may be returned only when mthca_poll_one() fails to
> > find the QP related to the CQE in question.
> > It prints a "CQ entry for unknown QP %06x" warning in this case.
> > The failure to find the QP may occur if the QP has already been
> > destroyed.
> > The driver code handles this situation, but maybe there is still a
> > bug that shows up only under heavy stress.
> >
> > Do you see the above warning when you get wr_id = 0 ?
>
>
> Found nothing in the system event log.
It is not there; it is debugger output.
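
A simplified model of the lookup described above, to show how a CQE for a QP that is already gone can leave wr_id at zero. This is only a sketch and not the mthca code: every type and function name below (model_qp, model_cqe, model_poll_one and so on) is hypothetical, and the real driver uses different structures and locking. Only the warning text is taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MODEL_MAX_QPS     4
#define MODEL_WQES_PER_QP 8

/* Hypothetical QP model: wr_id values are saved at post time, per QP. */
struct model_qp {
    uint32_t qpn;
    int      in_use;
    uint64_t wr_id[MODEL_WQES_PER_QP];  /* indexed by WQE slot */
};

/* Hypothetical CQE model: carries the QP number and WQE slot only. */
struct model_cqe {
    uint32_t qpn;
    uint32_t wqe_index;
};

static struct model_qp model_qp_table[MODEL_MAX_QPS];

struct model_qp *model_create_qp(uint32_t qpn)
{
    for (size_t i = 0; i < MODEL_MAX_QPS; i++) {
        if (!model_qp_table[i].in_use) {
            model_qp_table[i].in_use = 1;
            model_qp_table[i].qpn = qpn;
            return &model_qp_table[i];
        }
    }
    return NULL;  /* table full */
}

void model_destroy_qp(struct model_qp *qp)
{
    qp->in_use = 0;
}

/* The wr_id never goes on the wire; it is stored next to the WQE. */
void model_post(struct model_qp *qp, uint32_t slot, uint64_t wr_id)
{
    qp->wr_id[slot % MODEL_WQES_PER_QP] = wr_id;
}

/* Returns 0 and fills *wr_id, or -1 if the CQE's QP cannot be found. */
int model_poll_one(const struct model_cqe *cqe, uint64_t *wr_id)
{
    for (size_t i = 0; i < MODEL_MAX_QPS; i++) {
        struct model_qp *qp = &model_qp_table[i];
        if (qp->in_use && qp->qpn == cqe->qpn) {
            *wr_id = qp->wr_id[cqe->wqe_index % MODEL_WQES_PER_QP];
            return 0;
        }
    }
    /* QP already destroyed: *wr_id is left as the caller set it. */
    fprintf(stderr, "CQ entry for unknown QP %06x\n", (unsigned)cqe->qpn);
    return -1;
}
```

In this model a completion polled after model_destroy_qp() returns -1 and leaves the caller's wr_id untouched, so a work-completion structure that was zeroed beforehand would show wr_id = 0.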
>
> >
> >
> >>> Without the induced system stress (other MPI/DAPL jobs running) the
> >>> failing test runs for days.
> >>>
> >>> Thanks,
> >>>
> >>> Stan.
> >>
> >> _______________________________________________
> >> ofw mailing list
> >> ofw at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>
>