[ofw] RE: MT25208 vendor status code translation?
Leonid Keller
leonid at mellanox.co.il
Wed Apr 30 03:00:44 PDT 2008
See inline
> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> Sent: Wednesday, April 30, 2008 3:59 AM
> To: Tzachi Dar
> Cc: ofw at lists.openfabrics.org
> Subject: [ofw] RE: MT25208 vendor status code translation?
>
>
> Forgot to mention the firmware is 4.08.0200 from vstat.
>
> Smith, Stan wrote:
> > Hello,
> > Can you point me to a document which would translate and describe a
> > MT25208 vendor status code reported in an IBAL ib_wc_t.vendor_specific
> > field?
> > The IBAL error reported is RNR_RETRY_ERR, curious as to what the
> > vendor field value (0x87) implies.
0x87 is the vendor code for "RNR retry exceeded": the receiver kept
answering "receiver not ready" until the QP's RNR retry count ran out.
> >
> > The problem we are attempting to understand is that in times of
> > 'heavy' MPI induced system/node stress, the IBAL work-completion
> > ib_wc_t.wr_id returns in the CQ callback handler set to zero? It was
> > set as a valid pointer prior to the send post operation.
A WQE's wr_id is not sent or received on the wire; it is kept in an array
that belongs to the WQE's QP.
An incorrect wr_id can be returned only when mthca_poll_one() fails to
find the QP that the CQE in question refers to.
In that case it prints the warning "CQ entry for unknown QP %06x".
The QP lookup can fail if the QP has already been destroyed.
The driver code handles this situation, but there may still be a bug
that shows up only under heavy stress.
Do you see the above warning when you get wr_id = 0?
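
To make the lookup path described above concrete, here is a minimal,
simplified model in C. It is not the mthca driver code: every name in it
(model_qp, model_cqe, model_wc, model_poll_one and so on) is hypothetical,
and the table sizes are arbitrary. It only illustrates the idea that wr_id
lives in a per-QP array and is recovered from the CQE's QP number and WQE
index, and it assumes the caller zeroes the work completion before polling,
so a failed QP lookup leaves wr_id at 0.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define MODEL_MAX_QPS  4   /* arbitrary size for this sketch */
#define MODEL_WQ_DEPTH 8   /* arbitrary work queue depth */

/* Hypothetical per-QP state: wr_id values are stored here at post time. */
struct model_qp {
    uint32_t qpn;
    int      in_use;
    uint64_t wrid[MODEL_WQ_DEPTH];
};

/* Hypothetical CQE: carries only the QP number and WQE index, no wr_id. */
struct model_cqe { uint32_t qpn; uint32_t wqe_index; };

/* Hypothetical work completion handed back to the caller. */
struct model_wc { uint64_t wr_id; };

static struct model_qp model_qp_table[MODEL_MAX_QPS];

/* Claim a free slot for a new QP; returns NULL if the table is full. */
static struct model_qp *model_create_qp(uint32_t qpn)
{
    for (size_t i = 0; i < MODEL_MAX_QPS; i++) {
        if (!model_qp_table[i].in_use) {
            model_qp_table[i].qpn = qpn;
            model_qp_table[i].in_use = 1;
            return &model_qp_table[i];
        }
    }
    return NULL;
}

/* After this call, completions for the QP can no longer be resolved. */
static void model_destroy_qp(struct model_qp *qp)
{
    qp->in_use = 0;
}

static struct model_qp *model_lookup_qp(uint32_t qpn)
{
    for (size_t i = 0; i < MODEL_MAX_QPS; i++) {
        if (model_qp_table[i].in_use && model_qp_table[i].qpn == qpn)
            return &model_qp_table[i];
    }
    return NULL;
}

/* Posting a send records wr_id in the QP's array; nothing goes on the wire. */
static void model_post_send(struct model_qp *qp, uint32_t wqe_index,
                            uint64_t wr_id)
{
    qp->wrid[wqe_index % MODEL_WQ_DEPTH] = wr_id;
}

/*
 * Resolve one CQE into a work completion.
 * Returns 0 on success. Returns -1 and leaves *wc untouched when the QP
 * is unknown, which is the case that produces the warning quoted above.
 */
static int model_poll_one(const struct model_cqe *cqe, struct model_wc *wc)
{
    struct model_qp *qp = model_lookup_qp(cqe->qpn);

    if (qp == NULL) {
        fprintf(stderr, "CQ entry for unknown QP %06x\n",
                (unsigned)cqe->qpn);
        return -1;
    }
    wc->wr_id = qp->wrid[cqe->wqe_index % MODEL_WQ_DEPTH];
    return 0;
}
```

In this model, polling a CQE for a live QP returns the wr_id that was
posted, while polling the same CQE after model_destroy_qp() prints the
warning and leaves a zeroed completion with wr_id = 0. Whether the real
driver can reach that state under stress is exactly the open question.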
> > Without the
> > induced system stress (other MPI/DAPL jobs running) the failing test
> > runs for days.
> >
> > Thanks,
> >
> > Stan.