[ofw] RE: MT25208 vendor status code translation?

Leonid Keller leonid at mellanox.co.il
Wed Apr 30 03:00:44 PDT 2008


See inline.

> -----Original Message-----
> From: ofw-bounces at lists.openfabrics.org 
> [mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> Sent: Wednesday, April 30, 2008 3:59 AM
> To: Tzachi Dar
> Cc: ofw at lists.openfabrics.org
> Subject: [ofw] RE: MT25208 vendor status code translation?
> 
> 
> Forgot to mention the firmware is 4.08.0200 from vstat.
> 
> Smith, Stan wrote:
> > Hello,
> >   Can you point me to a document which would translate and describe a
> > MT25208 vendor status code reported in an IBAL ib_wc_t.vendor_specific
> > field?
> > The IBAL error reported is RNR_RETRY_ERR, curious as to what the
> > vendor field value (0x87) implies.

0x87 is the vendor code for "RNR retry exceeded".

> > 
> > The problem we are attempting to understand is that in times of
> > 'heavy' MPI induced system/node stress, the IBAL work-completion
> > ib_wc_t.wr_id returns in the CQ callback handler set to zero. It was
> > set as a valid pointer prior to the send post operation.

A WQE's wr_id is not sent or received; it is kept in an array that
belongs to the WQE's QP.
An incorrect wr_id can be returned only when mthca_poll_one() fails to
find the QP that the CQE in question refers to.
In that case it prints the warning "CQ entry for unknown QP %06x".
The QP lookup can fail if the QP has already been destroyed.
The driver code handles this situation, but there may still be a bug
that shows up only under heavy stress.

Do you see the above warning when you get wr_id = 0?
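
To make the lookup described above concrete, here is a minimal sketch in
plain C. It is not the mthca driver code: the struct names, table sizes,
and helper functions (sketch_qp, create_qp, poll_one, and so on) are all
hypothetical, and zeroing the work completion on a failed lookup is an
assumption made only to mimic the wr_id = 0 symptom. Only the warning
text is taken from the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_QPS  4   /* hypothetical QP table size */
#define QP_DEPTH 8   /* hypothetical work-queue depth */

/* Hypothetical per-QP state: wr_id values are stored here at post time. */
struct sketch_qp {
    uint32_t qpn;
    int      in_use;
    uint64_t wrid[QP_DEPTH];
};

/* Hypothetical CQE: it names the QP and the WQE slot, not the wr_id. */
struct sketch_cqe { uint32_t qpn; uint32_t wqe_index; };
struct sketch_wc  { uint64_t wr_id; };

static struct sketch_qp qp_table[MAX_QPS];

struct sketch_qp *find_qp(uint32_t qpn)
{
    for (int i = 0; i < MAX_QPS; i++)
        if (qp_table[i].in_use && qp_table[i].qpn == qpn)
            return &qp_table[i];
    return NULL;
}

struct sketch_qp *create_qp(uint32_t qpn)
{
    for (int i = 0; i < MAX_QPS; i++) {
        if (!qp_table[i].in_use) {
            memset(&qp_table[i], 0, sizeof qp_table[i]);
            qp_table[i].qpn = qpn;
            qp_table[i].in_use = 1;
            return &qp_table[i];
        }
    }
    return NULL; /* table full */
}

void destroy_qp(struct sketch_qp *qp)
{
    assert(qp != NULL);
    qp->in_use = 0; /* later lookups by QPN will now fail */
}

/* Remember the caller's wr_id in the slot used by this WQE. */
void post_send(struct sketch_qp *qp, uint32_t wqe_index, uint64_t wr_id)
{
    assert(qp != NULL);
    qp->wrid[wqe_index % QP_DEPTH] = wr_id;
}

/* Translate one CQE into a work completion; returns 0 on success, -1 if
 * the QP named by the CQE is unknown. */
int poll_one(const struct sketch_cqe *cqe, struct sketch_wc *wc)
{
    struct sketch_qp *qp;

    memset(wc, 0, sizeof *wc); /* assumption: wc starts out zeroed */
    qp = find_qp(cqe->qpn);
    if (qp == NULL) {
        fprintf(stderr, "CQ entry for unknown QP %06x\n",
                (unsigned)cqe->qpn);
        return -1; /* wr_id stays 0 */
    }
    wc->wr_id = qp->wrid[cqe->wqe_index % QP_DEPTH];
    return 0;
}
```

In this sketch, a completion polled while the QP exists returns the wr_id
that was stored by post_send(). A completion for the same QPN polled after
destroy_qp() prints the warning and leaves wr_id at zero, which is the
kind of race between QP destruction and CQ polling the question above is
meant to rule in or out.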


> > Without the induced system stress (other MPI/DAPL jobs running) the
> > failing test runs for days.
> > 
> > Thanks,
> > 
> > Stan.