[ofa-general] Help with an MTHCA "catastrophe"

Olivier Cozette olivier.cozette at seanodes.com
Tue Apr 10 01:56:43 PDT 2007


	Hi,

I had the same error with my driver. After some investigation, I found 
that my srq depth and cq depth were too small to handle the maximum number of 
sends/receives that my application can generate concurrently. Normally, in that 
case the qp should move to the error state, but instead a catastrophic 
error occurs.

I increased the srq/cq depth to cover the maximum number of sends/receives that 
my application can generate concurrently (without waiting for a reply or other 
synchronization), and the bug no longer occurs.

So, you probably just need to increase your srq/cq depth and post enough 
buffers to cover the maximum number of sends/receives that your driver can 
have outstanding.

	Olivier

Note: I have an MT25204 rev a0 with firmware 1.2.0.

On Tuesday 20 March 2007 18:59, Eric Barton wrote:
> The following is console output immediately before a panic on a system
> running lustre with OFED 1.1.  How can I find out what it means?
>
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[00]: 001d79f4
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[01]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[02]: 00198538
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[03]: 00136038
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[04]: 00207730
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[05]: 001d79cc
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[06]: 0023cf24
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[07]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[08]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[09]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0a]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0b]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0c]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0d]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0e]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0f]: 00000000
>
> ...shortly before it happens, the lustre/lnet OFED driver receives a number
> of what I believe to be duplicate SEND completion events.  It seems quite
> sporadic, and doesn't appear to track hardware.
>
> More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381
>
>                 Cheers,
>                         Eric


