[ofa-general] Help with an MTHCA "catastrophe"

Olivier Cozette olivier.cozette at seanodes.com
Mon Apr 2 03:06:22 PDT 2007


	Hello,

I have the same problem with my program, which uses libibverbs (SRQ + remote 
write) on an MT25204 (InfiniHost III Lx HCA rev a0, firmware 1.2.000) in a 
30-node cluster. In this environment we regularly reboot some nodes (for 
testing). With the MT23108 (InfiniHost rev a1, firmware 3.4.0), we only get an 
ordinary error in the wc in this case, and only the connections to the 
dead/rebooted nodes break. Note that the HCA reset in the latest OFED does not 
solve this for us, because we don't want to break connections with working 
nodes!

Do you know of a workaround?

	Best regards,
	Olivier


ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
ib_mthca 0000:0c:00.0:   buf[00]: 0012f6f8
ib_mthca 0000:0c:00.0:   buf[01]: 00000000
ib_mthca 0000:0c:00.0:   buf[02]: 00000000
ib_mthca 0000:0c:00.0:   buf[03]: 00000000
ib_mthca 0000:0c:00.0:   buf[04]: 00000000
ib_mthca 0000:0c:00.0:   buf[05]: 0012f6dc
ib_mthca 0000:0c:00.0:   buf[06]: 0018753c
ib_mthca 0000:0c:00.0:   buf[07]: 00000000
ib_mthca 0000:0c:00.0:   buf[08]: 00000000
ib_mthca 0000:0c:00.0:   buf[09]: 00000000
ib_mthca 0000:0c:00.0:   buf[0a]: 00000000
ib_mthca 0000:0c:00.0:   buf[0b]: 00000000
ib_mthca 0000:0c:00.0:   buf[0c]: 00000000
ib_mthca 0000:0c:00.0:   buf[0d]: 00000000
ib_mthca 0000:0c:00.0:   buf[0e]: 00000000
ib_mthca 0000:0c:00.0:   buf[0f]: 00000000
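
To compare dumps like the one above across nodes, a small script can pull out 
the non-zero buf[] words. This is only a sketch: the function names 
(parse_catas_dump, nonzero_words) are made up for illustration, the regular 
expression assumes the exact line layout shown in this thread, and it does not 
try to interpret the words, whose meaning is internal to the firmware.

```python
import re

# Matches lines such as "ib_mthca 0000:0c:00.0:   buf[05]: 0012f6dc",
# with or without a leading timestamp or "> " quote marker.
BUF_LINE = re.compile(
    r"ib_mthca\s+(?P<dev>[0-9a-fA-F:.]+):\s+"
    r"buf\[(?P<idx>[0-9a-fA-F]{2})\]:\s+(?P<val>[0-9a-fA-F]{8})"
)


def parse_catas_dump(text):
    """Return {pci_device: {index: value}} for every buf[] line in text."""
    dumps = {}
    for line in text.splitlines():
        m = BUF_LINE.search(line)
        if m:
            words = dumps.setdefault(m.group("dev"), {})
            words[int(m.group("idx"), 16)] = int(m.group("val"), 16)
    return dumps


def nonzero_words(dump):
    """Keep only the non-zero words of one device's dump, ordered by index."""
    return {i: v for i, v in sorted(dump.items()) if v != 0}


if __name__ == "__main__":
    # A few lines copied from the console output in this message.
    sample = """\
ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
ib_mthca 0000:0c:00.0:   buf[00]: 0012f6f8
ib_mthca 0000:0c:00.0:   buf[01]: 00000000
ib_mthca 0000:0c:00.0:   buf[05]: 0012f6dc
ib_mthca 0000:0c:00.0:   buf[06]: 0018753c
"""
    for dev, dump in parse_catas_dump(sample).items():
        for idx, val in nonzero_words(dump).items():
            print(f"{dev} buf[{idx:02x}] = {val:08x}")
```

Feeding it the saved console log from each node gives one short list per PCI 
device, which makes it easier to see whether the same words show up on every 
failure.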


On Sunday 1 April 2007 11:03, Ariel Shachar wrote:
> bug 40567
>
>
> -----Original Message-----
> From: general-bounces at lists.openfabrics.org
> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Eric Barton
> Sent: Tuesday, March 20, 2007 8:00 PM
> To: general at lists.openfabrics.org
> Subject: [ofa-general] Help with an MTHCA "catastrophe"
>
>
>
> The following is console output immediately before a panic on a system
> running lustre with OFED 1.1.  How can I find out what it means?
>
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[00]: 001d79f4
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[01]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[02]: 00198538
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[03]: 00136038
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[04]: 00207730
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[05]: 001d79cc
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[06]: 0023cf24
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[07]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[08]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[09]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0a]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0b]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0c]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0d]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0e]: 00000000
> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0f]: 00000000
>
> ...shortly before it happens, the lustre/lnet OFED driver receives a
> number of what I believe to be duplicate SEND completion events.  It
> seems quite sporadic, and doesn't appear to track hardware.
>
> More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381
>
>                 Cheers,
>                         Eric
>
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
