[ofa-general] Help with an MTHCA "catastrophe"

Todd Bowman twbowman at gmail.com
Tue Apr 10 08:11:04 PDT 2007


Olivier,

I am having similar issues with the same firmware.
Can you give me some more details?

Did you make the changes on the driver side or in the application?
If on the driver, can you point me in the right direction to make those
changes?

Thanks,
Todd

On 4/10/07, Olivier Cozette <olivier.cozette at seanodes.com> wrote:
>
>         Hi,
>
> I had the same error with my driver. After some investigation, I found
> that my SRQ depth and CQ depth were too small to handle the maximum number
> of sends/receives that my application can generate concurrently. Normally,
> in that case the QP should move to the error state, but instead a
> catastrophic error occurs.
>
> I increased the SRQ/CQ depth to match the maximum number of sends/receives
> that my application can generate concurrently (without reply/synchronization),
> and the bug no longer occurs.
>
> So you probably just need to increase your SRQ/CQ depth and the number of
> posted buffers to match the maximum number of sends/receives that your
> driver can issue.
>
>         Olivier
>
> Note: I have an MT25204 rev a0, firmware 1.2.0.
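
The sizing rule Olivier describes can be sketched as simple arithmetic. This is only an illustration under stated assumptions, not code from his driver: the `Workload` fields and the function name are hypothetical, and it assumes that every send is signaled (one completion each) and that all QPs share a single SRQ and a single CQ. The resulting numbers are what one would request as the SRQ's maximum work requests and the CQ's entry count when creating them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Workload:
    """Hypothetical worst-case traffic description (not from the thread)."""
    num_qps: int              # queue pairs sharing one SRQ and one CQ
    max_sends_per_qp: int     # sends one QP may have outstanding at once
    max_recvs_in_flight: int  # receives that can arrive before buffers are reposted


def required_depths(w: Workload) -> Tuple[int, int]:
    """Return (srq_depth, cq_depth) large enough for the worst case."""
    # Every receive that can arrive before the consumer reposts needs a
    # buffer already sitting on the SRQ.
    srq_depth = w.max_recvs_in_flight
    # Assumption: each outstanding send and each posted receive can produce
    # one completion, and nothing is polled until all of them have landed.
    cq_depth = w.num_qps * w.max_sends_per_qp + srq_depth
    return srq_depth, cq_depth


if __name__ == "__main__":
    srq, cq = required_depths(
        Workload(num_qps=8, max_sends_per_qp=16, max_recvs_in_flight=256)
    )
    print("srq depth:", srq, "cq depth:", cq)
```

The computed depths still have to be checked against the limits the HCA reports; if they exceed them, the number of concurrent operations has to be throttled instead.
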
>
> On Tuesday 20 March 2007 18:59, Eric Barton wrote:
> > The following is console output immediately before a panic on a system
> > running lustre with OFED 1.1.  How can I find out what it means?
> >
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[00]: 001d79f4
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[01]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[02]: 00198538
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[03]: 00136038
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[04]: 00207730
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[05]: 001d79cc
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[06]: 0023cf24
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[07]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[08]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[09]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0a]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0b]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0c]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0d]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0e]: 00000000
> > 2007-02-21 12:02:42 ib_mthca 0000:07:00.0:   buf[0f]: 00000000
> >
> > ...shortly before it happens, the lustre/lnet OFED driver receives a number
> > of what I believe to be duplicate SEND completion events.  It seems quite
> > sporadic, and doesn't appear to track hardware.
> >
> > More info at https://bugzilla.lustre.org/show_bug.cgi?id=11381
> >
> >                 Cheers,
> >                         Eric
> >
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general