Olivier,<br><br>I am having similar issues with the same firmware.<br>Can you give me some more details?<br><br>Did you make the changes on the driver side or in the application?<br>If in the driver, can you point me in the right direction to make those changes?
<br><br>Thanks,<br>Todd<br><br><div><span class="gmail_quote">On 4/10/07, <b class="gmail_sendername">Olivier Cozette</b> <<a href="mailto:olivier.cozette@seanodes.com">olivier.cozette@seanodes.com</a>> wrote:</span>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hi,<br><br>I had the same error with my driver, and after some investigation I found
<br>that my SRQ depth and CQ depth were too small to handle the maximum number of<br>sends/receives that my application can generate concurrently. Normally, in that<br>case the QP should move to the error state, but a catastrophic
<br>error occurred instead.<br><br>I increased the SRQ/CQ depth to match the maximum number of sends/receives that my application<br>can generate concurrently (without reply/synchronization), and the bug no longer occurs.<br><br>So you probably just need to increase your SRQ/CQ depth and the number of posted buffers to
<br>match the maximum number of sends/receives that your driver can issue.<br><br> Olivier<br><br>Note: I have an MT25204 rev a0, firmware 1.2.0.<br><br>On Tuesday 20 March 2007 18:59, Eric Barton wrote:<br>> The following is console output immediately before a panic on a system
<br>> running lustre with OFED 1.1. How can I find out what it means?<br>>
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: Catastrophic error detected: internal error
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[00]: 001d79f4
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[01]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[02]: 00198538
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[03]: 00136038
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[04]: 00207730
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[05]: 001d79cc
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[06]: 0023cf24
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[07]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[08]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[09]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0a]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0b]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0c]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0d]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0e]: 00000000
<br>> 2007-02-21 12:02:42 ib_mthca 0000:07:00.0: buf[0f]: 00000000
<br>><br>> ...shortly before it happens, the lustre/lnet OFED driver receives a number
<br>> of what I believe to be duplicate SEND completion events. It seems quite<br>> sporadic, and doesn't appear to track hardware.<br>><br>> More info at <a href="https://bugzilla.lustre.org/show_bug.cgi?id=11381">
https://bugzilla.lustre.org/show_bug.cgi?id=11381</a><br>><br>> Cheers,<br>> Eric<br>><br>><br>> _______________________________________________<br>> general mailing list
<br>> <a href="mailto:general@lists.openfabrics.org">general@lists.openfabrics.org</a><br>> <a href="http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general">http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
</a><br>><br>> To unsubscribe, please visit<br>> <a href="http://openib.org/mailman/listinfo/openib-general">http://openib.org/mailman/listinfo/openib-general</a><br></blockquote></div><br>
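For reference, the sizing rule Olivier describes in the quoted message (make the SRQ and CQ deep enough for the worst-case number of concurrent sends/receives) could be sketched roughly as below. This is only an illustration under stated assumptions: the struct and function names are hypothetical, a single CQ is assumed to collect both send and receive completions, and the device limits are assumed to come from the device attributes (for example the max_srq_wr and max_cqe fields). It is not taken from any existing driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical worst-case load of one driver instance (illustrative names). */
struct wr_load {
    uint32_t max_outstanding_sends; /* sends that can be posted before any completion is reaped */
    uint32_t max_outstanding_recvs; /* receive buffers that must be posted to the SRQ at once */
};

/* Depths to request when creating the SRQ and the CQ. */
struct queue_depths {
    uint32_t srq_depth;
    uint32_t cq_depth;
};

/*
 * Compute queue depths that cover the worst case, assuming one CQ is
 * shared by send and receive completions. Returns false (and leaves
 * *out untouched) if the required depths exceed the device limits.
 */
static bool size_queues(const struct wr_load *load,
                        uint32_t dev_max_srq_wr,
                        uint32_t dev_max_cqe,
                        struct queue_depths *out)
{
    /* 64-bit sum so that two large 32-bit counts cannot wrap around. */
    uint64_t cq = (uint64_t)load->max_outstanding_sends +
                  load->max_outstanding_recvs;

    if (load->max_outstanding_recvs > dev_max_srq_wr || cq > dev_max_cqe)
        return false;

    out->srq_depth = load->max_outstanding_recvs;
    out->cq_depth = (uint32_t)cq;
    return true;
}
```

The resulting values would then be passed as the requested depths when the SRQ and CQ are created. If the function returns false, the limit on concurrent sends/receives has to be lowered on the driver or application side instead.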