[ofa-general] Help with an MTHCA "catastrophe"

Olivier Cozette olivier.cozette at seanodes.com
Thu Apr 12 05:55:06 PDT 2007


Todd,

Sorry for the late reply.

> I am having similar issues with the same firmware.
> Can you give me some more details?

I have this bug on MT25204 (InfiniHost III Lx HCA, memfree, rev a0, PCI
Express) on 30 nodes, with firmware 1.2.0 (the latest, from 26 December 2006).
Note that I have no problem with my MT23108 (InfiniHost 2MiB rev a1 PCI-X):
that board reports a normal error when the SRQ is empty and a new buffer
arrives.

> Did you make the changes on the driver side or  the application?

In my application (which uses libibverbs directly). I only changed the
maximum number of completion events in the completion queue (ibv_create_cq())
and the maximum number of receive buffers (ibv_create_srq()), and I always
post at least as many buffers to the SRQ as my application's design requires
(my application cannot receive more than N buffers without consuming some of
them and telling the sender that it is OK to continue).

With these changes, my application can never receive more buffers than are
posted in the SRQ, and the CQ always has enough room for all completion
events (receive + send completions).
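
The sizing rule above can be sketched roughly as follows. This is only an
illustration, not my actual code: size_queues() and struct queue_depths are
hypothetical names, and the limit on outstanding sends is an assumption (the
rule itself only needs N receive buffers plus room for the send completions
that share the same CQ).

```c
#include <assert.h>

/* Hypothetical helper, not part of libibverbs. */
struct queue_depths {
    int srq_max_wr; /* for ibv_srq_init_attr.attr.max_wr (ibv_create_srq()) */
    int cq_cqe;     /* for the cqe argument of ibv_create_cq() */
};

/*
 * max_unacked_recv:     N, the most buffers the peer(s) may send before the
 *                       receiver consumes some and tells the sender to go on.
 * max_outstanding_send: assumed upper bound on sends posted but not yet
 *                       completed; their completions land in the same CQ.
 */
static struct queue_depths size_queues(int max_unacked_recv,
                                       int max_outstanding_send)
{
    struct queue_depths d;

    assert(max_unacked_recv > 0 && max_outstanding_send >= 0);

    /* The SRQ must hold at least N posted receive buffers. */
    d.srq_max_wr = max_unacked_recv;
    /* The CQ must have room for every receive and send completion. */
    d.cq_cqe = max_unacked_recv + max_outstanding_send;
    return d;
}
```

The application then has to keep at least N buffers posted with
ibv_post_srq_recv() at all times, otherwise the larger queues alone do not
help.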

So now I no longer get the catastrophic error, but I sometimes get "ib_mthca
0000:0c:00.0: Async event for bogus QP 00180405". In this case the buffer was
sent correctly (no error on the sender), but the receiver was not woken up in
its ibv_get_cq_event().

> If on the driver, can you point me in the right direction to make those
> changes?

Perhaps the only change you need is to increase your SRQ/CQ length, post
enough buffers to the SRQ, and add something that wakes you up from
ibv_get_cq_event() after some timeout so you can check whether ibv_poll_cq()
finds something.

But it seems that the OpenFabrics people are working on this bug
("iser/lustre memfree issues").

Olivier



