[ofa-general] Help with an MTHCA "catastrophe"
Olivier Cozette
olivier.cozette at seanodes.com
Thu Apr 12 05:55:06 PDT 2007
Todd,
Sorry for this late reply,
> I am having similar issues with the same firmware.
> Can you give me some more details?
I see this bug on MT25204 (InfiniHost III Lx HCA memfree rev a0 PCI Express),
on 30 nodes, with firmware 1.2.0 (the latest, from 26 December 2006). Note that
I have no problem with my MT23108 (InfiniHost 2MiB rev a1 PCI-X); that board
reports a normal error when a message arrives while the SRQ is empty.
> Did you make the changes on the driver side or the application?
In my application (which uses libibverbs directly), I just changed the
maximum number of completion entries in the completion queue (ibv_create_cq())
and the maximum number of receive buffers (ibv_create_srq()), and I always
post at least as many buffers to the SRQ as my application's design requires
(my application cannot receive more than N buffers without consuming some of
them and telling the sender it is OK to continue).
With these changes, my application can no longer receive more buffers than
are posted in the SRQ, and the CQ always has enough room for all completion
events (receive + send completions).
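
As a rough illustration of the sizing rule above, here is a minimal sketch in
C. The helper, its struct and its parameters (num_senders, credits_per_sender,
max_outstanding_sends) are hypothetical names, not part of libibverbs, and the
split of N into "senders times credits" is only an assumption about how a
credit scheme might be organised. The two results are meant as candidates for
ibv_srq_init_attr.attr.max_wr and for the cqe argument of ibv_create_cq().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: these names are illustrative only and do
 * not come from libibverbs. */
struct queue_depths {
    uint32_t srq_max_wr;  /* candidate for ibv_srq_init_attr.attr.max_wr */
    uint32_t cq_entries;  /* candidate for the cqe argument of ibv_create_cq() */
};

/* Assumption: each sender may have at most credits_per_sender messages
 * in flight before the receiver acknowledges, so the SRQ never needs
 * more than num_senders * credits_per_sender posted buffers.  The CQ
 * must hold one entry per posted receive plus one per outstanding send. */
struct queue_depths compute_queue_depths(uint32_t num_senders,
                                         uint32_t credits_per_sender,
                                         uint32_t max_outstanding_sends)
{
    struct queue_depths d;

    assert(num_senders > 0 && credits_per_sender > 0);

    d.srq_max_wr = num_senders * credits_per_sender;
    d.cq_entries = d.srq_max_wr + max_outstanding_sends;
    return d;
}
```

The computed values should still be compared against the limits the device
reports (max_srq_wr and max_cqe in struct ibv_device_attr) before they are
used.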
So now I no longer get the catastrophic error, but I sometimes get "ib_mthca
0000:0c:00.0: Async event for bogus QP 00180405". In this case the buffer was
sent correctly (no error on the sender), but the receiver was not woken up in
its ibv_get_cq_event().
> If on the driver, can you point me in the right direction to make those
> changes?
Perhaps your change is only to increase your SRQ/CQ length, post enough
buffers to the SRQ, and add something to wake up from ibv_get_cq_event() after
some timeout to check whether ibv_poll_cq() can find something.
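
One possible way to get such a timeout, sketched here under assumptions, is to
wait on the completion channel's file descriptor with poll(2) instead of
blocking directly in ibv_get_cq_event(). The helper name wait_readable() is
hypothetical; the only libibverbs detail relied on is that struct
ibv_comp_channel exposes an fd member.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/* Hypothetical helper (not a libibverbs function): wait until fd is
 * readable or until timeout_ms milliseconds have passed.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(timeout_ms >= 0);  /* this sketch expects a finite timeout */

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait.  Note that each retry
     * restarts the full timeout, which is acceptable for a sketch. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

With libibverbs, fd would be channel->fd of the completion channel given to
ibv_create_cq(). When the helper returns 1, ibv_get_cq_event() should return
without blocking; when it returns 0, the application can call ibv_poll_cq()
anyway to pick up a completion whose event was missed.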
But it seems that the OpenFabrics people are already working on this bug
("iser/lustre memfree issues").
Olivier