strange mem-free bug (was: [openib-general] completion Q overflow error/panic)
Roland Dreier
rolandd at cisco.com
Mon Sep 12 17:30:10 PDT 2005
While looking at Viswa's example, I've found what seems to be a
problem using lots of QPs on mem-free HCAs. This could easily be an
mthca driver bug, but I'd appreciate it if Mellanox would take a look
and help track down the issue. I looked at the mthca code and don't
see anything wrong, so either narrowing down the software bug or
telling me it's actually a FW/HW bug would be great.
I'm attaching a fairly simple program that shows the problem on my
systems. It just creates a bunch of QPs and has one side send one
message from each QP. The other side waits for receives and sends a
reply back for every receive it gets. When all the replies are
received, it loops around and does it again.
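As a rough illustration of the traffic pattern just described, here is a minimal sketch in plain C. It is only a model and is not taken from rc-test.c: it makes no verbs calls, each "QP" is just an array index, and the names run_round, round_complete and the dropped fault-injection array are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical model of one round of the test. No verbs calls are made;
 * a "QP" here is only an index. dropped may be NULL; if dropped[qp] is
 * nonzero, the message sent on that QP is treated as never delivered.
 * Returns the number of replies the sending side sees complete.
 */
size_t run_round(size_t num_qps, const unsigned char *dropped)
{
    /* received_on[i] records which QP the i-th delivered message used. */
    size_t *received_on = malloc(num_qps * sizeof *received_on);
    size_t received = 0;
    size_t replies = 0;

    if (received_on == NULL)
        return 0;

    /* Sending side: post exactly one send on every QP. */
    for (size_t qp = 0; qp < num_qps; ++qp) {
        if (dropped == NULL || !dropped[qp])
            received_on[received++] = qp;
    }

    /* Responding side: one reply for every receive it gets.
     * Sending side: count each reply that comes back. */
    for (size_t i = 0; i < received; ++i)
        ++replies;

    free(received_on);
    return replies;
}

/* The test loops around only once every reply has been received. */
int round_complete(size_t replies, size_t num_qps)
{
    return replies == num_qps;
}
```

In this model a healthy round returns num_qps from run_round, so round_complete is true and the next round can start. A round in which some messages never arrive returns a smaller count and never completes, which is the kind of stall the output further down reports.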
To build the example, just do:

    gcc -o rc-test rc-test.c -libverbs

To run, do

    rc-test

on one system, and

    rc-test <listening address>

on the other. In fact, I can reproduce the problem even on a single
system just with

    rc-test &
    rc-test localhost

On a system with a PCI-X HCA, this works perfectly. However, on a
system with Arbel HCAs (with mem-free FW 5.1.0), I get the following
output (going on forever):

    local address: LID 0x0008
    remote address: LID 0x0007
    After 1.000066 sec, 104/4000 comps
    After 2.000276 sec, 104/4000 comps
    After 3.000295 sec, 104/4000 comps
    After 4.000332 sec, 104/4000 comps
    After 5.000375 sec, 104/4000 comps

which shows that only 104 out of the 4000 send/receive pairs ever
complete. On the other side I see the same number of completions. It
seems the HCA loses a bunch of doorbells, although IPoIB traffic
running in the background continues fine.
Viswa seems to have seen the same problem with Sinai & FW 1.0.1.
Let me know if you need more info.
Thanks,
Roland
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rc-test.c
Type: text/x-csrc
Size: 16011 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050912/f0dd9e3f/attachment.c>