strange mem-free bug (was: [openib-general] completion Q overflow error/panic)

Roland Dreier rolandd at cisco.com
Mon Sep 12 17:30:10 PDT 2005


While looking at Viswa's example, I've found what seems to be a
problem using lots of QPs on mem-free HCAs.  This could easily be an
mthca driver bug, but I'd appreciate it if Mellanox would take a look
and help track down the issue.  I looked at the mthca code and don't
see anything wrong, so either narrowing down the software bug or
telling me it's actually a FW/HW bug would be great.

I'm attaching a fairly simple program that shows the problem on my
systems.  It just creates a bunch of QPs and has one side send one
message from each QP.  The other side waits for receives and sends a
reply back for every receive it gets.  When all the replies are
received, it loops around and does it again.

To build the example, just do:

   gcc -o rc-test rc-test.c -libverbs

To run, do

    rc-test

on one system, and

    rc-test <listening address>

on the other.  In fact, I can reproduce the problem even on a single
system just with

    rc-test &
    rc-test localhost

On a system with a PCI-X HCA, this works perfectly.  However, on a
system with Arbel HCAs (with mem-free FW 5.1.0), I get the following
output (going on forever):

      local address:  LID 0x0008
      remote address: LID 0x0007
    After 1.000066 sec, 104/4000 comps
    After 2.000276 sec, 104/4000 comps
    After 3.000295 sec, 104/4000 comps
    After 4.000332 sec, 104/4000 comps
    After 5.000375 sec, 104/4000 comps

which shows that only 104 out of the 4000 send/receive pairs ever
complete.  On the other side I see the same number of completions.  It
seems the HCA loses a bunch of doorbells, although IPoIB traffic
running in the background continues fine.
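
For anyone who wants to flag this condition automatically rather than
watch the counter by eye, here is a minimal sketch in C.  It is not
taken from rc-test.c: the function names format_progress and
is_stalled, and the idea of a sample window, are assumptions made up
for this illustration.  It only reproduces the progress line format
shown above and treats a completion count that stays below the total
and does not change for several consecutive samples as a stall.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write one progress line in the same format as
 * the output shown above.  Returns the value returned by snprintf().
 */
int format_progress(char *buf, size_t len, double elapsed_sec,
                    int done, int total)
{
    return snprintf(buf, len, "After %f sec, %d/%d comps",
                    elapsed_sec, done, total);
}

/*
 * Hypothetical stall check: samples[] holds the completion count
 * recorded at each reporting interval, oldest first.  Returns 1 if the
 * last `window` samples are all equal and still below `total`,
 * otherwise 0.
 */
int is_stalled(const int *samples, int n, int window, int total)
{
    assert(window > 0);

    if (n < window)
        return 0;               /* not enough history yet */

    int last = samples[n - 1];
    if (last >= total)
        return 0;               /* everything completed */

    for (int i = n - window; i < n; i++)
        if (samples[i] != last)
            return 0;           /* still making progress */

    return 1;
}
```

With the five samples above (104 completions each time, out of 4000),
is_stalled() with a window of 3 reports a stall.  The window size is
arbitrary; a slow but healthy run would need a larger one.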

Viswa seems to have seen the same problem with Sinai & FW 1.0.1.

Let me know if you need more info.

Thanks,
  Roland

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rc-test.c
Type: text/x-csrc
Size: 16011 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20050912/f0dd9e3f/attachment.c>