[ofa-general] Bogus Receive Completions

Roman Kononov ofed at kononov.ftml.net
Fri Sep 5 13:07:34 PDT 2008


This is a continuation of
http://lists.openfabrics.org/pipermail/general/2007-December/043658.html

Basically, I have two processes on different computers talking to each other
over a single QP per process. They both post and receive
IBV_WR_RDMA_WRITE_WITH_IMM commands.

All Send Work Requests are sequentially numbered in the wr_id field. When the
process receives a Send Work Completion, its wr_id is checked for consistency
with the posted number. So far so good.

All Receive Work Requests are sequentially numbered in the wr_id field as well.
When the process gets a Receive Work Completion, its wr_id is checked for
consistency with the posted number. The consistency test eventually fails.
The Completion status is "success", but the wr_id is out of order.

I believe that wr_id from Receive Work Completions must arrive in order, but
they do not.
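
To make the check concrete, here is a minimal sketch of the kind of wr_id
consistency test described above. It is not the actual test program: the
struct and function names (wr_seq_tracker, wr_seq_next_id, wr_seq_check) are
hypothetical, and it assumes one tracker per queue with wr_id values starting
at 0 and increasing by 1 per posted Work Request.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-queue tracker (one for the SQ, one for the RQ). */
struct wr_seq_tracker {
    uint64_t next_posted;    /* wr_id to assign to the next posted WR */
    uint64_t next_expected;  /* wr_id expected in the next completion */
};

/* Return the wr_id for the next Work Request and advance the counter. */
static uint64_t wr_seq_next_id(struct wr_seq_tracker *t)
{
    return t->next_posted++;
}

/*
 * Compare a completed wr_id with the posting order.
 * Returns true when it matches. On a mismatch it reports the values and
 * leaves the expected wr_id unchanged so the caller can inspect the state.
 */
static bool wr_seq_check(struct wr_seq_tracker *t, uint64_t completed_wr_id)
{
    if (completed_wr_id != t->next_expected) {
        fprintf(stderr,
                "out-of-order completion: got 0x%llx, expected 0x%llx, "
                "%llu WRs outstanding\n",
                (unsigned long long)completed_wr_id,
                (unsigned long long)t->next_expected,
                (unsigned long long)(t->next_posted - t->next_expected));
        return false;
    }
    t->next_expected++;
    return true;
}
```

In a real program the value returned by wr_seq_next_id() would go into the
wr_id field of the Work Request before posting, and wr_seq_check() would be
called with wc.wr_id for each successful completion returned by
ibv_poll_cq().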

I managed to reproduce the failure reliably in my environment. Then I
modified mthca_tavor_post_recv() and mthca_tavor_post_send() to print all
wr->wr_id values passing through them, and I modified mthca_poll_cq() to
print all valid wc->wr_id values passing through it. The results from the
two processes are attached. In stdout.1.log, one can see that a Receive Work
Request with wr_id=0x7f was accepted and immediately completed, while the
Receive Queue still has 0x7f-0x40=0x3f uncompleted Work Requests. None of the
mthca_tavor_post_recv() calls returned an error.

This looks like a bug in libmthca or the firmware. I really need this fixed.
Where should I go from this point? Any suggestions are appreciated.

The QP is created with both SQ and RQ sizes set to 64, with a single CQ. The
CQ size is set to 128.

I have libibverbs-1.1.2 and libmthca-1.0.5 compiled from sources.

~>cat /etc/issue
CentOS release 5.2 (Final)
Kernel \r on an \m
~>uname -a
Linux node100 2.6.26.3 #1 SMP PREEMPT Wed Sep 3 14:11:03 CDT 2008 x86_64
x86_64 x86_64 GNU/Linux
~>grep 'model name' /proc/cpuinfo
model name      : Dual Core AMD Opteron(tm) Processor 285
model name      : Dual Core AMD Opteron(tm) Processor 285
~>ibv_devinfo
hca_id: mthca0
           fw_ver:                         4.8.200
           node_guid:                      0002:c902:0026:dbe0
           sys_image_guid:                 0002:c902:0026:dbe3
           vendor_id:                      0x02c9
           vendor_part_id:                 25208
           hw_ver:                         0xA0
           board_id:                       MT_02F0110002
           phys_port_cnt:                  2
...

Thanks,
Roman Kononov


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stdout.1.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20080905/3c346f33/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stdout.2.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20080905/3c346f33/attachment-0001.ksh>

