[ofa-general] IB post send lost.

Bharath Ramesh bramesh at vt.edu
Wed Nov 7 22:19:10 PST 2007


* Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
> Hi.
>
> Bharath Ramesh wrote:
>> I have a multi-threaded application with its own message exchange
>> protocol, which uses IB as the communication layer. I send a lot of
>> messages, normally on the order of a few tens of thousands. After some
>> time, it seems that one message from one of the nodes is lost. I am
>> using the RC QP type. This causes the affected thread to deadlock. The
>> other threads are still able to exchange messages without any problem
>> over the same QP. Both ends use SRQs, and enough buffers are posted that
>> I don't run out of them. I even tried doubling the number of posted
>> buffers and I see the same problem again: one message is lost.
>> ibv_post_send doesn't report any error. I am trying to get this done
>> for a conference deadline early next week. I would really appreciate
>> any suggestions about what might cause a message to be dropped without
>> any error being returned.
>>   
> If you don't have any bugs in your code, the described scenario should 
> work.
>
> I need some more info in order to try to help you:
>
> Do you use the same QP from several threads (and post send from all of 
> them)?

Yes, I use the same QP from three threads. The application has about
five threads. Receives are handled by a single thread. Most of the sends
are posted by a single thread, and occasionally a third thread posts a
few sends to the QP. The same QP is also used for RDMA Writes; most of
them are performed by the same thread that posts most of the send
messages.

> How do you poll the CQ (several threads/one)?

I have two CQs, one for receives and the other for sends. The receive CQ
is polled only by the receive thread. The send CQ is polled by all three
threads. The receive thread occasionally polls it to clear out send
CQEs, because I set IBV_SEND_SIGNALED on one of every 16 IBV_SEND_INLINE
sends. Otherwise, the send CQ is polled by the thread that does most of
the sends. Occasionally, when the third thread issues an RDMA Write, it
also polls the send CQ for that write's completion CQE.
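With selective signaling, the send queue slots used by unsignaled work requests are only reclaimed once a later signaled completion is polled, so a "one in 16" rule has to count posts from every thread that shares the send queue. A minimal sketch using a C11 atomic counter follows; `SIGNAL_INTERVAL`, `struct sq_signal_state` and `should_signal()` are hypothetical names, and in real code a `true` result would mean OR-ing `IBV_SEND_SIGNALED` into `ibv_send_wr.send_flags`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical interval matching the "one signaled send in 16" scheme. */
#define SIGNAL_INTERVAL 16u

/* One instance per send queue, shared by every thread that posts to it. */
struct sq_signal_state {
    atomic_uint posted; /* work requests posted so far, all threads */
};

/* Returns true if the work request about to be posted should carry
 * IBV_SEND_SIGNALED. atomic_fetch_add hands each caller a distinct
 * ticket, so exactly one ticket in every SIGNAL_INTERVAL is signaled
 * regardless of which thread draws it. Unsigned wrap-around keeps the
 * pattern because UINT_MAX + 1 is a power of two. */
static bool should_signal(struct sq_signal_state *s)
{
    unsigned ticket = atomic_fetch_add(&s->posted, 1u);
    return ticket % SIGNAL_INTERVAL == SIGNAL_INTERVAL - 1;
}
```

The ticket is taken before `ibv_post_send()`, so two racing threads can reach the send queue in the opposite order from their tickets, and the gap between signaled requests can occasionally exceed 16 by a few entries. Leaving some headroom in `max_send_wr`, or holding a per-QP mutex around both the ticket and the post, avoids relying on exact spacing.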

>
> Which HW/SW do you use?

I am using Yellow Dog Linux 5.0 on Apple Xserves.

Thanks,

Bharath

---
Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh



