[ofa-general] IB post send lost.
Bharath Ramesh
bramesh at vt.edu
Thu Nov 8 09:20:50 PST 2007
* Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
> Hi.
>
> I need some more info.
>
> Which IB HW do you use?
> (you can get this info from ibv_devinfo)
The IB HW in use is Mellanox Cougar cards.
Output of ibv_devinfo:
hca_id: mthca0
        fw_ver:           3.5.0
        node_guid:        0002:c901:08fe:76a0
        sys_image_guid:   0002:c901:08fe:76a3
        vendor_id:        0x02c9
        vendor_part_id:   23108
        hw_ver:           0xA1
        board_id:         MT_0000000001
        phys_port_cnt:    2
>
> Which IB SW do you use?
> (you can get this info from ofed_info)
The IB SW I am using is OFED 1.2. The Linux kernel is
2.6.21.1-xserve.
I am not sure whether this helps, but every time I send a message I
wait for an ack. The wait is a pthread_cond_wait, so when the message
is dropped my thread blocks on pthread_cond_wait forever. The other
thread, which occasionally sends messages, can still send and receive
over the same QP: it blocks for its ack and receives it, while this
thread never gets its ack because of the dropped message. To verify
that messages were being dropped, I printed every message sent and
received at both ends. The dropped message is sent, but the receiver
never receives it.
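
A hang like this can also come from the wait itself rather than from
the fabric: if the receive thread signals the condition variable before
the sender reaches pthread_cond_wait, the wakeup is lost. Nothing in
this thread shows that this is what happens here. The following is a
minimal sketch, not the application's code, of an ack wait that checks a
predicate under the mutex and uses pthread_cond_timedwait, so a missing
ack shows up as a timeout instead of a permanent block. The struct
ack_waiter type and the ack_waiter_* functions are hypothetical names
introduced only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-request ack state (not from the original application). */
struct ack_waiter {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            acked;  /* predicate, set by the receive thread */
};

void ack_waiter_init(struct ack_waiter *w)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->cond, NULL);
    w->acked = false;
}

/* Called by the receive thread when the ack message arrives. */
void ack_waiter_signal(struct ack_waiter *w)
{
    pthread_mutex_lock(&w->lock);
    w->acked = true;               /* record the ack even if nobody waits yet */
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Returns 0 once acked, or ETIMEDOUT if no ack arrives within timeout_ms. */
int ack_waiter_wait(struct ack_waiter *w, long timeout_ms)
{
    assert(timeout_ms >= 0);

    /* Build an absolute deadline; the default condvar clock is CLOCK_REALTIME. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    int rc = 0;
    pthread_mutex_lock(&w->lock);
    /* Loop on the predicate: handles spurious wakeups and early signals. */
    while (!w->acked && rc == 0)
        rc = pthread_cond_timedwait(&w->cond, &w->lock, &deadline);
    bool acked = w->acked;
    w->acked = false;              /* reset for the next request */
    pthread_mutex_unlock(&w->lock);

    return acked ? 0 : rc;
}
```

With this shape, a signal that arrives before the wait is still seen
because the flag is checked first, and a timeout gives a place to log
the sequence number of the request that never got its ack.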
Thanks,
Bharath
>
>
> Dotan
>
> Bharath Ramesh wrote:
>> * Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
>>
>>> Hi.
>>>
>>> Bharath Ramesh wrote:
>>>
>>>> I have a multi-threaded application. My application has its own message
>>>> exchange protocol and uses IB as the communication layer. I send a lot
>>>> of messages, normally on the order of a few tens of thousands. After
>>>> some time it seems that one message from one of the nodes is lost. I am
>>>> using the RC QP type. This causes the thread to deadlock. The other
>>>> threads are still able to exchange messages without any problem
>>>> over the same QP. Both ends are using SRQs and there are sufficient
>>>> buffers posted so that I don't run out of buffers. I even tried doubling
>>>> the buffers posted and I see the same problem again: one message is lost.
>>>> ibv_post_send doesn't report any error. I am trying to get this done
>>>> for a conference deadline early next week. I would really appreciate any
>>>> help in suggesting possibilities that might cause the message to be
>>>> dropped without any error being returned.
>>>>
>>> If you don't have any bugs in your code, the described scenario should
>>> work.
>>>
>>> I need some more info in order to try to help you:
>>>
>>> Do you use the same QP from several threads (and post send from all of
>>> them)?
>>>
>>
>> Yes, I use the same QP from three threads. The application has close
>> to 5 threads. The receives are handled by a single thread. Most of the
>> sends are posted by a single thread. Occasionally a third thread posts a
>> few sends to the QP. The same QP is also used for RDMA Writes. The
>> majority of the RDMA Writes are also performed by the same thread that
>> posts the majority of the send messages.
>>
>>
>>> How do you poll the CQ (several threads/one)?
>>>
>>
>> I have two CQs, one for receive and the other for send. The receive CQ
>> is polled only by the receive thread. The send CQ is polled by the three
>> threads: occasionally by the receiver thread, to clear out any send CQEs,
>> because I use IBV_SEND_SIGNALED for every 16 IBV_SEND_INLINEs. Otherwise
>> the send CQ is polled by the single thread that does the majority of the
>> sends. Occasionally the third thread, when doing a send, might poll the
>> send CQ as well for a completion CQE in the case of an RDMA Write.
>>
>>
>>> which HW/SW do you use?
>>>
>>
>> I am using Yellow Dog Linux 5.0 on Apple Xserves.
>>
>> Thanks,
>>
>> Bharath
>>
>> ---
>> Bharath Ramesh <bramesh at vt.edu>
>> http://people.cs.vt.edu/~bramesh
>>
>>
>>
>
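
The quoted exchange mentions requesting a completion (IBV_SEND_SIGNALED)
for one in every 16 inline sends while three threads post to the same
QP. A minimal sketch of how such a decision could be shared safely
between threads is shown here. It is an assumption about one possible
structure, not the application's code: struct signal_policy,
signal_policy_init, signal_policy_next and SIGNAL_INTERVAL are
hypothetical names, and the sketch deliberately contains no verbs calls
so that it compiles without libibverbs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical interval, mirroring the "every 16" mentioned in the thread. */
#define SIGNAL_INTERVAL 16u

/* A power of two keeps the pattern intact when the counter wraps around. */
static_assert((SIGNAL_INTERVAL & (SIGNAL_INTERVAL - 1u)) == 0,
              "SIGNAL_INTERVAL must be a power of two");

/* Hypothetical shared state: one instance per QP, used by all posting threads. */
struct signal_policy {
    atomic_uint posted;  /* number of sends counted so far */
};

void signal_policy_init(struct signal_policy *p)
{
    atomic_init(&p->posted, 0);
}

/*
 * Returns true for every SIGNAL_INTERVAL-th call across all threads.
 * The caller would then set IBV_SEND_SIGNALED in the work request's
 * send_flags before calling ibv_post_send.
 */
bool signal_policy_next(struct signal_policy *p)
{
    unsigned n = atomic_fetch_add(&p->posted, 1) + 1;
    return (n % SIGNAL_INTERVAL) == 0;
}
```

The atomic counter only guarantees that one call in every 16 returns
true. It does not guarantee that the counting order matches the order in
which the threads actually post. If that matters, taking one lock around
both the counter update and ibv_post_send is the simpler choice. Either
way, the send queue depth has to leave room for the unsignaled work
requests, since they are only retired once a later signaled request
completes and its CQE is polled.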
---
Bharath Ramesh <bramesh at vt.edu> http://people.cs.vt.edu/~bramesh