[ofa-general] IB post send lost.
Dotan Barak
dotanb at dev.mellanox.co.il
Mon Nov 12 06:43:26 PST 2007
Hi.
How much time does it take to reproduce this failure?
Thanks
Dotan
Bharath Ramesh wrote:
> * Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
>
>> Hi.
>>
>> I need some more info.
>>
>> Which IB HW do you use?
>> (you can get this info from ibv_devinfo)
>>
>
> The IB HW used is Mellanox Cougar cards.
>
> output of ibv_devinfo:
> hca_id: mthca0
> fw_ver: 3.5.0
> node_guid: 0002:c901:08fe:76a0
> sys_image_guid: 0002:c901:08fe:76a3
> vendor_id: 0x02c9
> vendor_part_id: 23108
> hw_ver: 0xA1
> board_id: MT_0000000001
> phys_port_cnt: 2
>
>
>> Which IB SW do you use?
>> (you can get this info from ofed_info)
>>
>
> The IB SW I am using is OFED 1.2. The Linux kernel used is
> 2.6.21.1-xserve.
>
> I am not sure if this might help. Basically, every time I send a message
> I wait for an ack to be received. I wait on a pthread_cond_wait. Since
> the message gets dropped, my thread is blocked on pthread_cond_wait
> forever. The other thread, which occasionally sends messages, is still
> able to send/receive messages over the QP: it blocks for its ack and
> receives it, while this thread never receives its ack because of the
> dropped message. To verify that messages were being dropped, I printed
> every single message being sent and received on either end. The dropped
> message is sent, but the receiver never receives it.
>
> Thanks,
>
> Bharath
>
>
>> Dotan
>>
>> Bharath Ramesh wrote:
>>
>>> * Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
>>>
>>>
>>>> Hi.
>>>>
>>>> Bharath Ramesh wrote:
>>>>
>>>>
>>>>> I have a multi-threaded application. My application has its own message
>>>>> exchange protocol; it uses IB as the communication layer. I send a lot
>>>>> of messages, normally on the order of a few tens of thousands. After
>>>>> some time it seems like one message from one of the nodes is lost. I am
>>>>> using the RC QP type. This causes the thread to deadlock. The other
>>>>> threads are still able to communicate, exchanging messages without any
>>>>> problem over the same QP. Both ends are using SRQs and there are
>>>>> sufficient buffers posted so that I don't run out of buffers. I even
>>>>> tried doubling the buffers posted, and I see the same problem again: one
>>>>> message is lost. The ibv_post_send doesn't report any error. I am trying
>>>>> to get this done for a conference deadline early next week. I would
>>>>> really appreciate any help in suggesting possibilities which might cause
>>>>> the message to be dropped without any error being returned.
>>>>>
>>>>>
>>>> If you don't have any bugs in your code, the described scenario should
>>>> work.
>>>>
>>>> I need some more info in order to try to help you:
>>>>
>>>> Do you use the same QP from several threads (and post send from all of
>>>> them)?
>>>>
>>>>
>>> Yes, I use the same QP from three threads. The application has close
>>> to 5 threads. The receives are handled by a single thread. Most of the
>>> sends are posted by a single thread. Occasionally a third thread posts a
>>> few sends to the QP. The same QP is also used for RDMA Writes. The
>>> majority of the RDMA Writes are also performed by the same thread that
>>> posts the majority of the send messages.
>>>
>>>
>>>
>>>> How do you poll the CQ (several threads/one)?
>>>>
>>>>
>>> I have two CQs, one for receive and the other for send. The receive CQ
>>> is polled only by the receive thread. The send CQ is polled by the three
>>> threads: occasionally by the receiver thread, to clear out any send CQEs,
>>> because I use IBV_SEND_SIGNALED for every 16 IBV_SEND_INLINEs. Otherwise
>>> the send CQ is polled by the single thread that does the majority of the
>>> sends. Occasionally the third thread, when doing a send, might poll the
>>> send CQ as well for the completion CQE in the case of an RDMA Write.
>>>
>>>
>>>
>>>> Which HW/SW do you use?
>>>>
>>>>
>>> I am using Yellow Dog Linux 5.0 on Apple Xserves.
>>>
>>> Thanks,
>>>
>>> Bharath
>>>
>>> ---
>>> Bharath Ramesh <bramesh at vt.edu>
>>> http://people.cs.vt.edu/~bramesh
>>>
>>>
>>>
>>>
>
> ---
> Bharath Ramesh <bramesh at vt.edu> http://people.cs.vt.edu/~bramesh
>
>
>