[ofa-general] IB post send lost.

Dotan Barak dotanb at dev.mellanox.co.il
Mon Nov 12 06:43:26 PST 2007


Hi.

How much time does it take to reproduce this failure?

Thanks,
Dotan

Bharath Ramesh wrote:
> * Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
>   
>> Hi.
>>
>> I need some more info.
>>
>> Which IB HW do you use?
>> (you can get this info from ibv_devinfo)
>>     
>
> The IB HW is Mellanox Cougar cards.
>
> output of ibv_devinfo:
> hca_id: mthca0
>         fw_ver:                         3.5.0
>         node_guid:                      0002:c901:08fe:76a0
>         sys_image_guid:                 0002:c901:08fe:76a3
>         vendor_id:                      0x02c9
>         vendor_part_id:                 23108
>         hw_ver:                         0xA1
>         board_id:                       MT_0000000001
>         phys_port_cnt:                  2
>
>   
>> Which IB SW do you use?
>> (you can get this info from ofed_info)
>>     
>
> The IB SW I am using is OFED 1.2. The Linux kernel is
> 2.6.21.1-xserve.
>
> I am not sure if this helps. Every time I send a message, I wait for an
> ack to be received, blocking in pthread_cond_wait. Since the message gets
> dropped, my thread is blocked in pthread_cond_wait forever. The other
> thread, which occasionally sends messages, is still able to send and
> receive messages over the QP: it blocks for its ack and receives it,
> while this thread never receives its ack because of the dropped message.
> To verify that messages were being dropped, I printed every single
> message sent and received on both ends. The dropped message is sent, but
> the receiver never receives it.
>
> Thanks,
>
> Bharath
>
>   
>> Dotan
>>
>> Bharath Ramesh wrote:
>>     
>>> * Dotan Barak (dotanb at dev.mellanox.co.il) wrote:
>>>   
>>>       
>>>> Hi.
>>>>
>>>> Bharath Ramesh wrote:
>>>>     
>>>>         
>>>>> I have a multi-threaded application. It has its own message exchange
>>>>> protocol and uses IB as the communication layer. I send a lot of
>>>>> messages, normally on the order of a few tens of thousands. After some
>>>>> time, it seems that one message from one of the nodes is lost. I am
>>>>> using the RC QP type. The loss causes the thread to deadlock. The other
>>>>> threads are still able to exchange messages without any problem over
>>>>> the same QP. Both ends are using SRQs, and there are sufficient buffers
>>>>> posted so that I don't run out of buffers. I even tried doubling the
>>>>> number of buffers posted and I see the same problem again: one message
>>>>> is lost. ibv_post_send doesn't report any error. I am trying to get
>>>>> this done for a conference deadline early next week. I would really
>>>>> appreciate any suggestions about what might cause the message to be
>>>>> dropped without any error being returned.
>>>>>         
>>>>>           
>>>> If you don't have any bugs in your code, the described scenario should 
>>>> work.
>>>>
>>>> I need some more info in order to try to help you:
>>>>
>>>> Do you use the same QP from several threads (and post send from all of 
>>>> them)?
>>>>     
>>>>         
>>> Yes, I use the same QP from three threads. The application has close
>>> to 5 threads. The receives are handled by a single thread. Most of the
>>> sends are posted by a single thread. Occasionally a third thread posts a
>>> few sends to the QP. The same QP is also used for RDMA Writes. The
>>> majority of the RDMA Writes are also performed by the same thread that
>>> posts the majority of the send messages.
>>>
>>>   
>>>       
>>>> How do you poll the CQ (several threads/one)?
>>>>     
>>>>         
>>> I have two CQs, one for receive and the other for send. The receive CQ
>>> is polled only by the receive thread. The send CQ is polled by the three
>>> threads. Occasionally it is polled by the receiver thread to clear out
>>> any send CQEs, because I use IBV_SEND_SIGNALED for every 16
>>> IBV_SEND_INLINEs. Otherwise the send CQ is polled by the single thread
>>> that does the majority of the sends. Occasionally the third thread, when
>>> doing a send, might also poll the send CQ for the completion CQE of an
>>> RDMA Write.
>>>
>>>   
>>>       
>>>> Which HW/SW do you use?
>>>>     
>>>>         
>>> I am using Yellow Dog Linux 5.0 on Apple Xserves.
>>>
>>> Thanks,
>>>
>>> Bharath
>>>
>>> ---
>>> Bharath Ramesh       <bramesh at vt.edu>       
>>> http://people.cs.vt.edu/~bramesh
>>>
>>>
>>>   
>>>       
>
> ---
> Bharath Ramesh       <bramesh at vt.edu>       http://people.cs.vt.edu/~bramesh
>
>
>   
