[openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early

Pradipta Kumar Banerjee bpradip at in.ibm.com
Fri Jun 23 09:09:50 PDT 2006


Steve Wise wrote:
> On Fri, 2006-06-23 at 18:20 +0530, Pradipta Kumar Banerjee wrote:
>> Steve Wise wrote:
>>> The goal of adding the return codes was so that the rping program could
>>> exit with a status indicating success or failure.  Every rping run
>>> results in a DISCONNECT event, so I don't think we want to treat that
>>> case as an error.
>> DISCONNECT event will be generated when the connection is closed or in case of 
>> some error (like CCAE_LLP_CONNECTION_LOST, CCAE_BAD_CLOSE in case of Ammasso 
>> driver etc).
> 
> You'll also get the DISCONNECT event when one side finishes the rping
> loops and calls rdma_disconnect().  So receiving that event isn't
> necessarily an error...
Yes definitely, but this event can _also_ be received due to errors!!
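
One possible way to reconcile the two cases, as a rough sketch only: count a
DISCONNECT as success if the local side had already finished its loops or had
called rdma_disconnect() itself, and as an error otherwise. The enum, struct
and function names below are hypothetical and are not the ones used in
rping.c. A side that simply waits for its peer to finish would still need
some extra signal to mark itself done before the event arrives.

```c
#include <stdbool.h>

/* Hypothetical, simplified connection states; not the rping.c state enum. */
enum conn_state { ST_CONNECTING, ST_CONNECTED, ST_DONE, ST_ERROR };

struct conn {
    enum conn_state state;
    bool local_disconnect; /* set just before this side calls rdma_disconnect() */
};

/*
 * Called when a DISCONNECT event is received.
 * Returns 0 if the disconnect was expected (test finished or we initiated it),
 * -1 if it arrived while the test was still in progress.
 */
static int on_disconnect(struct conn *c)
{
    if (c->local_disconnect || c->state == ST_DONE) {
        c->state = ST_DONE;
        return 0;
    }
    c->state = ST_ERROR;
    return -1;
}
```

The return value could then feed the process exit status without turning
every normal run into a failure.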
> 
> 
>>> Also, can you explain why this fixes Amith's problem, which sounded like
>>> a process was hanging?
>>>
>> On debugging I found that the main thread was blocked in ibv_destroy_cq(), 
>> cm_thread was blocked in rdma_get_cm_event()->write() and cq_thread was blocked in 
>> ibv_get_cq_event()->read().
>> Taking the return value of the DISCONNECT event into consideration made the 
>> process exit forcefully instead of hanging.
>> On delving deeper into this problem, I think that there is more to this rping 
>> hang. Let me work on this further.
>>
> 
> I think rping needs some coordination on these threads and when they
> should be killed. 
> 
Right..
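
A minimal sketch of one way such coordination could look, and not what
rping.c currently does: the event thread polls both its event fd and a
private "stop" pipe, and the main thread wakes it and joins it before
destroying anything the thread uses. An ordinary fd stands in here for the
fd of an ibv_comp_channel or rdma_event_channel, and all names (struct
worker, worker_start, worker_stop) are hypothetical. Build with -pthread.

```c
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

struct worker {
    pthread_t tid;
    int event_fd;    /* stand-in for the channel fd the thread waits on */
    int stop_fd[2];  /* pipe used only to wake the thread for shutdown */
    int events_seen; /* read by the main thread only after pthread_join() */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    struct pollfd fds[2] = {
        { .fd = w->event_fd,   .events = POLLIN },
        { .fd = w->stop_fd[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        if (fds[1].revents)          /* stop requested: leave the loop */
            break;
        if (fds[0].revents & POLLIN) {
            char b;
            /* Real code would call ibv_get_cq_event() or similar here. */
            if (read(w->event_fd, &b, 1) != 1)
                break;
            w->events_seen++;
        } else if (fds[0].revents) { /* POLLHUP or POLLERR on the event fd */
            break;
        }
    }
    return NULL;
}

static int worker_start(struct worker *w, int event_fd)
{
    w->event_fd = event_fd;
    w->events_seen = 0;
    if (pipe(w->stop_fd))
        return -1;
    if (pthread_create(&w->tid, NULL, worker_main, w)) {
        close(w->stop_fd[0]);
        close(w->stop_fd[1]);
        return -1;
    }
    return 0;
}

/* Wake the thread, wait for it to exit, then release the stop pipe. */
static int worker_stop(struct worker *w)
{
    char b = 1;

    if (write(w->stop_fd[1], &b, 1) != 1)
        return -1; /* could not signal; joining now might block forever */
    if (pthread_join(w->tid, NULL))
        return -1;
    close(w->stop_fd[0]);
    close(w->stop_fd[1]);
    return 0;
}
```

With this shape the main thread would call worker_stop() for each event
thread first and only then destroy the CQ and channels, so no thread is left
blocked in read() during teardown.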

Thanks,
Pradipta


>> On a related note - I noticed another rping hang in the following case
>> - Start the rping as a client without first starting an rping server
>> - If you are lucky, the first run itself will leave the 'lt-rping' process in 
>> 'D' state. If not, repeating the procedure will result in the hang.
>>
>> This is the output.
>>
>> cq completion failed status 5
>> wait for CONNECTED state 10
>> connect error -1
>>
>> Thanks,
>> Pradipta.
>>
>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>>
>>>
>>> On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: