[openib-general] resend [PATCH] rping.c: Fix hang if either the server or the client exits early
Pradipta Kumar Banerjee
bpradip at in.ibm.com
Fri Jun 23 09:09:50 PDT 2006
Steve Wise wrote:
> On Fri, 2006-06-23 at 18:20 +0530, Pradipta Kumar Banerjee wrote:
>> Steve Wise wrote:
>>> The goal of adding the return codes was so that the rping program could
>>> exit with a status indicating success or failure. Every rping run
>>> results in a DISCONNECT event, so I don't think we want to treat that
>>> case as an error.
>> DISCONNECT event will be generated when the connection is closed or in case of
>> some error (like CCAE_LLP_CONNECTION_LOST, CCAE_BAD_CLOSE in case of Ammasso
>> driver etc).
>
> You'll also get the DISCONNECT event when one side finishes the rping
> loops and does rdma_disconnect(). So receiving that event isn't
> necessarily an error...
Yes definitely, but this event can _also_ be received due to errors!!
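
To make that distinction concrete, here is a rough, untested sketch of how a
DISCONNECT could be mapped to an exit status. The struct and function names
are made up for illustration and are not what rping.c uses today; it also
assumes each side keeps track of how many loops it has completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping, not the control block layout from rping.c. */
struct run_state {
    bool connected;        /* connection reached the established state */
    bool local_disconnect; /* we already called rdma_disconnect() ourselves */
    int  loops_done;       /* ping/pong iterations completed so far */
    int  loops_wanted;     /* requested iteration count, 0 = no limit */
};

/*
 * Map a DISCONNECT event to an exit status: 0 for a normal end of run,
 * -1 when the connection went away before the work was finished.
 * With no count set (loops_wanted == 0) a peer disconnect cannot be told
 * apart from an error here, so it is treated as a normal end.
 */
static int disconnect_to_status(const struct run_state *rs)
{
    assert(rs != NULL);
    if (!rs->connected)
        return -1;      /* dropped before we ever connected */
    if (rs->local_disconnect)
        return 0;       /* we initiated the teardown */
    if (rs->loops_done >= rs->loops_wanted)
        return 0;       /* peer left after all loops finished */
    return -1;          /* peer or link vanished mid-run */
}
```

The event handler would then return this value instead of a fixed one, so a
clean run still exits with 0.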
>
>
>>> Also, can you explain why this fixes Amith's problem, which sounded like
>>> a process was hanging?
>>>
>> On debugging I found that the main thread was blocked in ibv_destroy_cq(),
>> cm_thread was blocked in rdma_get_cm_event->write() and cq_thread was blocked in
>> ibv_get_cq_event->read
>> Taking the return value of the DISCONNECT event into consideration forcefully
>> killed the process.
>> On delving deeper into this problem, I think that there is more to this rping
>> hang. Let me work on this further.
>>
>
> I think rping needs some coordination on these threads and when they
> should be killed.
>
Right..
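
One possible shape for that coordination, as an untested sketch: each helper
thread polls its event fd together with a wake-up pipe, so the main thread can
ask it to stop and join it before destroying anything. All names below are
hypothetical and none of this is in rping.c; it assumes the channel fd the
thread blocks on can be used with poll().

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical worker context; event_fd stands in for a completion
 * channel or CM event channel file descriptor. */
struct worker {
    pthread_t tid;
    int event_fd;     /* fd the thread normally blocks on */
    int wake_fd[2];   /* pipe: main writes to [1] to request shutdown */
    int events_seen;  /* only read by main after the join */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    struct pollfd fds[2] = {
        { .fd = w->event_fd,   .events = POLLIN },
        { .fd = w->wake_fd[0], .events = POLLIN },
    };

    for (;;) {
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;            /* interrupted, just retry */
            break;                   /* give up on any other failure */
        }
        if (fds[1].revents & (POLLIN | POLLHUP))
            break;                   /* shutdown requested by main */
        if (fds[0].revents & POLLIN) {
            char c;
            if (read(w->event_fd, &c, 1) != 1)
                break;
            w->events_seen++;        /* real code would get and ack the event */
        } else if (fds[0].revents & (POLLHUP | POLLERR)) {
            break;                   /* event source went away */
        }
    }
    return NULL;
}

static int worker_start(struct worker *w, int event_fd)
{
    assert(w != NULL);
    w->event_fd = event_fd;
    w->events_seen = 0;
    if (pipe(w->wake_fd) < 0)
        return -1;
    if (pthread_create(&w->tid, NULL, worker_main, w) != 0) {
        close(w->wake_fd[0]);
        close(w->wake_fd[1]);
        return -1;
    }
    return 0;
}

/* Wake the thread, wait for it to exit, then release the pipe.
 * Only after this returns should main destroy what the thread used. */
static int worker_stop(struct worker *w)
{
    char c = 'q';

    assert(w != NULL);
    if (write(w->wake_fd[1], &c, 1) != 1)
        return -1;
    if (pthread_join(w->tid, NULL) != 0)
        return -1;
    close(w->wake_fd[0]);
    close(w->wake_fd[1]);
    return 0;
}
```

The point of the pipe is that no thread is left sitting in a blocking read()
while the main thread tears down the CQ or the event channel.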
Thanks,
Pradipta
>> On a related note - I noticed another rping hang in the following case
>> - Start the rping as a client without first starting an rping server
>> - If you are lucky, the first run itself will leave the 'lt-rping' process in
>> 'D' state. If not, repeating the procedure will result in the hang.
>>
>> This is the o/p.
>>
>> cq completion failed status 5
>> wait for CONNECTED state 10
>> connect error -1
>>
>> Thanks,
>> Pradipta.
>>
>>
>>> Thanks,
>>>
>>> Steve.
>>>
>>>
>>>
>>> On Fri, 2006-06-23 at 00:53 +0530, Pradipta Kumar Banerjee wrote: