[ewg] rping/cxgb3 regression

Steve Wise swise at opengridcomputing.com
Tue Feb 15 17:56:35 PST 2011


I pulled it down, built/installed it on 2 nodes, then ran a bunch of 
rpings.  No hangs.  Looks good!

Thanks Sean.  Sorry about this.

Steve.

On 2/15/2011 7:46 PM, Hefty, Sean wrote:
> I placed a 1.0.14.1 package on the ofa server in the downloads/rdmacm section.  Can you verify that it works?  If so, I'll ask to pull it into 1.5.3
>
>> -----Original Message-----
>> From: Steve Wise [mailto:swise at opengridcomputing.com]
>> Sent: Tuesday, February 15, 2011 10:37 AM
>> To: Hefty, Sean
>> Cc: OpenFabrics EWG; Tziporet Koren
>> Subject: Re: rping/cxgb3 regression
>>
>>
>> On 02/15/2011 12:18 PM, Hefty, Sean wrote:
>>>> I'm wondering if pulling the rping changes for ofed-1.5.3 would be ok?  I
>>>> guess to do this you would have to push a 1-off librdmacm without those
>>>> changes?  Or maybe back up what is in OFED-1.5.3 to the previous release
>>>> without this rping change?
>>>>
>>>> Thoughts?
>>> Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the issue?
>>> (Btw, the author listed in my git tree is wrong.)
>> Yes.
>>
>>> I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe reverting
>>> this change and pushing out 1.0.14.1 would work.  There's just one other
>>> change after 1.0.14 at the moment, and it's to the build, so I'd skip a
>>> full release for now.
>>> Let me know if you think this would work.
>>>
>> I just tested that removing this from 1.0.14 will resolve the issue for
>> 1.5.3.
>>
>>
>>> - Sean
>>>
>>> ---
>>>
>>>       librdmacm/rping: Make sure CQ event thread exits before destroying the CQ
>>>
>>>       It is possible for the CQ event thread to poll the CQ after it has been
>>>       destroyed which can result in a seg fault on T3 interfaces.  This patch
>>>       waits for the thread to exit before destroying the CQ.
>>>
>>>       Signed-off-by: Steve Wise<swise at opengridcomputing.com>
>>>       Signed-off-by: Sean Hefty<sean.hefty at intel.com>
>>>
>>> diff --git a/examples/rping.c b/examples/rping.c
>>> index 2d4c2de..ee292ec 100644
>>> --- a/examples/rping.c
>>> +++ b/examples/rping.c
>>> @@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
>>>                   ret = 0;
>>>
>>>                   if (wc.status) {
>>> -                       if (wc.status != IBV_WC_WR_FLUSH_ERR) {
>>> +                       if (wc.status != IBV_WC_WR_FLUSH_ERR)
>>>                                   fprintf(stderr,
>>>                                           "cq completion failed status %d\n",
>>>                                           wc.status);
>>> -                               ret = -1;
>>> -                       }
>>> +                       ret = -1;
>>>                           goto error;
>>>                   }
>>>
>>> @@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
>>>           rping_test_server(cb);
>>>           rdma_disconnect(cb->child_cm_id);
>>> +       pthread_join(cb->cqthread, NULL);
>>>           rping_free_buffers(cb);
>>>           rping_free_qp(cb);
>>> -       pthread_cancel(cb->cqthread);
>>> -       pthread_join(cb->cqthread, NULL);
>>>           rdma_destroy_id(cb->child_cm_id);
>>>           free_cb(cb);
>>>           return NULL;
>>> @@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
>>>
>>>           rping_test_server(cb);
>>>           rdma_disconnect(cb->child_cm_id);
>>> +       pthread_join(cb->cqthread, NULL);
>>>           rdma_destroy_id(cb->child_cm_id);
>>>    err2:
>>>           rping_free_buffers(cb);
>>> @@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
>>>
>>>           rping_test_client(cb);
>>>           rdma_disconnect(cb->cm_id);
>>> +       pthread_join(cb->cqthread, NULL);
>>>    err2:
>>>           rping_free_buffers(cb);
>>>    err1:
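
For anyone skimming the quoted patch: the ordering it introduces (let the CQ
thread finish, then tear down what it polls) can be sketched without any RDMA
hardware.  The following is a minimal illustration using plain pthreads, not
code from rping.  The names fake_cq, cq_thread and run_and_teardown are
hypothetical, and the stop flag is only a stand-in for whatever makes the real
thread exit after rdma_disconnect().

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a completion queue; not a libibverbs type. */
struct fake_cq {
    atomic_int  stop;   /* set by the owner when the connection goes away */
    atomic_long polls;  /* how many times the worker touched the queue */
};

/* Worker: keeps "polling" the queue until it is told to stop. */
static void *cq_thread(void *arg)
{
    struct fake_cq *cq = arg;

    while (!atomic_load(&cq->stop))
        atomic_fetch_add(&cq->polls, 1);
    return NULL;
}

/*
 * Start the worker, let it poll at least once, then tear down in the safe
 * order: signal stop, join the worker, and only then free the queue.
 * Returns 0 on success and -1 on failure; *polls_out gets the poll count.
 */
static int run_and_teardown(long *polls_out)
{
    struct fake_cq *cq = malloc(sizeof(*cq));
    pthread_t tid;

    if (!cq)
        return -1;
    atomic_init(&cq->stop, 0);
    atomic_init(&cq->polls, 0);

    if (pthread_create(&tid, NULL, cq_thread, cq) != 0) {
        free(cq);
        return -1;
    }

    /* Make sure the worker is really running before we shut it down. */
    while (atomic_load(&cq->polls) == 0)
        sched_yield();

    atomic_store(&cq->stop, 1);   /* assumed analogue of the disconnect */
    pthread_join(tid, NULL);      /* worker is gone after this returns */

    *polls_out = atomic_load(&cq->polls);
    free(cq);                     /* safe: nobody can poll it any more */
    return 0;
}
```

Calling run_and_teardown(&n) from main() should return 0 and leave n greater
than zero.  Note that pthread_join() with no preceding pthread_cancel() blocks
for as long as the worker keeps running, so this ordering is only safe when the
worker is guaranteed to see its stop condition.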



