[ewg] rping/cxgb3 regression
Hefty, Sean
sean.hefty at intel.com
Tue Feb 15 18:00:05 PST 2011
Not a big deal.
Vlad, can you pull librdmacm 1.0.14.1 into the next OFED 1.5.3 RC? The only change versus 1.0.14 is reverting a patch to the rping sample.
Thanks,
Sean
> -----Original Message-----
> From: Steve Wise [mailto:swise at opengridcomputing.com]
> Sent: Tuesday, February 15, 2011 5:57 PM
> To: Hefty, Sean
> Cc: OpenFabrics EWG; Tziporet Koren
> Subject: Re: rping/cxgb3 regression
>
> I pulled it down, built/installed it on 2 nodes, then ran a bunch of
> rpings. No hangs. Looks good!
>
> Thanks Sean. Sorry about this.
>
> Steve.
>
> On 2/15/2011 7:46 PM, Hefty, Sean wrote:
> > I placed a 1.0.14.1 package on the ofa server in the downloads/rdmacm
> > section. Can you verify that it works? If so, I'll ask to pull it into
> > 1.5.3.
> >
> >> -----Original Message-----
> >> From: Steve Wise [mailto:swise at opengridcomputing.com]
> >> Sent: Tuesday, February 15, 2011 10:37 AM
> >> To: Hefty, Sean
> >> Cc: OpenFabrics EWG; Tziporet Koren
> >> Subject: Re: rping/cxgb3 regression
> >>
> >>
> >> On 02/15/2011 12:18 PM, Hefty, Sean wrote:
> >>>> I'm wondering if pulling the rping changes for ofed-1.5.3 would be ok? I
> >>>> guess to do this you would have to push a 1-off librdmacm without those
> >>>> changes? Or maybe back up what is in OFED-1.5.3 to the previous release
> >>>> without this rping change?
> >>>>
> >>>> Thoughts?
> >>> Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the issue?
> >>> (Btw, the author listed in my git tree is wrong.)
> >> Yes.
> >>
> >>> I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe reverting
> >>> this change and pushing out 1.0.14.1 would work. There's just one other
> >>> change after 1.0.14 at the moment, and it's to the build, so I'd skip a
> >>> full release for now.
> >>> Let me know if you think this would work.
> >>>
> >> I just tested that removing this from 1.0.14 will resolve the issue for
> >> 1.5.3.
> >>
> >>
> >>> - Sean
> >>>
> >>> ---
> >>>
> >>> librdmacm/rping: Make sure CQ event thread exits before destroying the CQ
> >>>
> >>> It is possible for the CQ event thread to poll the CQ after it has been
> >>> destroyed which can result in a seg fault on T3 interfaces. This patch
> >>> waits for the thread to exit before destroying the CQ.
> >>>
> >>> Signed-off-by: Steve Wise <swise at opengridcomputing.com>
> >>> Signed-off-by: Sean Hefty <sean.hefty at intel.com>
> >>>
> >>> diff --git a/examples/rping.c b/examples/rping.c
> >>> index 2d4c2de..ee292ec 100644
> >>> --- a/examples/rping.c
> >>> +++ b/examples/rping.c
> >>> @@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
> >>> ret = 0;
> >>>
> >>> if (wc.status) {
> >>> - if (wc.status != IBV_WC_WR_FLUSH_ERR) {
> >>> + if (wc.status != IBV_WC_WR_FLUSH_ERR)
> >>> fprintf(stderr,
> >>> "cq completion failed status %d\n",
> >>> wc.status);
> >>> - ret = -1;
> >>> - }
> >>> + ret = -1;
> >>> goto error;
> >>> }
> >>>
> >>> @@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
> >>> rping_test_server(cb);
> >>> rdma_disconnect(cb->child_cm_id);
> >>> + pthread_join(cb->cqthread, NULL);
> >>> rping_free_buffers(cb);
> >>> rping_free_qp(cb);
> >>> - pthread_cancel(cb->cqthread);
> >>> - pthread_join(cb->cqthread, NULL);
> >>> rdma_destroy_id(cb->child_cm_id);
> >>> free_cb(cb);
> >>> return NULL;
> >>> @@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
> >>>
> >>> rping_test_server(cb);
> >>> rdma_disconnect(cb->child_cm_id);
> >>> + pthread_join(cb->cqthread, NULL);
> >>> rdma_destroy_id(cb->child_cm_id);
> >>> err2:
> >>> rping_free_buffers(cb);
> >>> @@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
> >>>
> >>> rping_test_client(cb);
> >>> rdma_disconnect(cb->cm_id);
> >>> + pthread_join(cb->cqthread, NULL);
> >>> err2:
> >>> rping_free_buffers(cb);
> >>> err1:
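
The ordering the quoted patch is after (wait for the CQ event thread to exit, then destroy what it polls) can be sketched with plain pthreads. This is only an illustration under stated assumptions, not the rping code: `demo_cb`, `demo_worker`, `demo_run` and the `stop` flag are hypothetical names, no libibverbs or librdmacm calls are used, and a heap-allocated counter stands in for the CQ. Note that a bare `pthread_join()` returns only once the worker actually exits, so the worker needs a reliable exit signal. The sketch uses an explicit flag for that, which is an assumption of this example and not something the patch does.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for rping's control block; not the real struct rping_cb. */
struct demo_cb {
	pthread_t worker;   /* plays the role of cb->cqthread */
	atomic_int stop;    /* set by the owner to ask the worker to exit */
	int *resource;      /* plays the role of the CQ the worker polls */
	int polls;          /* how many times the worker touched the resource */
};

static void *demo_worker(void *arg)
{
	struct demo_cb *cb = arg;

	/* Keep touching the resource until the owner asks us to stop. */
	while (!atomic_load(&cb->stop)) {
		(*cb->resource)++;
		cb->polls++;
		sched_yield();
	}
	return NULL;
}

/* Returns 0 if the worker was joined before the resource was freed, -1 on error. */
int demo_run(void)
{
	struct demo_cb cb;
	int ok;

	atomic_init(&cb.stop, 0);
	cb.polls = 0;
	cb.resource = calloc(1, sizeof(*cb.resource));
	if (!cb.resource)
		return -1;

	if (pthread_create(&cb.worker, NULL, demo_worker, &cb) != 0) {
		free(cb.resource);
		return -1;
	}

	/* 1. Give the worker a reason to exit (assumption: an explicit flag). */
	atomic_store(&cb.stop, 1);

	/* 2. Wait until it has really exited. If the join fails, the worker may
	 *    still be running, so leave the resource alone rather than free it. */
	if (pthread_join(cb.worker, NULL) != 0)
		return -1;

	/* 3. Only now is it safe to inspect and destroy what the worker used. */
	ok = (*cb.resource == cb.polls);
	free(cb.resource);
	return ok ? 0 : -1;
}
```

Build with `-pthread` and call `demo_run()` from a `main()` of your own; it returns 0 when the join-then-free sequence completes.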