[ofa-general] Re: RDMA/iwarp CM question
Kanoj Sarcar
kanojsarcar at yahoo.com
Wed Sep 12 12:49:42 PDT 2007
--- Steve Wise <swise at opengridcomputing.com> wrote:
>
>
> Kanoj Sarcar wrote:
> > Response to original mail did not come to me, but I see it in the
> > archives, responding back to the archived response. Please reply
> > all on your responses.
> >
>
> I did reply to all. My outgoing folder shows that it went to both of
> your addresses...
>
Hmmm, this mail arrived in my Yahoo bulk folder; that might
have happened with the last one too. I probably
overlooked it, sorry.
>
> > If the driver detaches the incoming (child) connection request from
> > the listener at the point of sending the IW_CM_EVENT_CONNECT_REQUEST
> > upcall, then for on-card connection cleanup and child state cleanup
> > in the driver, OFA must guarantee that an accept/reject downcall will
> > be made in the future.
>
> Or you can time it out in your driver.
>
See below.
> >
> > I don't believe that guarantee currently exists. There is exactly one
> > failure point in the call chain
> > cm_work_handler():process_event():cm_conn_req_handler()
> > at which the driver reject interface is invoked, but at multiple other
> > failure points this is not done.
> >
> > Also, looking at ucma.c, on destruction of a listener, I believe
> > ucma_cleanup_events() will go around killing all pending
> > IW_CM_EVENT_CONNECT_REQUEST requests, so the app will never get a
> > chance to do the accept/reject.
> >
>
> It looks to me like ucma_cleanup_events() calls rdma_destroy_id() /
> iw_destroy_cm_id() / destroy_cm_id(), which calls the provider reject
> function. Or NOT! :) There's a comment in the IW_CM_STATE_CONN_RECV
> case inside destroy_cm_id():
>
> > /*
> >  * App called destroy before/without calling accept after
> >  * receiving connection request event notification or
> >  * returned non zero from the event callback function.
> >  * In either case, must tell the provider to reject.
> >  */
>
> But I don't see the call to reject the connection...
>
> Maybe you could add it and see if it clears up your issue?
I haven't hit a problem yet; I am looking at what my
driver should and should not do.
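
To make the missing step concrete, here is a minimal user-space sketch of
what the comment in destroy_cm_id() describes: destroying an id that is
still waiting for accept/reject should result in one call to the provider's
reject hook. Every name below (model_cm_id, model_provider_reject, and so
on) is hypothetical and only models the idea; it is not the iw_cm code and
it ignores locking.

```c
#include <assert.h>

/* Hypothetical user-space model; none of these names are the real iw_cm API. */
enum model_state {
    MODEL_IDLE,
    MODEL_CONN_RECV,    /* connect request delivered, no accept/reject yet */
    MODEL_ESTABLISHED,
    MODEL_DESTROYING
};

struct model_cm_id {
    enum model_state state;
    int provider_rejects;   /* counts calls into the provider reject hook */
};

/* Stand-in for the provider's reject entry point (assumed, not real). */
static void model_provider_reject(struct model_cm_id *id)
{
    id->provider_rejects++;
}

/* Destroy path: a pending connect request is rejected before teardown. */
static void model_destroy_cm_id(struct model_cm_id *id)
{
    switch (id->state) {
    case MODEL_CONN_RECV:
        /* App never accepted or rejected, so tell the provider to reject. */
        id->state = MODEL_DESTROYING;
        model_provider_reject(id);
        break;
    default:
        /* Other states need no reject in this simplified model. */
        id->state = MODEL_DESTROYING;
        break;
    }
}
```

With this model, destroying an id in MODEL_CONN_RECV bumps the reject
counter exactly once, while destroying an established id leaves it at zero.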
>
>
> > Doesn't this sound like a problem (namely provider/card resource
> > leak due to races with listener destruct)?
> >
>
> It does.
>
> But MPA mandates a timeout so the connections will get aborted
> eventually by the provider or peer...
>
I believe the timeout you are talking about limits how long the
responder side may take from an incoming SYN to receipt of the
complete MPA request. I don't believe there is much logic in having a
timeout between the incoming-connect upcall sent by the driver and an
eventual accept/reject done by the app, but that's a separate
discussion.
The core problem is this, though. On a listener destruct, the driver
can do one of two things:

a. Destroy all children on which an accept/reject has not yet been
   invoked; the OFA stack must then stop the app from sending an
   accept/reject down in that case. There is currently an attempt to
   do this at the ucma layer (e.g. cleaning up unpolled events), but
   it is not race free.

b. Rely on an OFA guarantee that an eventual accept/reject downcall
   will be made, and use that downcall to prevent resource leakage.

Any other solution will have some problem somewhere. E.g., with your
timeout suggestion, if the driver goes ahead and cleans up the state
and on-card resources for the child, then due to the race mentioned
in a) above the app might still succeed in making an eventual
accept/reject, leading to a kernel crash.
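
To illustrate what "race free" would have to mean for option a), here is a
minimal user-space sketch using C11 atomics. It assumes a per-child state
word that accept, reject, and listener destruct all try to claim with a
single compare-and-swap, so exactly one of them wins and the losers get an
error instead of touching torn-down state. All names (child_req,
child_claim, and so on) are hypothetical; this is not how the OFA stack or
any driver does it today.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of one child connection request; not a real OFA type. */
enum child_state {
    CHILD_PENDING,      /* upcall sent, no decision yet */
    CHILD_ACCEPTED,
    CHILD_REJECTED,
    CHILD_DESTROYED     /* claimed by listener destruct */
};

struct child_req {
    atomic_int state;   /* holds an enum child_state value */
};

/* Move PENDING -> next exactly once; 0 on success, -1 if already decided. */
static int child_claim(struct child_req *c, enum child_state next)
{
    int expected = CHILD_PENDING;

    return atomic_compare_exchange_strong(&c->state, &expected, (int)next)
               ? 0 : -1;
}

static int child_accept(struct child_req *c)
{
    return child_claim(c, CHILD_ACCEPTED);
}

static int child_reject(struct child_req *c)
{
    return child_claim(c, CHILD_REJECTED);
}

/* Listener destruct: claim the child so a late accept/reject fails cleanly. */
static int child_listener_destroy(struct child_req *c)
{
    return child_claim(c, CHILD_DESTROYED);
}
```

The sketch only covers who wins the decision. A real implementation would
also have to keep the child object alive (for example with a reference
count) until every loser has returned, otherwise the late accept/reject
would still dereference freed memory.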
> But I think you've found a bug...
>
> Steve.
>
Are folks filing bugs in bugzilla or similar?
Thanks.
Kanoj