[openib-general] Re: RDMA Generic Connection Management

Wed Aug 31 09:17:17 PDT 2005

    James> The device could still be used after it's gone. For
    James> example:

    James>  - the user is configuring SRP via sysfs. The thread in
    James> srp_create_target() has just called ib_sa_path_rec_get()
    James> [srp.c line 1209] and is waiting for the path record query
    James> to complete in wait_for_completion(). The SA callback,
    James> srp_path_rec_completion(), is then called. This callback
    James> thread will make several verb calls (ib_create_cq,
    James> ib_req_notify_cq, ib_create_qp, ...) without any
    James> coordination with the hotplug device removal callback,
    James> srp_remove_one().

I don't think this can happen.  How could srp_remove_one get past

		wait_for_completion(&host->released);

if the sysfs file is still in use?
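
As a rough illustration of that argument, here is a single-threaded
userspace model, not the actual srp.c code. It assumes that
host->released is completed only when the last reference on the host's
class device is dropped, and that a sysfs access in progress holds such
a reference. All model_* names are hypothetical, and "blocking" is
represented by a boolean rather than a real wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the SRP host; not the real struct srp_host. */
struct model_host {
	int  users;     /* outstanding references, e.g. a sysfs write in progress */
	bool released;  /* models complete(&host->released) having fired */
};

static void model_host_init(struct model_host *h)
{
	h->users = 1;          /* base reference held while the device is registered */
	h->released = false;
}

static void model_host_get(struct model_host *h)
{
	assert(!h->released);  /* taking a reference after release would be a bug */
	h->users++;
}

/* Dropping the last reference is what signals the completion (assumption). */
static void model_host_put(struct model_host *h)
{
	assert(h->users > 0);
	if (--h->users == 0)
		h->released = true;
}

/* Models whether wait_for_completion(&host->released) would return. */
static bool model_remove_can_proceed(const struct model_host *h)
{
	return h->released;
}

/*
 * Scenario: removal starts while a sysfs write is still in progress.
 * Returns 1 if removal is held off until the write has finished.
 */
static int model_removal_waits_for_sysfs(void)
{
	struct model_host h;

	model_host_init(&h);
	model_host_get(&h);                        /* sysfs write begins */
	model_host_put(&h);                        /* removal drops the base reference */
	bool blocked = !model_remove_can_proceed(&h);
	model_host_put(&h);                        /* sysfs write ends */
	bool done = model_remove_can_proceed(&h);

	return blocked && done;
}
```

In this model the removal path cannot get past its wait while the sysfs
reference is outstanding, which is the property the question above
relies on. Whether the real code holds that reference for the whole of
srp_create_target() is the assumption to check.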

    James> Notice that if the SA client's hotplug removal function,
    James> ib_sa_remove_one(), ensured that all callbacks had
    James> completed before returning, the problem would be fixed. This
    James> would protect all ULPs from having to deal with hotplug
    James> races in their SA callback function. The fix belongs in the
    James> SA client (the core stack), not in SRP.

All SA client callbacks are driven by the MAD layer.  And
ib_sa_remove_one() does ib_unregister_mad_agent(), which should wait
for all callbacks to finish.  So I think we already do the best we can
here.  Unfortunately the SA client code must clean up after all the
ULPs that depend on it, because ULPs can use the SA up until they know
the device is gone.  But I don't see a way around that.
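
The drain-on-unregister behaviour described above can be sketched in
the same way. This is a single-threaded userspace model under the
assumption that unregistering a MAD agent first stops new callbacks
from being dispatched and then waits for callbacks already running; it
is not the MAD layer's implementation, and every model_* name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a registered MAD agent. */
struct model_agent {
	bool registered;  /* cleared as soon as unregistration begins */
	int  in_flight;   /* callbacks currently executing */
};

/* Dispatch is refused once unregistration has begun. */
static bool model_callback_begin(struct model_agent *a)
{
	if (!a->registered)
		return false;
	a->in_flight++;
	return true;
}

static void model_callback_end(struct model_agent *a)
{
	assert(a->in_flight > 0);
	a->in_flight--;
}

static void model_unregister_begin(struct model_agent *a)
{
	a->registered = false;
}

/* Models whether the unregister call would be allowed to return. */
static bool model_unregister_done(const struct model_agent *a)
{
	return !a->registered && a->in_flight == 0;
}

/*
 * Scenario: an SA callback is running when the device is removed.
 * Returns 1 if unregistration waits for it and admits no new callbacks.
 */
static int model_unregister_drains_callbacks(void)
{
	struct model_agent a = { .registered = true, .in_flight = 0 };

	bool started = model_callback_begin(&a);   /* SA callback is running */
	model_unregister_begin(&a);                /* removal path starts */
	bool early = model_unregister_done(&a);    /* must still be waiting */
	bool late_cb = model_callback_begin(&a);   /* new callback is refused */
	model_callback_end(&a);                    /* running callback finishes */
	bool done = model_unregister_done(&a);

	return started && !early && !late_cb && done;
}
```

The model only covers callbacks the MAD layer itself dispatches. It says
nothing about a ULP that issues a new SA query after the SA client has
been torn down, which is the part that still needs cleanup as noted
above.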

    James> All the ULPs are deficient with respect to their hotplug
    James> synchronization. Given that there is a common problem,
    James> doesn't it make sense to try and solve it in a generic way
    James> instead of in each ULP?

Yes, but what is the generic way?

 - R.