[openib-general] possible cma bug
Steve Wise
swise at opengridcomputing.com
Thu Jan 26 11:04:28 PST 2006
Sean,
While debugging some iwarp connection setup problems, I _might_ have
stumbled onto a cma bug.
I'm running the kernel cmatose. The server side gets a connect request,
but init_node() returns an error because the QP create fails. The
cmatose module then rejects the connect request from within that
upcall. Concurrently (on the main work thread running run_server()),
cmatose calls rdma_destroy_id() on the listening id. The destroy
happens before the connect request upcall thread finishes (SMP :).
Then as the other thread doing the connection request upcall unwinds the
stack and finishes processing in iw_conn_req_handler(), the system
Oopses in cma_release_remove() at line 1048 (with the iwarp cma patch).
I think the oops is because the listen_id was already destroyed, and
iw_conn_req_handler() didn't hold a reference to it. So the
cma_release_remove() code is touching freed memory.
I _think_ the solution is to bump the listen_id refcnt at the top of
cma_req_handler() and iw_conn_req_handler(), and do a cma_deref_id() on
the listen_id at the end of the functions.
I added this logic to the iwarp side and it appears to have fixed the
problem.
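
For illustration, here is a minimal userspace sketch of the get/put
pattern described above. It is not the actual cma code: struct model_id,
model_id_get(), model_id_put(), model_destroy_id() and
model_conn_req_handler() are hypothetical names made up for this
example, and the model simply frees the id when the last reference is
dropped, which may differ from how the cma tears an id down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cma's private id structure. */
struct model_id {
    atomic_int refcount;  /* starts at 1: the creator's reference */
    bool *freed;          /* test hook: set to true when the id is freed */
};

struct model_id *model_id_create(bool *freed)
{
    struct model_id *id = malloc(sizeof(*id));
    if (!id)
        return NULL;
    atomic_init(&id->refcount, 1);
    id->freed = freed;
    *freed = false;
    return id;
}

/* Take an extra reference (the "bump the refcnt" step). */
void model_id_get(struct model_id *id)
{
    atomic_fetch_add(&id->refcount, 1);
}

/* Drop a reference; whoever drops the last one frees the id. */
void model_id_put(struct model_id *id)
{
    int old = atomic_fetch_sub(&id->refcount, 1);
    assert(old > 0);
    if (old == 1) {
        *id->freed = true;
        free(id);
    }
}

/* The owner destroying the id: gives up the creator's reference. */
void model_destroy_id(struct model_id *id)
{
    model_id_put(id);
}

/*
 * Model of a connect request upcall on a listen id. The optional
 * callback stands in for whatever another thread does to the listen id
 * while the upcall is still running (here: destroying it).
 * Returns the refcount observed after that concurrent action.
 */
int model_conn_req_handler(struct model_id *listen_id,
                           void (*concurrent)(struct model_id *))
{
    int seen;

    model_id_get(listen_id);      /* pin the listen id for the upcall */
    if (concurrent)
        concurrent(listen_id);    /* e.g. the main thread's destroy */
    seen = atomic_load(&listen_id->refcount); /* still valid memory */
    model_id_put(listen_id);      /* matching deref at the end */
    return seen;
}
```

With the get/put pair in place, passing model_destroy_id as the
callback leaves the id alive until the handler's final put; without the
pair, the read after the callback would touch freed memory, which is
the situation described for cma_release_remove().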
Steve.