[ofa-general] rdma_cm connect / disconnect / reject race....resulting in crash....

rick richard.frank at oracle.com
Mon Sep 24 11:11:49 PDT 2007


Sean, per our discussion here's the problem description from Olaf...

"
We start to shut down the connection, and call rdma_destroy_qp
on our cm_id. We haven't executed rdma_destroy_id yet.

Now apparently a "connect reject" message comes in from the
other host, and cma_ib_handler() is called with an event
of IB_CM_REJ_RECEIVED. It calls cma_modify_qp_err, which
for some odd reason tries to modify the exact same QP we just
destroyed.

The crash looks like this:

RDS/IB: connection request while the connection exist: 11.0.0.18, disconnecting and reconnecting ic f7ccb800 ic->i_cm_id f7cb2a00
rdma_destroy_qp(f7cb2a00)
Unable to handle kernel NULL pointer dereference at virtual address 000000f8
....
EIP is at ib_modify_qp+0x5/0xe [ib_core]
....
Stack: 00000000 f7cb2a00 f8ac36af 00000006 00000000 1a0f4680 f6742e7c c011cc85
       c495ede0 f671ce30 c495ede0 c495ede0 00000086 c495ede0 c011d1a3 f671ce30
       f671ce30 00000002 c4966de0 00000002 00000000 c495ede0 00000001 00000001
Call Trace:
 [<f8ac36af>] cma_modify_qp_err+0x22/0x2d [rdma_cm]
[...]
 [<f8ac3371>] cma_disable_remove+0x35/0x3b [rdma_cm]
 [<f8ac3e31>] cma_ib_handler+0xe6/0x158 [rdma_cm]
 [<f89150f7>] cm_process_work+0x4a/0x80 [ib_cm]
 [<f8916c33>] cm_rej_handler+0xd3/0x114 [ib_cm]

It dies trying to dereference qp->device->modify_qp
because qp->device is NULL. If you check the stack, you'll see
the exact same cm_id (f7cb2a00) that we just called rdma_destroy_qp()
on. Note that the printk("rdma_destroy_qp") that appears above comes
*after* the call itself, so by the time this is printed, the QP
is dead already.
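
As an aside, the fault address 000000f8 (rather than 0) is consistent
with a member being read through a NULL struct pointer: the address
that faults is simply the member's offset. A small user-space sketch
of that, with a made-up layout. struct mock_device and its padding
are hypothetical and chosen only to reproduce the offset; this is not
the real struct ib_device.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical layout: the padding is invented so that the function
 * pointer lands at offset 0xf8. This is NOT the real struct ib_device.
 */
struct mock_device {
	char other_fields[0xf8];
	int (*modify_qp)(void *qp, void *attr, int mask);
};

/*
 * Address that a load of dev->modify_qp would touch if dev were NULL:
 * base address 0 plus the offset of the member.
 */
static size_t fault_address_for_null_dev(void)
{
	return offsetof(struct mock_device, modify_qp);
}
```

So a NULL qp->device combined with a modify_qp member sitting 0xf8
bytes into the device structure would produce exactly the address in
the oops.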

That's easy, I thought. Obviously, rdma_destroy_qp just forgets to
clear cm_id->qp after destroying the queue pair:

 void rdma_destroy_qp(struct rdma_cm_id *id)
 {
        ib_destroy_qp(id->qp);
+	id->qp = NULL;
 }

But that didn't really fix it. So either there's something else
going on which I don't grok yet, or this is just another case of
bad locking.
"
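
For illustration only, here is a user-space sketch of why clearing the
pointer by itself may not be enough, and what "fixing the locking"
could look like. Without a common lock the event handler can read
id->qp just before the teardown path frees it and then use the stale
pointer, so both the NULL check and the use have to sit under the same
lock as the destroy. Everything here (mock_cm_id, mock_qp, qp_lock,
MOCK_QPS_ERR) is a hypothetical stand-in and is not the actual rdma_cm
code or its locking scheme.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define MOCK_QPS_ERR 6	/* hypothetical "error" state value */

/* Hypothetical stand-ins for the kernel objects; not the ib_core types. */
struct mock_qp {
	int state;
};

struct mock_cm_id {
	pthread_mutex_t qp_lock;	/* assumed lock guarding the qp pointer */
	struct mock_qp *qp;
};

static void mock_id_init(struct mock_cm_id *id)
{
	assert(id != NULL);
	pthread_mutex_init(&id->qp_lock, NULL);
	id->qp = malloc(sizeof(*id->qp));
	if (id->qp)
		id->qp->state = 0;
}

/* Teardown path: free the QP and clear the pointer under the lock. */
static void mock_destroy_qp(struct mock_cm_id *id)
{
	pthread_mutex_lock(&id->qp_lock);
	free(id->qp);
	id->qp = NULL;
	pthread_mutex_unlock(&id->qp_lock);
}

/*
 * Event path (think: reject received). Checks and uses the QP while
 * holding the same lock, so it can never see a freed QP.
 * Returns 0 if the QP was moved to the error state, -1 if it was
 * already gone.
 */
static int mock_modify_qp_err(struct mock_cm_id *id)
{
	int ret = -1;

	pthread_mutex_lock(&id->qp_lock);
	if (id->qp) {
		id->qp->state = MOCK_QPS_ERR;
		ret = 0;
	}
	pthread_mutex_unlock(&id->qp_lock);
	return ret;
}
```

With this shape, a reject that arrives after mock_destroy_qp() just
gets -1 back instead of touching freed memory. Whether the real fix
belongs in rdma_destroy_qp(), in the cma event handler, or in the
caller is a separate question.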



