[openib-general] [PATCH] rdma_cm oops: rdma_destroy_id

Michael S. Tsirkin mst at mellanox.co.il
Wed Mar 15 04:00:58 PST 2006


Sean, I am seeing the following Oops with CMA:

Unable to handle kernel NULL pointer dereference at 00000000000003f8 RIP:
<ffffffff8805f381>{:rdma_cm:cma_cancel_operation+89}
PGD 17e7b6067 PUD 17c427067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: ib_sdp rdma_cm ib_cm ib_local_sa findex ib_addr ib_ipoib
ib_sa ib_umad ib_mthca ib_mad ib_core
Pid: 8113, comm: a.out Not tainted 2.6.15 #4
RIP: 0010:[<ffffffff8805f381>]
<ffffffff8805f381>{:rdma_cm:cma_cancel_operation+89}
RSP: 0018:ffff81017c3bbe08  EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffff81017e35b400 RCX: 0000000000000000
RDX: 0000000000000246 RSI: 0000000000000246 RDI: ffffffff80480240
RBP: 0000000000000000 R08: 00000000fffffffe R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff81017b1150d0
R13: ffff8101796e9438 R14: ffff81017fc39a80 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff805e3880(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000003f8 CR3: 000000017e3ec000 CR4: 00000000000006e0
Process a.out (pid: 8113, threadinfo ffff81017c3ba000, task ffff81017da16200)
Stack: 0000000000000001 ffff81017a262c00 ffff81017e35b400 ffffffff8805f512
       0000003000000018 0000000000000246 0000000000000246 ffffffff803695f0
       ffffffff8049e680 ffffffff80275014
Call Trace:<ffffffff8805f512>{:rdma_cm:rdma_destroy_id+31}
<ffffffff803695f0>{lock_sock+181}
       <ffffffff80275014>{extract_entropy+75}
<ffffffff880664d1>{:ib_sdp:sdp_close+116}
       <ffffffff803a7ca0>{inet_release+75} <ffffffff8036744b>{sock_release+23}
       <ffffffff80367e64>{sock_close+44} <ffffffff8017b2a4>{__fput+155}
       <ffffffff80178a10>{filp_close+91} <ffffffff80178aa6>{sys_close+142}
       <ffffffff8010f92a>{system_call+126}

Code: 0f b6 80 f8 03 00 00 83 f8 01 72 0a 83 f8 03 76 0f 83 f8 04
RIP <ffffffff8805f381>{:rdma_cm:cma_cancel_operation+89} RSP <ffff81017c3bbe08>
CR2: 00000000000003f8

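Apparently, if I call rdma_destroy_id after calling rdma_resolve_addr
but before the address has been resolved, id_priv->id.device is still NULL.
As a result, cma_cancel_addr (presumably inlined into cma_cancel_operation
in the trace above) crashes when it dereferences the device to get the
node type. The faulting address 0x3f8 with RAX = 0 is consistent with
reading a field at that offset from a NULL pointer.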
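
To make the failure mode concrete, here is a minimal user-space sketch of
the two cancel strategies. It is not the kernel code: the mock_* types and
functions are hypothetical stand-ins, and cancel_addr_old returns -1 at the
point where the real code dereferences the NULL device instead of checking it.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; all names here are
 * illustrative assumptions, not the real rdma_cm definitions. */
enum mock_node_type { MOCK_NODE_IB = 1 };

struct mock_device {
	enum mock_node_type node_type;
};

struct mock_id {
	struct mock_device *device; /* NULL until address resolution completes */
	int addr_pending;           /* 1 while a resolve request is outstanding */
};

/* Stand-in for rdma_addr_cancel(): drops the outstanding request. */
static void mock_addr_cancel(struct mock_id *id)
{
	id->addr_pending = 0;
}

/* Old logic: picks the cancel path from the device's node type.
 * Returns -1 where the kernel code would oops on a NULL device. */
static int cancel_addr_old(struct mock_id *id)
{
	if (id->device == NULL)
		return -1; /* the kernel code has no such check here */
	if (id->device->node_type == MOCK_NODE_IB)
		mock_addr_cancel(id);
	return 0;
}

/* Patched logic: cancel unconditionally, never touching the device. */
static int cancel_addr_new(struct mock_id *id)
{
	mock_addr_cancel(id);
	return 0;
}
```

With device == NULL and a resolve still pending, cancel_addr_old cannot
proceed and leaves the request outstanding, while cancel_addr_new cancels
it without looking at the device at all.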

There is something else I'd like clarified: rdma_destroy_id only starts
the destroy process, which then completes asynchronously. What is the best
way to know that it has finished? Also, when is it safe to call
rdma_destroy_qp?

The following works for me:

---

Since we always start address resolution with rdma_resolve_ip, we should
always cancel it with rdma_addr_cancel - we can't switch on device type
since we don't yet know what device we will use.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

Index: linux-2.6.15/drivers/infiniband/core/cma.c
===================================================================
--- linux-2.6.15.orig/drivers/infiniband/core/cma.c	2006-03-15 16:39:43.000000000 +0200
+++ linux-2.6.15/drivers/infiniband/core/cma.c	2006-03-15 16:40:05.000000000 +0200
@@ -553,13 +553,7 @@ static int cma_notify_user(struct rdma_i
 
 static void cma_cancel_addr(struct rdma_id_private *id_priv)
 {
-	switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
-	case RDMA_TRANSPORT_IB:
-		rdma_addr_cancel(&id_priv->id.route.addr.dev_addr);
-		break;
-	default:
-		break;
-	}
+	rdma_addr_cancel(&id_priv->id.route.addr.dev_addr);
 }
 
 static void cma_cancel_route(struct rdma_id_private *id_priv)

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


