[openib-general] Re: uCM kernel oops

Libor Michalek limichal at cisco.com
Thu Jul 21 12:05:50 PDT 2005


On Tue, Jul 19, 2005 at 11:10:56AM -0700, Arlin Davis wrote:
> Hi Libor,
> 
> I am running uCM and uAT with uDAPL and occasionally hit the following. 
> Can you take a look?
>
> Jul 19 11:10:18 iclust-19 kernel: UCM: Write. cmd <1> in <4> out <0> len <12>
> Jul 19 11:10:18 iclust-19 kernel: UCM: Event. CM ID <2> event <7>
> Jul 19 11:10:18 iclust-19 kernel: UCM: Destroyed CM ID <2>
> Jul 19 11:10:18 iclust-19 kernel: Unable to handle kernel paging request 
> <ffffffff880a10c8>{:ib_ucm:ib_ucm_ctx_put+120}
> Trace:<ffffffff880a183f>{:ib_ucm:ib_ucm_event_handler+1199}

  This looks like a race between the destroy command and a DREQ
received event. From looking at the code, ctx_put rechecks ctx->ref
only after releasing ctx_id_mutex, so two threads (cm event and
userspace) can call ctx_put at the same time, both see a zero
reference count, and both try to perform the final object destroy.
Is this easily reproducible? Can you try the following patch?

-Libor

Index: infiniband/core/ucm.c
===================================================================
--- infiniband/core/ucm.c	(revision 2886)
+++ infiniband/core/ucm.c	(working copy)
@@ -93,14 +93,15 @@
 	down(&ctx_id_mutex);
 
 	ctx->ref--;
-	if (!ctx->ref)
+	if (ctx->ref) {
+		up(&ctx_id_mutex);
+		return;
+	}
+	else
 		idr_remove(&ctx_id_table, ctx->id);
 
 	up(&ctx_id_mutex);
 
-	if (ctx->ref)
-		return;
-
 	down(&ctx->file->mutex);
 
 	list_del(&ctx->file_list);
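
For readers who want to see the locking pattern in isolation, here is a
minimal userspace sketch of the idea behind the patch, assuming pthreads
in place of the kernel semaphore. Every name in it (demo_ctx, demo_lock,
demo_ctx_put, demo_ctx_new, demo_put_thread, demo_destroy_count) is
hypothetical and is not taken from ucm.c. The point it illustrates is that
the "did the count reach zero?" decision is made while the lock is still
held, so only one caller can ever reach the teardown path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the ucm context; not the real structure. */
struct demo_ctx {
	int ref;       /* reference count, protected by demo_lock */
	int in_table;  /* stands in for the entry in the id table */
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_destroy_count;  /* number of teardowns, protected by demo_lock */

/* Allocate a context that starts with 'refs' references held. */
static struct demo_ctx *demo_ctx_new(int refs)
{
	struct demo_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	ctx->ref = refs;
	ctx->in_table = 1;
	return ctx;
}

/* Drop one reference. Returns 1 if this call did the final teardown. */
static int demo_ctx_put(struct demo_ctx *ctx)
{
	pthread_mutex_lock(&demo_lock);

	assert(ctx->ref > 0);
	ctx->ref--;
	if (ctx->ref) {
		/* Others still hold references: decide and leave under the lock. */
		pthread_mutex_unlock(&demo_lock);
		return 0;
	}
	ctx->in_table = 0;      /* analogous to removing the id from the table */
	demo_destroy_count++;

	pthread_mutex_unlock(&demo_lock);

	/* Only the caller that saw the count hit zero under the lock gets here. */
	free(ctx);
	return 1;
}

/* Thread entry point so two puts can run concurrently. */
static void *demo_put_thread(void *arg)
{
	demo_ctx_put(arg);
	return NULL;
}
```

In the pre-patch ordering, the second check of the count happened after
the unlock. A thread that dropped the count from 2 to 1 could then read
it again after another thread had dropped it to 0, and both would go on
to destroy the object.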


