[openib-general] Re: uCM kernel oops
Libor Michalek
limichal at cisco.com
Thu Jul 21 12:05:50 PDT 2005
On Tue, Jul 19, 2005 at 11:10:56AM -0700, Arlin Davis wrote:
> Hi Libor,
>
> I am running uCM and uAT with uDAPL and occasionally hit the following.
> Can you take a look?
>
> Jul 19 11:10:18 iclust-19 kernel: UCM: Write. cmd <1> in <4> out <0> len <12>
> Jul 19 11:10:18 iclust-19 kernel: UCM: Event. CM ID <2> event <7>
> Jul 19 11:10:18 iclust-19 kernel: UCM: Destroyed CM ID <2>
> Jul 19 11:10:18 iclust-19 kernel: Unable to handle kernel paging request
> <ffffffff880a10c8>{:ib_ucm:ib_ucm_ctx_put+120}
> Trace:<ffffffff880a183f>{:ib_ucm:ib_ucm_event_handler+1199}
Looks like a race between the destroy command and a DREQ received
event. From looking at the code, it looks like it's possible for
two threads (cm event and userspace) to call ctx_put at the same
time and both try to perform the final zero-reference object destroy.
Is this easily reproducible? Can you try the following patch?
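
To make the interleaving concrete, here is a minimal single-threaded replay
of the two flows as I read them from the diff. It is only a sketch: struct
ctx is cut down to the counter, the mutex is implied rather than taken, and
the helper names (old_locked_part, old_unlocked_check, new_locked_part,
old_teardowns, new_teardowns) are made up for illustration and do not exist
in ucm.c.

```c
/* Hypothetical, stripped-down stand-in for the real context object. */
struct ctx {
	int ref;
};

/* Old flow, step 1: the part that ran while holding ctx_id_mutex. */
static void old_locked_part(struct ctx *ctx)
{
	ctx->ref--;
}

/* Old flow, step 2: the check that ran after up(&ctx_id_mutex). */
static int old_unlocked_check(const struct ctx *ctx)
{
	return ctx->ref == 0;	/* re-reads shared state without the lock */
}

/* Patched flow: the decrement and the decision share one locked region. */
static int new_locked_part(struct ctx *ctx)
{
	ctx->ref--;
	return ctx->ref == 0;
}

/* Replay: A and B both finish the locked part before either one checks. */
static int old_teardowns(void)
{
	struct ctx ctx = { 2 };

	old_locked_part(&ctx);			/* A: ref 2 -> 1 */
	old_locked_part(&ctx);			/* B: ref 1 -> 0 */
	return old_unlocked_check(&ctx)		/* A sees 0 */
	     + old_unlocked_check(&ctx);	/* B sees 0 as well */
}

/* Same two puts, but each caller decides inside its own locked region. */
static int new_teardowns(void)
{
	struct ctx ctx = { 2 };
	int a = new_locked_part(&ctx);		/* A: ref 2 -> 1, not last */
	int b = new_locked_part(&ctx);		/* B: ref 1 -> 0, last */

	return a + b;
}
```

In this replay old_teardowns() counts two callers proceeding to the final
destroy, while new_teardowns() counts exactly one, which is the behavior
the patch is meant to guarantee.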
-Libor
Index: infiniband/core/ucm.c
===================================================================
--- infiniband/core/ucm.c (revision 2886)
+++ infiniband/core/ucm.c (working copy)
@@ -93,14 +93,15 @@
 	down(&ctx_id_mutex);
 	ctx->ref--;
-	if (!ctx->ref)
+	if (ctx->ref) {
+		up(&ctx_id_mutex);
+		return;
+	}
+	else
 		idr_remove(&ctx_id_table, ctx->id);
 	up(&ctx_id_mutex);
-	if (ctx->ref)
-		return;
-
 	down(&ctx->file->mutex);
 	list_del(&ctx->file_list);