[openib-general] RE: core and ipoib questions and oops

Mon Sep 26 08:06:22 PDT 2005

The problem is at ipoib_multicast.c:223:
	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,

a. r14 contains mcast->dev:
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:216
    41a8:       4c 8b b7 f0 00 00 00    mov    0xf0(%rdi),%r14

NOTE THAT r14 is ZERO.  This implies that we still have a pointer to the
mcast structure, but the entire structure has been zeroed out (the code
only sets mcast->dev at mcast struct allocation time -- it never zeroes
out mcast->dev).  This could happen, for example, if mcast was freed,
then re-allocated and zeroed.

b. r13 contains priv (which is obtained via the netdev_priv macro) -- this
explains the 0x380 offset from NULL in r13:
include/linux/netdevice.h:488
    41b2:       4d 8d ae 80 03 00 00    lea    0x380(%r14),%r13
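
The register arithmetic in (a) and (b) can be checked against the faulting
address in the oops (CR2 = 0x388). The sketch below is only illustrative:
the 0x380 offset comes from the lea quoted above, while the 0x8 offset is my
reading of the first bytes of the oops "Code:" line (49 8b 7d 08, which
decodes to mov 0x8(%r13),%rdi) and is assumed to be the load of priv->dev.
The function and macro names are hypothetical and exist only for this example.

```c
#include <stdint.h>
#include <stdio.h>

/* Offset applied by netdev_priv() in this build: lea 0x380(%r14),%r13 */
#define PRIV_OFFSET     0x380u
/* Assumed offset of the faulting load: mov 0x8(%r13),%rdi (priv->dev) */
#define PRIV_DEV_OFFSET 0x8u

/* Value that ends up in r13 for a given mcast->dev (r14). */
uint64_t priv_addr(uint64_t dev_addr)
{
    return dev_addr + PRIV_OFFSET;
}

/* Address that the faulting load reads from for a given mcast->dev. */
uint64_t fault_addr(uint64_t dev_addr)
{
    return priv_addr(dev_addr) + PRIV_DEV_OFFSET;
}

/* Print the chain for mcast->dev == NULL, as seen in r14. */
void print_fault_chain(void)
{
    uint64_t dev = 0; /* r14 in the oops */

    printf("r14 (mcast->dev) = 0x%016llx\n", (unsigned long long)dev);
    printf("r13 (priv)       = 0x%016llx\n",
           (unsigned long long)priv_addr(dev));
    printf("faulting address = 0x%016llx\n",
           (unsigned long long)fault_addr(dev));
}
```

With mcast->dev equal to zero this gives 0x380 for r13 and 0x388 for the
faulting address, which matches R13 and CR2 in the oops.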

We can conclude that the mcast group was deleted, but an mcast completion
still got delivered from below by ib_mad (the thread which failed was
ib_mad1).

BTW -- this is the same sort of kernel oops that I sent to Roland on
20 Sep 2005 (we also saw 0x0000000000000380 in r13 there).

This might happen, for example, if, when the restart task was invoked,
wait_for_completion() terminated incorrectly in ipoib_mcast_stop_thread()
(ipoib_multicast.c:836), the multicast group was then freed
(ipoib_mcast_free() at ipoib_multicast.c:915), and finally a callback was
invoked after the free.
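
The suspected ordering can be replayed in user space. This is a minimal
sketch, not the driver code: struct fake_dev, struct fake_mcast and both
functions are hypothetical stand-ins, and the memset() only models "freed,
then re-allocated and zeroed" without actually touching freed memory. The
NULL check exists only so that the model can report the stale state; the
oops suggests the real callback path dereferences mcast->dev at this point.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the real kernel layouts. */
struct fake_dev {
    int id;
};

struct fake_mcast {
    struct fake_dev *dev; /* set once at allocation time */
    int query_pending;
};

/* Models the late join completion: 0 if mcast looks live, -1 if stale. */
int fake_join_complete(const struct fake_mcast *mcast)
{
    if (mcast->dev == NULL)
        return -1; /* where the real code would fault near NULL + offset */
    return 0;
}

/* Replays: allocate, zero the memory (models free + reuse), late callback. */
int replay_late_callback(void)
{
    struct fake_dev dev = { 1 };
    struct fake_mcast *mcast = malloc(sizeof(*mcast));
    int before, after;

    if (mcast == NULL)
        return -2;

    mcast->dev = &dev;
    mcast->query_pending = 1;
    before = fake_join_complete(mcast); /* callback while the group is live */

    /* Stand-in for the free followed by a zeroing re-allocation. */
    memset(mcast, 0, sizeof(*mcast));
    after = fake_join_complete(mcast);  /* callback delivered after the free */

    free(mcast);
    return (before == 0 && after == -1) ? 0 : 1;
}
```

replay_late_callback() returns 0 when the callback sees a valid dev before
the zeroing and a NULL dev afterwards, which is the state r14 shows in the
oops.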

Jack

> 
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il]
> Sent: Monday, September 26, 2005 4:08 PM
> To: Tziporet Koren; Jack Morgenstein
> Subject: Fwd: core and ipoib questions and oops
> 
> 
> Here's an oops I got recently.
> You might want to look into it.
> 
> ----- Forwarded message from "Michael S. Tsirkin" <mst at mellanox.co.il>
> -----
> 
> Subject: core and ipoib questions and oops
> Date: Mon, 26 Sep 2005 15:49:43 +0300
> From: "Michael S. Tsirkin" <mst at mellanox.co.il>
> 
> Two questions:
> 
> 1. Roland, looking at ipoib_multicast, I see
>                 if (mcast->query) {
>                         ib_sa_cancel_query(mcast->query_id, mcast->query);
>                         mcast->query = NULL;
>                         ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
>                                         IPOIB_GID_ARG(mcast->mcmember.mgid));
>                         wait_for_completion(&mcast->done);
>                 }
> 
> what prevents ipoib_mcast_join_complete from running
> at the same time and changing mcast->query after we've tested it?
> 
> 2. All, what happens in the core if I call ib_sa_cancel_query
> while the completion is running, or has already run?
> Is it possible that there's a bug that makes it possible for
> a completion callback to run twice in this case?
> 
> Thanks,
> MST
> 
> ---
> 
> The following oops happens on svn rev 3535.
> 
> #ifconfig ib0 down
> 
> Unable to handle kernel NULL pointer dereference at 0000000000000388
> RIP:
> <ffffffff88045204>{:ib_ipoib:ipoib_mcast_join_finish+100}
> PGD 172cd4067 PUD 172d16067 PMD 0
> Oops: 0000 [1] SMP
> CPU 0
> Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad
> ib_core
> Pid: 2399, comm: ib_mad1 Not tainted 2.6.13
> RIP: 0010:[<ffffffff88045204>]
> <ffffffff88045204>{:ib_ipoib:ipoib_mcast_join_finish+100}
> RSP: 0018:ffff81017348dc58  EFLAGS: 00010282
> RAX: 0000000074010000 RBX: 0000000000000000 RCX: 0000000000000010
> RDX: ffff810177d93380 RSI: ffff810177d93380 RDI: ffff810177d93380
> RBP: ffff810177d93380 R08: 0000000000000000 R09: ffff81017348dd38
> R10: ffff81017348ddf8 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000380 R14: 0000000000000000 R15: ffff810173484898
> FS:  0000000000000000(0000) GS:ffffffff8064b800(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000388 CR3: 0000000174725000 CR4: 00000000000006e0
> Process ib_mad1 (pid: 2399, threadinfo ffff81017348c000, task
> ffff8101773e07f0)
> Stack: 0000000100000092 0000000000000000 0000000000000096
> 0000000000000296
>        0000000000000296 ffffffff8028b8b0 0000000000000096
> ffff81017d1343c0
>        ffff810172fd50c0 ffff810172daba10
> Call Trace:<ffffffff8028b8b0>{dma_pool_free+272}
> <ffffffff88045cbb>{:ib_ipoib:ipoib_mcast_join_complete+43}
>        <ffffffff880021b5>{:ib_core:ib_unpack+198}
> <ffffffff8803bcc6>{:ib_sa:ib_sa_mcmember_rec_callback+64}
>        <ffffffff8803b49e>{:ib_sa:recv_handler+117}
> <ffffffff88010e83>{:ib_mad:ib_mad_completion_handler+949}
>        <ffffffff88010ace>{:ib_mad:ib_mad_completion_handler+0}
>        <ffffffff80143489>{worker_thread+478}
> <ffffffff8012f7da>{default_wake_function+0}
>        <ffffffff8012cff1>{__wake_up_common+64}
> <ffffffff8012f7da>{default_wake_function+0}
>        <ffffffff801473a8>{keventd_create_kthread+0}
> <ffffffff801432ab>{worker_thread+0}
>        <ffffffff801473a8>{keventd_create_kthread+0}
> <ffffffff801474d9>{kthread+204}
>        <ffffffff8010e352>{child_rip+8}
> <ffffffff801473a8>{keventd_create_kthread+0}
>        <ffffffff8014740d>{kthread+0} <ffffffff8010e34a>{child_rip+0}
> 
> 
> Code: 49 8b 7d 08 48 81 c7 b4 00 00 00 f3 a6 75 17 49 8b 45 70 8b
> RIP <ffffffff88045204>{:ib_ipoib:ipoib_mcast_join_finish+100} RSP
> <ffff81017348dc58>
> CR2: 0000000000000388
> 
> 
> -- 
> MST
> 
> ----- End forwarded message -----