[openib-general] core and ipoib questions and oops

Michael S. Tsirkin mst at mellanox.co.il
Mon Sep 26 05:49:43 PDT 2005


Two questions:

1. Roland, looking at ipoib_multicast, I see:

                if (mcast->query) {
                        ib_sa_cancel_query(mcast->query_id, mcast->query);
                        mcast->query = NULL;
                        ipoib_dbg_mcast(priv, "waiting for MGID " IPOIB_GID_FMT "\n",
                                        IPOIB_GID_ARG(mcast->mcmember.mgid));
                        wait_for_completion(&mcast->done);
                }

What prevents ipoib_mcast_join_complete from running
at the same time and changing mcast->query after we've tested it?
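
To make the concern concrete, here is a minimal user-space sketch (C11, not
the ipoib code) of one way to close a test-then-act window: both the stop path
and the completion path claim the query pointer with a single atomic exchange,
so only one of them ever sees it non-NULL. The names struct mcast_model,
claim_query, stop_path and completion_path are hypothetical, and the model runs
the two paths one after the other rather than on separate threads. Whether
anything like this is needed in ipoib depends on locking that the snippet
above does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real ipoib or ib_sa types. */
struct query { int id; };

struct mcast_model {
        _Atomic(struct query *) query;  /* non-NULL while a query is outstanding */
        int cancels;                    /* times the stop path acted on the query */
        int completions;                /* times the completion path acted on it */
};

/* Atomically take ownership of the outstanding query, if there is one. */
static struct query *claim_query(struct mcast_model *m)
{
        return atomic_exchange(&m->query, (struct query *)NULL);
}

/* Models the stop path: it cancels only if it won the claim. */
static int stop_path(struct mcast_model *m)
{
        struct query *q = claim_query(m);

        if (!q)
                return 0;       /* completion already took the query */
        m->cancels++;           /* the cancel call would go here */
        return 1;
}

/* Models the completion path: it finishes only if it won the claim. */
static int completion_path(struct mcast_model *m)
{
        struct query *q = claim_query(m);

        if (!q)
                return 0;       /* stop path already took the query */
        m->completions++;       /* join-finish processing would go here */
        return 1;
}

/* Runs both paths against one outstanding query; returns how many acted. */
static int run_model(void)
{
        struct query q = { 1 };
        struct mcast_model m = { .cancels = 0, .completions = 0 };

        atomic_init(&m.query, &q);
        completion_path(&m);
        stop_path(&m);
        assert(m.cancels + m.completions == 1);
        return m.cancels + m.completions;
}
```

Calling run_model() exercises the completion-first ordering; swapping the two
calls gives the cancel-first ordering with the same total. The sketch does not
model wait_for_completion, which the stop path would still need.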

2. All, what happens in the core if I call ib_sa_cancel_query
while the completion is running, or has already run?
Could there be a bug that lets a completion callback
run twice in this case?

Thanks,
MST

---

The following oops happens on svn rev 3535.

#ifconfig ib0 down

Unable to handle kernel NULL pointer dereference at 0000000000000388 RIP:
<ffffffff88045204>{:ib_ipoib:ipoib_mcast_join_finish+100}
PGD 172cd4067 PUD 172d16067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: ib_sdp ib_cm ib_ipoib ib_sa ib_umad ib_mthca ib_mad ib_core
Pid: 2399, comm: ib_mad1 Not tainted 2.6.13
RIP: 0010:[<ffffffff88045204>] <ffffffff88045204>{:ib_ipoib:ipoib_mcast_join_finish+100}
RSP: 0018:ffff81017348dc58  EFLAGS: 00010282
RAX: 0000000074010000 RBX: 0000000000000000 RCX: 0000000000000010
RDX: ffff810177d93380 RSI: ffff810177d93380 RDI: ffff810177d93380
RBP: ffff810177d93380 R08: 0000000000000000 R09: ffff81017348dd38
R10: ffff81017348ddf8 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000380 R14: 0000000000000000 R15: ffff810173484898
FS:  0000000000000000(0000) GS:ffffffff8064b800(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000388 CR3: 0000000174725000 CR4: 00000000000006e0
Process ib_mad1 (pid: 2399, threadinfo ffff81017348c000, task ffff8101773e07f0)
Stack: 0000000100000092 0000000000000000 0000000000000096 0000000000000296
       0000000000000296 ffffffff8028b8b0 0000000000000096 ffff81017d1343c0
       ffff810172fd50c0 ffff810172daba10
Call Trace:<ffffffff8028b8b0>{dma_pool_free+272} <ffffffff88045cbb>{:ib_ipoib:ipoib_mcast_join_complete+43}
       <ffffffff880021b5>{:ib_core:ib_unpack+198} <ffffffff8803bcc6>{:ib_sa:ib_sa_mcmember_rec_callback+64}
       <ffffffff8803b49e>{:ib_sa:recv_handler+117} <ffffffff88010e83>{:ib_mad:ib_mad_completion_handler+949}
       <ffffffff88010ace>{:ib_mad:ib_mad_completion_handler+0}
       <ffffffff80143489>{worker_thread+478} <ffffffff8012f7da>{default_wake_function+0}
       <ffffffff8012cff1>{__wake_up_common+64} <ffffffff8012f7da>{default_wake_function+0}
       <ffffffff801473a8>{keventd_create_kthread+0} <ffffffff801432ab>{worker_thread+0}
       <ffffffff801473a8>{keventd_create_kthread+0} <ffffffff801474d9>{kthread+204}
       <ffffffff8010e352>{child_rip+8} <ffffffff801473a8>{keventd_create_kthread+0}
       <ffffffff8014740d>{kthread+0} <ffffffff8010e34a>{child_rip+0}


Code: 49 8b 7d 08 48 81 c7 b4 00 00 00 f3 a6 75 17 49 8b 45 70 8b
RIP <ffffffff88045204>{:ib_ipoib:ipoib_mcast_join_finish+100} RSP <ffff81017348dc58>
CR2: 0000000000000388

It seems to oops at 0xda4 (ipoib_mcast_join_finish+100) in this function;
only the start of the disassembly is shown:
0000000000000d40 <ipoib_mcast_join_finish>:
ipoib_mcast_join_finish():
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:215
     d40:       41 56                   push   %r14
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:223
     d42:       b9 10 00 00 00          mov    $0x10,%ecx
     d47:       fc                      cld
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:215
     d48:       41 55                   push   %r13
     d4a:       41 54                   push   %r12
     d4c:       55                      push   %rbp
     d4d:       48 89 fd                mov    %rdi,%rbp
     d50:       53                      push   %rbx
     d51:       48 83 ec 60             sub    $0x60,%rsp
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:220
     d55:       48 8b 06                mov    (%rsi),%rax

It seems most likely that dev is NULL in the following:

static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
                                   struct ib_sa_mcmember_rec *mcmember)
{
        struct net_device *dev = mcast->dev;
        struct ipoib_dev_priv *priv = netdev_priv(dev);
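
One way to check this guess against the register dump, assuming the Code line
starts at the faulting instruction: its first bytes, 49 8b 7d 08, decode to
mov 0x8(%r13),%rdi, R13 is 0x380, and CR2 is 0x388. If netdev_priv() returns
dev plus a fixed offset, a NULL dev would produce exactly such a small
non-NULL priv. The sketch below works through that arithmetic in user space.
The offset 0x380 is an assumption read off R13, not taken from the 2.6.13
headers, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed aligned size of struct net_device in this build, inferred from R13. */
#define ASSUMED_PRIV_OFFSET 0x380u

/* Models netdev_priv(): the private area sits at a fixed offset past dev. */
static uintptr_t model_netdev_priv(uintptr_t dev)
{
        return dev + ASSUMED_PRIV_OFFSET;
}

/* Address read by "mov 0x8(%r13),%rdi" when r13 holds priv. */
static uintptr_t faulting_address(uintptr_t dev)
{
        return model_netdev_priv(dev) + 0x8;
}

/* Compares the model for dev == NULL with the values in the oops. */
static void check_against_oops(void)
{
        assert(model_netdev_priv(0) == 0x380);  /* matches R13 */
        assert(faulting_address(0) == 0x388);   /* matches CR2 */
}
```

Under these assumptions a NULL dev reproduces both R13 and CR2, which is
consistent with the guess but does not rule out other causes.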



-- 
MST


