[openib-general] oops on module teardown (was Re: recursion depth exceeded in ipoib_workqueue)

Jack Morgenstein jackm at mellanox.co.il
Tue Sep 20 08:25:17 PDT 2005


I tested your recursion patch on SVN 3487, and it works.  However, while
testing it, I got the kernel oops shown below while unloading the driver.
It looks like a race condition (note that this is in the send-timeout
flow).

From the disassembly of ib_ipoib.ko (no line-debug info, unfortunately), the
failure is at address 5360:
    534c:       48 89 95 b0 00 00 00    mov    %rdx,0xb0(%rbp)
    5353:       f0 ff 0d 00 00 00 00    lock decl 0(%rip)        # 535a <ipoib_mcast_join_complete+0x1fa>
    535a:       0f 88 d9 03 00 00       js     5739 <.text.lock.ipoib_multicast+0x50>
    5360:       41 8b 45 10             mov    0x10(%r13),%eax
    5364:       a8 20                   test   $0x20,%al
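
As a cross-check, address 5360 lines up with the RIP reported in the oops
log further down (ipoib_mcast_join_complete+512). The disassembler labels
535a as ipoib_mcast_join_complete+0x1fa, which would put the start of the
function at 0x5160 in the object file, and 0x5160 + 512 is 0x5360. A minimal
sketch of that arithmetic in C follows; the helper names are made up for
illustration and are not part of any driver or tool.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: start address of a function, given one address
 * inside it and that address's offset from the function symbol. */
uint64_t func_start(uint64_t addr, uint64_t offset_in_func)
{
    return addr - offset_in_func;
}

/* Hypothetical helper: object-file address for a "symbol+offset" pair. */
uint64_t addr_of(uint64_t start, uint64_t offset)
{
    return start + offset;
}

/* Check that "ipoib_mcast_join_complete+512" from the oops maps to the
 * disassembly address 0x5360, using the 535a / +0x1fa annotation. */
void check_oops_rip(void)
{
    uint64_t start = func_start(0x535a, 0x1fa);

    assert(start == 0x5160);
    assert(addr_of(start, 512) == 0x5360);
}
```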

I traced this to ipoib_multicast.c:434 (in ipoib_mcast_join_complete):
	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))

The failure is in dereferencing "priv->flags" (this is the code at address
5360).  "priv" here is "netdev_priv(dev)", implying that "netdev_priv(dev)"
is no longer valid and returns garbage, which then gets dereferenced.
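
The register dump in the oops log is consistent with this reading: R13 holds
0x380, the faulting instruction loads from 0x10(%r13), and CR2 reports 0x390.
A value that small is not a plausible kernel pointer. One possible
explanation, which is an assumption and not something the trace proves, is
that dev itself was NULL, since netdev_priv() returns an address at a fixed
offset past the net_device. R12 (0xffffff92) reads as -110 when taken as a
signed 32-bit value, which is -ETIMEDOUT on Linux and fits the send-timeout
flow. A small C sketch of these checks, with function names that are
illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of "mov disp(%reg),%eax": register value plus
 * the signed displacement. */
uint64_t effective_address(uint64_t base_reg, int32_t disp)
{
    return base_reg + (uint64_t)(int64_t)disp;
}

/* Bit index selected by a single-bit mask such as the one in
 * "test $0x20,%al"; returns -1 if the mask is not a single bit. */
int bit_index(uint32_t mask)
{
    int idx = 0;

    if (mask == 0 || (mask & (mask - 1)) != 0)
        return -1;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        idx++;
    }
    return idx;
}

/* Low 32 bits of a register read as a signed int, which is how an
 * int status argument would be interpreted. */
int32_t as_status(uint64_t reg)
{
    return (int32_t)(uint32_t)reg;
}
```

With the values from the log, effective_address(0x380, 0x10) gives 0x390
(the CR2 value), bit_index(0x20) gives 5, and as_status(0xffffff92) gives
-110.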

environment:
Host 1 Port 1 connected back-to-back to Host 2 Port 1.

Host 1: while date; do /etc/init.d/openibd start ; /etc/init.d/openibd stop ; done
Host 2: runs opensm.

Jack
============================================================================

Sep 20 12:05:30 swlab163 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000390 RIP:
Sep 20 12:05:30 swlab163 kernel: <ffffffff8807a360>{:ib_ipoib:ipoib_mcast_join_complete+512}
Sep 20 12:05:30 swlab163 kernel: PGD 777d2067 PUD 773ca067 PMD 0
Sep 20 12:05:30 swlab163 kernel: Oops: 0000 [1] SMP
Sep 20 12:05:30 swlab163 kernel: CPU 0
Sep 20 12:05:30 swlab163 kernel: Modules linked in: ib_ipoib ib_sa ib_uverbs ib_umad ib_mthca ib_mad ib_core video1394 ohci1394 raw1394 ieee1394
Sep 20 12:05:30 swlab163 kernel: Pid: 11302, comm: ib_mad2 Not tainted 2.6.13
Sep 20 12:05:30 swlab163 kernel: RIP: 0010:[<ffffffff8807a360>] <ffffffff8807a360>{:ib_ipoib:ipoib_mcast_join_complete+512}
Sep 20 12:05:30 swlab163 kernel: RSP: 0018:ffff810055bc1d38  EFLAGS: 00010247
Sep 20 12:05:30 swlab163 kernel: RAX: 0000000000000000 RBX: ffffffff8807e000 RCX: ffffffff88070e10
Sep 20 12:05:30 swlab163 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8807e000
Sep 20 12:05:30 swlab163 kernel: RBP: ffff810053b10880 R08: ffff810055bc0000 R09: 0000000000000000
Sep 20 12:05:30 swlab163 kernel: R10: 00000000ffffffff R11: ffffffff8055f320 R12: 00000000ffffff92
Sep 20 12:05:30 swlab163 kernel: R13: 0000000000000380 R14: ffff81007e409a78 R15: ffffffff88042bd0
Sep 20 12:05:30 swlab163 kernel: FS:  00002aaaab15db00(0000) GS:ffffffff805d4800(0000) knlGS:0000000056729bb0
Sep 20 12:05:30 swlab163 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 20 12:05:30 swlab163 kernel: CR2: 0000000000000390 CR3: 00000000777d3000 CR4: 00000000000006e0
Sep 20 12:05:30 swlab163 kernel: Process ib_mad2 (pid: 11302, threadinfo ffff810055bc0000, task ffff810054734830)
Sep 20 12:05:30 swlab163 kernel: Stack: ffff81007a8324c0 ffff810054734830 ffffffff805dffb0 ffffffff803f8855
Sep 20 12:05:30 swlab163 kernel:        ffff810055bc1e58 0000000000000296 ffff810054982f90 00000000ffffff92
Sep 20 12:05:30 swlab163 kernel:        ffff81007e409a10 ffffffff88070e5c
Sep 20 12:05:30 swlab163 kernel: Call Trace:<ffffffff803f8855>{thread_return+0} <ffffffff88070e5c>{:ib_sa:ib_sa_mcmember_rec_callback+76}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8807060c>{:ib_sa:send_handler+156} <ffffffff88042d4e>{:ib_mad:timeout_sends+382}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80132ca3>{__wake_up+67} <ffffffff80147e7e>{worker_thread+478}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80132210>{default_wake_function+0} <ffffffff8012f793>{__wake_up_common+67}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80132210>{default_wake_function+0} <ffffffff8014c3d0>{keventd_create_kthread+0}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff80147ca0>{worker_thread+0} <ffffffff8014c3d0>{keventd_create_kthread+0}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8014c529>{kthread+217} <ffffffff8010e50e>{child_rip+8}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8014c3d0>{keventd_create_kthread+0} <ffffffff8014c450>{kthread+0}
Sep 20 12:05:30 swlab163 kernel:        <ffffffff8010e506>{child_rip+0}
Sep 20 12:05:30 swlab163 kernel:
Sep 20 12:05:30 swlab163 kernel: Code: 41 8b 45 10 a8 20 74 3e 41 83 fc 92 75 15 48 8b 3d cb 46 00
Sep 20 12:05:30 swlab163 kernel: RIP <ffffffff8807a360>{:ib_ipoib:ipoib_mcast_join_complete+512} RSP <ffff810055bc1d38>
Sep 20 12:05:30 swlab163 kernel: CR2: 0000000000000390
