[ofa-general] IPoIB kernel Oops -- possible race condition identified.
Jack Morgenstein
jackm at dev.mellanox.co.il
Mon Jan 26 07:41:08 PST 2009
The following Oops occurred several times on an x86 host when unloading the driver:
(console command sequence:
/etc/init.d/openibd start
opensm &
pkill -2 opensm
/etc/init.d/openibd stop
)
********************************************************************
IP: [<f8e67a49>] :ib_ipoib:ipoib_mcast_join_task+0x193/0x217
*pde = 00000000
Oops: 0000 [#1] SMP
...
Pid: 22483, comm: ipoib Not tainted (2.6.27.5 #1)
EIP: 0060:[<f8e67a49>] EFLAGS: 00010286 CPU: 1
EIP is at ipoib_mcast_join_task+0x193/0x217 [ib_ipoib]
EAX: 00000000 EBX: c2060480 ECX: 0005c700 EDX: ffffffff
ESI: c20605dc EDI: c2060154 EBP: c2060480 ESP: f72aff64
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process ipoib (pid: 22483, ti=f72af000 task=f59fcdc0 task.ti=f72af000)
Stack: c2060000 00000004 00000005 00000005 00000001 02500848 00001000 00000000
00000000 00010008 03000001 02001200 00000504 f509bbc0 c2060508 f8e678b6
00000000 c04307a8 f509bbc0 c0430e7c f509bbcc c0430f2f 00000000 f59fcdc0
Call Trace:
[<f8e678b6>] ipoib_mcast_join_task+0x0/0x217 [ib_ipoib]
[<c04307a8>] run_workqueue+0x6a/0xdf
[<c0430e7c>] worker_thread+0x0/0xbd
[<c0430f2f>] worker_thread+0xb3/0xbd
[<c04330a0>] autoremove_wake_function+0x0/0x2d
[<c0432fdf>] kthread+0x38/0x5d
[<c0432fa7>] kthread+0x0/0x5d
[<c0404473>] kernel_thread_helper+0x7/0x10
=======================
EIP: [<f8e67a49>] ipoib_mcast_join_task+0x193/0x217 [ib_ipoib] SS:ESP 0068:f72aff64
**********************************************************************
ipoib_mcast_join_task +0x193 is at (in file ipoib_multicast.c):
priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
I think the problem is the following:
priv->broadcast is set to NULL in ipoib_mcast_dev_flush(), under the protection
of a spinlock.
However, ipoib_mcast_join_task() takes no spinlock when it dereferences
priv->broadcast in the crash line given above.
Note that there seems to be a race condition here.
If the flush occurs after the following test at the start of ipoib_mcast_join_task():
if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
return;
then nothing later in the function protects against priv->broadcast being NULLed elsewhere.
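
To illustrate the kind of guard that seems to be missing, here is a minimal userspace sketch of the pattern. It is not the ib_ipoib code and not a proposed patch: struct priv_model, struct mcast_model, flush_model() and join_task_model() are hypothetical stand-ins, and a pthread mutex stands in for the driver's spinlock. The point is only that the reader takes the same lock as the flush and rechecks the pointer before dereferencing it.

```c
/* Userspace model of the race described above. NOT driver code:
 * every name below is a hypothetical stand-in for illustration. */
#include <pthread.h>
#include <stddef.h>

struct mcast_model {
    int mtu;                        /* stands in for mcmember.mtu */
};

struct priv_model {
    pthread_mutex_t lock;           /* stands in for the spinlock */
    struct mcast_model *broadcast;  /* stands in for priv->broadcast */
    int mcast_mtu;                  /* stands in for priv->mcast_mtu */
};

/* Models the flush path: clears the pointer while holding the lock. */
static void flush_model(struct priv_model *priv)
{
    pthread_mutex_lock(&priv->lock);
    priv->broadcast = NULL;
    pthread_mutex_unlock(&priv->lock);
}

/* Models a guarded join task: the pointer is checked and dereferenced
 * under the same lock the flush uses, so the flush cannot clear it
 * between the check and the use.
 * Returns 0 on success, -1 if broadcast was already cleared. */
static int join_task_model(struct priv_model *priv)
{
    int ret = -1;

    pthread_mutex_lock(&priv->lock);
    if (priv->broadcast != NULL) {
        priv->mcast_mtu = priv->broadcast->mtu;
        ret = 0;
    }
    pthread_mutex_unlock(&priv->lock);
    return ret;
}
```

In this model, a join that runs after the flush returns -1 and leaves mcast_mtu untouched instead of dereferencing NULL. Whether the real fix should take the lock at that point, or make sure the work item cannot run once the flush has started, is a separate question that this sketch does not answer.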
- Jack