[ofa-general] IPoIB kernel Oops -- race condition

Jack Morgenstein jackm at dev.mellanox.co.il
Sun Jun 28 01:17:34 PDT 2009


We have seen the following kernel Oops on IPoIB:
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
Unable to handle kernel paging request for data at address 0x00000054
Faulting instruction address: 0xe60b43c4
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [e60b43c4] ib_sa_free_multicast+0x14/0x148 [ib_sa]
LR [e601fa24] ipoib_mcast_leave+0x140/0x144 [ib_ipoib]
Call Trace:
[de6a1ce0] [1000d3c0] 0x1000d3c0 (unreliable)
[de6a1d00] [e601fa24] ipoib_mcast_leave+0x140/0x144 [ib_ipoib]
[de6a1d60] [e6020904] ipoib_mcast_dev_flush+0x164/0x1a4 [ib_ipoib]
[de6a1db0] [e601e2a4] ipoib_ib_dev_down+0x78/0x130 [ib_ipoib]
[de6a1dd0] [e601c030] ipoib_stop+0xec/0x19c [ib_ipoib]
[de6a1df0] [c01b0e14] dev_close+0x88/0xd8
[de6a1e00] [c01b0c84] dev_change_flags+0x154/0x1a8
[de6a1e20] [c01f6144] devinet_ioctl+0x62c/0x81c
[de6a1e90] [c01f69fc] inet_ioctl+0xcc/0xf8
[de6a1ea0] [c01a12f0] sock_ioctl+0x60/0x2ec
[de6a1ec0] [c008b1d4] vfs_ioctl+0x40/0xc0
[de6a1ee0] [c008b580] do_vfs_ioctl+0x32c/0x498
[de6a1f10] [c008b72c] sys_ioctl+0x40/0x74
[de6a1f40] [c000e780] ret_from_syscall+0x0/0x3c

The scenario: someone ran "ifconfig ib0 down" while IPoIB was still coming up.
This exposes a race between the following two code paths:

  - procedure ipoib_mcast_join():
	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
					 &rec, comp_mask, GFP_KERNEL,
					 ipoib_mcast_join_complete, mcast);
	if (IS_ERR(mcast->mc)) {
		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);

  - procedure ipoib_mcast_leave():
	if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
		ib_sa_free_multicast(mcast->mc);

Problem:  IPOIB_MCAST_FLAG_BUSY is set in join() before the mcast->mc pointer is valid.  If leave() is called
          after the busy flag is set, but before ib_sa_join_multicast() returns, mcast->mc is not yet valid
          (and ib_sa_free_multicast() assumes that mcast->mc is valid).

In fact, the "hole" is as follows:
	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	******* hole starts here ***********
	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
					 &rec, comp_mask, GFP_KERNEL,
					 ipoib_mcast_join_complete, mcast);
	if (IS_ERR(mcast->mc)) {
		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	******* hole ends here ***********
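
The interleaving can be replayed deterministically in a small userspace model.
This is only a sketch: the struct and function names below are hypothetical
stand-ins, not the ib_ipoib code, and the model assumes mcast->mc is still NULL
while the join is in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real ipoib types. */
struct model_multicast { int id; };

struct model_mcast {
	bool busy;                   /* models IPOIB_MCAST_FLAG_BUSY */
	struct model_multicast *mc;  /* models mcast->mc */
};

/* What leave() would end up passing to ib_sa_free_multicast(). */
enum leave_result { LEAVE_SKIPPED, LEAVE_FREED_VALID, LEAVE_FREED_INVALID };

/* First half of join(): the flag is set before mc is assigned. */
void model_join_begin(struct model_mcast *m)
{
	m->busy = true;
}

/* Second half of join(): the SA call has returned, so mc is now valid. */
void model_join_finish(struct model_mcast *m, struct model_multicast *mc)
{
	m->mc = mc;
}

/* Mirrors leave(): test-and-clear the busy flag, then "free" mc unchecked. */
enum leave_result model_leave(struct model_mcast *m)
{
	if (!m->busy)
		return LEAVE_SKIPPED;
	m->busy = false;
	/* The real code would dereference m->mc here without checking it. */
	return m->mc ? LEAVE_FREED_VALID : LEAVE_FREED_INVALID;
}

/* Bad interleaving: leave() runs inside the hole. */
enum leave_result replay_race(void)
{
	struct model_mcast m = { false, NULL };

	model_join_begin(&m);
	assert(m.busy && m.mc == NULL);  /* we are inside the hole */
	return model_leave(&m);
}

/* Safe ordering: leave() runs only after join() has assigned mc. */
enum leave_result replay_no_race(void)
{
	static struct model_multicast mc = { 1 };
	struct model_mcast m = { false, NULL };

	model_join_begin(&m);
	model_join_finish(&m, &mc);
	return model_leave(&m);
}
```

In this model replay_race() reports LEAVE_FREED_INVALID, which corresponds to
ib_sa_free_multicast() being handed a pointer that was never set, while
replay_no_race() reports LEAVE_FREED_VALID.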

I don't yet see a clean fix for this.  We need the busy flag set where it is so that we do not issue multiple joins
and do not call ipoib_send().  Using another flag, say IPOIB_MCAST_FLAG_VALID, that is set only after mcast->mc
is successfully assigned (and tested in leave()) is a problem, because it may lead to memory leaks.
We basically need atomicity here, and spinlocks are not an option.
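
One way the extra-flag idea could leak is sketched below as a userspace model.
This is an assumed reading of the leak, not a trace of the real driver: the
names are hypothetical, and live_joins simply counts joins that were made but
never freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in ib_ipoib. */
struct valid_model {
	bool busy;       /* models IPOIB_MCAST_FLAG_BUSY */
	bool valid;      /* models the proposed IPOIB_MCAST_FLAG_VALID */
	int live_joins;  /* SA joins that were made and not yet freed */
};

/* First half of join(): only the busy flag is set. */
void vm_join_begin(struct valid_model *m)
{
	m->busy = true;
}

/* Second half of join(): mc is assigned, then VALID is set. */
void vm_join_finish(struct valid_model *m)
{
	m->live_joins++;
	m->valid = true;
}

/* leave() with the extra check: free only if VALID is set. */
void vm_leave(struct valid_model *m)
{
	if (!m->busy)
		return;
	m->busy = false;
	if (!m->valid)
		return;          /* mc not assigned yet, so nothing is freed */
	m->valid = false;
	m->live_joins--;
}

/* leave() lands in the hole; returns the number of joins never freed. */
int replay_valid_flag_race(void)
{
	struct valid_model m = { false, false, 0 };

	vm_join_begin(&m);
	vm_leave(&m);        /* in the hole: clears busy, skips the free */
	vm_join_finish(&m);  /* the join now succeeds */
	vm_leave(&m);        /* busy is already clear: nothing is freed */
	assert(!m.busy);
	return m.live_joins;
}

/* leave() runs after join() has finished; returns joins never freed. */
int replay_valid_flag_ordered(void)
{
	struct valid_model m = { false, false, 0 };

	vm_join_begin(&m);
	vm_join_finish(&m);
	vm_leave(&m);
	return m.live_joins;
}
```

In the racy replay one join is left outstanding with the busy flag already
cleared, so no later leave() would free it; in the ordered replay nothing is
left over.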

Any ideas?
-Jack


