[ofa-general] IPoIB kernel Oops -- race condition

Sun Jun 28 09:09:57 PDT 2009

Jack Morgenstein wrote:
> We have seen the following kernel Oops on IPoIB:
> ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
> Unable to handle kernel paging request for data at address 0x00000054
> Faulting instruction address: 0xe60b43c4
> Oops: Kernel access of bad area, sig: 11 [#1]
> ...
> NIP [e60b43c4] ib_sa_free_multicast+0x14/0x148 [ib_sa]
> LR [e601fa24] ipoib_mcast_leave+0x140/0x144 [ib_ipoib]
> Call Trace:
> [de6a1ce0] [1000d3c0] 0x1000d3c0 (unreliable)
> [de6a1d00] [e601fa24] ipoib_mcast_leave+0x140/0x144 [ib_ipoib]
> [de6a1d60] [e6020904] ipoib_mcast_dev_flush+0x164/0x1a4 [ib_ipoib]
> [de6a1db0] [e601e2a4] ipoib_ib_dev_down+0x78/0x130 [ib_ipoib]
> [de6a1dd0] [e601c030] ipoib_stop+0xec/0x19c [ib_ipoib]
> [de6a1df0] [c01b0e14] dev_close+0x88/0xd8
> [de6a1e00] [c01b0c84] dev_change_flags+0x154/0x1a8
> [de6a1e20] [c01f6144] devinet_ioctl+0x62c/0x81c
> [de6a1e90] [c01f69fc] inet_ioctl+0xcc/0xf8
> [de6a1ea0] [c01a12f0] sock_ioctl+0x60/0x2ec
> [de6a1ec0] [c008b1d4] vfs_ioctl+0x40/0xc0
> [de6a1ee0] [c008b580] do_vfs_ioctl+0x32c/0x498
> [de6a1f10] [c008b72c] sys_ioctl+0x40/0x74
> [de6a1f40] [c000e780] ret_from_syscall+0x0/0x3c
> 
> Scenario was that someone performed "ifconfig ib0 down" while IPoIB was coming up.
> Problem: a race-condition hole:
> 
>   - procedure ipoib_mcast_join():
> 	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
> 					 &rec, comp_mask, GFP_KERNEL,
> 					 ipoib_mcast_join_complete, mcast);
> 	if (IS_ERR(mcast->mc)) {
> 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> 
>   - procedure ipoib_mcast_leave():
> 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> 		ib_sa_free_multicast(mcast->mc);
> 
> Problem:  IPOIB_MCAST_FLAG_BUSY is set in join() before the mcast->mc pointer is valid.  If leave() is called
>           after the busy-flag is set, but before ib_sa_join_multicast returns, mcast->mc will not be valid
>           (and ib_sa_free_multicast() assumes that mcast->mc is valid).
> 
> In fact, the "hole" is as follows:
> 	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> 	******* hole starts here ***********
> 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
> 					 &rec, comp_mask, GFP_KERNEL,
> 					 ipoib_mcast_join_complete, mcast);
> 	if (IS_ERR(mcast->mc)) {
> 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> 	******* hole ends here ***********
> 
> I don't yet see a clean fix for this.  We need the busy flag set where it is so that we do not do multiple joins,
> and do not do ipoib_send(). Using another flag, say IPOIB_MCAST_FLAG_VALID, and setting it after mcast->mc
> is successfully assigned (and testing for that in leave() ) is a problem, because it may lead to memory leaks.
> We basically need atomicity here, and spinlocks are not an option.
> 
> Any ideas?
> -Jack
> 
Maybe synchronizing the race with a completion variable (as IPoIB does in struct ipoib_path) will help: leave() would wait on it before touching mcast->mc. I think this will work. I can send a patch if you want, unless you see a reason this idea won't work for this case.
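
A rough userspace model of the idea, for illustration only. It assumes a completion that join() signals once mcast->mc has been assigned, and that leave() waits on before using mcast->mc. pthreads stand in for the kernel's struct completion (init_completion / complete / wait_for_completion), and every name in the block (mcast_model, fake_sa_join, run_race_demo, ...) is hypothetical, not taken from the IPoIB source.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_model {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int done;
};

static void init_completion_model(struct completion_model *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

static void complete_model(struct completion_model *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion_model(struct completion_model *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

struct mcast_model {
	atomic_int busy;              /* models IPOIB_MCAST_FLAG_BUSY */
	int *mc;                      /* models mcast->mc */
	struct completion_model done; /* signalled once mc has been assigned */
	int freed;                    /* set when leave released a valid mc */
};

static void sleep_ms(long ms)
{
	struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
	nanosleep(&ts, NULL);
}

/* Models a slow ib_sa_join_multicast(); the sleep is the "hole". */
static int *fake_sa_join(void)
{
	sleep_ms(50);
	return malloc(sizeof(int));
}

static void *join_thread(void *arg)
{
	struct mcast_model *m = arg;

	atomic_store(&m->busy, 1);
	m->mc = fake_sa_join();
	if (!m->mc)                   /* join failed: drop the busy flag */
		atomic_store(&m->busy, 0);
	complete_model(&m->done);     /* mc is now settled either way */
	return NULL;
}

static void leave_model(struct mcast_model *m)
{
	if (atomic_exchange(&m->busy, 0)) {
		/* Wait until join has finished assigning mc. */
		wait_for_completion_model(&m->done);
		if (m->mc) {
			free(m->mc);
			m->mc = NULL;
			m->freed = 1;
		}
	}
}

/* Returns 1 if leave, called inside the hole, freed a valid mc. */
int run_race_demo(void)
{
	struct mcast_model m = { .mc = NULL, .freed = 0 };
	pthread_t t;

	atomic_init(&m.busy, 0);
	init_completion_model(&m.done); /* must happen before busy is set */
	if (pthread_create(&t, NULL, join_thread, &m))
		return -1;
	while (!atomic_load(&m.busy))   /* wait until join is inside the hole */
		sleep_ms(1);
	leave_model(&m);
	pthread_join(t, NULL);
	return m.freed;
}
```

Here leave_model() is deliberately called while the join is still in flight; it blocks until the pointer is settled instead of freeing garbage. Whether sleeping like this is acceptable in every context where ipoib_mcast_leave() runs would still need to be checked against the real code.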

MoniS