[ofw] RE: [WSD] Duplicate send completion bug

Tzachi Dar tzachid at mellanox.co.il
Wed Dec 19 04:07:14 PST 2007


Thanks for the info Fab,
 
There does indeed seem to be a bug as you describe it.
 
There are a few ways it can be solved, and I would like to know your
opinion before I start.
First, your description talks about a problem in the send code. As far
as I can tell, exactly the same problem also happens in the receive
code, so I guess that a solution will have to solve both problems.
 
I'm looking for a solution that will not introduce new locks if
possible.
 
So, assuming that the problem is in the send path only, I guess that a
simple solution would be to abandon socket_info->send_wr altogether.
Following this approach, we use send_wr.wr_id to hold the pointer to the
overlapped structure itself, and we use its Offset and OffsetHigh fields
to store the socket_info. This seems straightforward, very simple, no
locks. Still, this doesn't solve the receive problem.
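
To make the idea concrete, here is a minimal sketch of the encoding in
portable C. It is not the provider's code: overlapped_like and
socket_info_like are hypothetical stand-ins for OVERLAPPED and the
provider's socket_info, and it assumes a 64-bit wr_id and that the two
32-bit offset fields are free for the provider to use while the request
is in flight.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two OVERLAPPED fields used here. */
struct overlapped_like {
    uint32_t offset;      /* plays the role of OVERLAPPED.Offset */
    uint32_t offset_high; /* plays the role of OVERLAPPED.OffsetHigh */
};

/* Hypothetical stand-in for the provider's per-socket state. */
struct socket_info_like {
    int id;
};

/* Before posting: split the socket pointer across the two 32-bit fields
 * and return the value to place in the work request's 64-bit wr_id. */
static uint64_t encode_wr_id(struct overlapped_like *ov,
                             struct socket_info_like *si)
{
    uint64_t bits = (uint64_t)(uintptr_t)si;

    ov->offset = (uint32_t)(bits & 0xFFFFFFFFu);
    ov->offset_high = (uint32_t)(bits >> 32);
    return (uint64_t)(uintptr_t)ov;
}

/* On completion: recover both pointers from wr_id alone, without
 * touching any shared per-socket WR array. */
static struct overlapped_like *decode_wr_id(uint64_t wr_id,
                                            struct socket_info_like **si_out)
{
    struct overlapped_like *ov = (struct overlapped_like *)(uintptr_t)wr_id;
    uint64_t bits = ((uint64_t)ov->offset_high << 32) | ov->offset;

    *si_out = (struct socket_info_like *)(uintptr_t)bits;
    return ov;
}
```

Because everything needed at completion time travels with the request
itself, no slot is shared between requests and nothing needs locking.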
 
So, assuming the same problem also exists in the receive path, I want to
understand which locks I should use and where.
As far as I can see, there is no single lock that I can take to solve
the problem. First, I'll have to take one lock for the sender and
another lock for the receiver. Second, and probably worse, locking the
complete_wq function itself probably won't work, as the same problem can
happen the minute I leave this function. As such, one will probably have
to take the lock before the call to complete_wq and release it only
after the call to WPUCompleteOverlappedRequest, which is a very wide
lock (or actually locks).
 
So, if I understand correctly, it seems that the right solution is to
build another mechanism for allocating and freeing the WRs, which
probably means one more lock/unlock in order to do the allocation.
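
One possible shape for such a mechanism is a free list guarded by a
short lock. The sketch below is only an illustration: wr_entry, wr_pool
and the pool size are hypothetical names, and the C11 atomic_flag spin
lock is a portable stand-in for whatever lock the provider would
actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define WR_POOL_SIZE 16 /* hypothetical pool size */

struct wr_entry {
    void *overlapped;      /* request this WR currently belongs to */
    struct wr_entry *next; /* free-list link, meaningful only while free */
};

struct wr_pool {
    atomic_flag lock; /* stand-in for the provider's lock */
    struct wr_entry *free_head;
    struct wr_entry entries[WR_POOL_SIZE];
};

static void pool_lock(struct wr_pool *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void pool_unlock(struct wr_pool *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Put every entry on the free list, lowest index first. */
static void wr_pool_init(struct wr_pool *p)
{
    atomic_flag_clear(&p->lock);
    p->free_head = NULL;
    for (int i = WR_POOL_SIZE - 1; i >= 0; i--) {
        p->entries[i].overlapped = NULL;
        p->entries[i].next = p->free_head;
        p->free_head = &p->entries[i];
    }
}

/* Take a WR for a new request; returns NULL if all are in flight. */
static struct wr_entry *wr_alloc(struct wr_pool *p, void *overlapped)
{
    struct wr_entry *wr;

    pool_lock(p);
    wr = p->free_head;
    if (wr != NULL)
        p->free_head = wr->next;
    pool_unlock(p);

    if (wr != NULL) {
        wr->overlapped = overlapped;
        wr->next = NULL;
    }
    return wr;
}

/* Return a WR only after its completion has been fully reported. */
static void wr_free(struct wr_pool *p, struct wr_entry *wr)
{
    wr->overlapped = NULL;
    pool_lock(p);
    wr->next = p->free_head;
    p->free_head = wr;
    pool_unlock(p);
}
```

The lock covers only the list manipulation, not the completion
processing, and an entry cannot be handed out again until whoever is
processing its completion calls wr_free.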
 
Any feedback is welcome.
 
Thanks
Tzachi
 


________________________________

	From: Fab Tillier [mailto:ftillier at windows.microsoft.com] 
	Sent: Wednesday, December 19, 2007 3:06 AM
	To: Tzachi Dar
	Cc: ofw at lists.openfabrics.org
	Subject: [WSD] Duplicate send completion bug
	
	

	Hi Tzachi,

	 

	If you are no longer the WSD maintainer, please forward to the
appropriate person.

	 

	There is a race condition in the WSD provider that results in
memory corruption due to a send OVERLAPPED being reported twice, and one
being dropped.

	 

	Take two threads: one (Thread A) is the application thread,
moving data; the other (Thread B) is the CQ completion thread.  There
are 3 sends posted, so send_cnt == 3, send_idx == 3.

	
	Thread B is in complete_wq, having polled 1 send completion and
processing it when it gets pre-empted by Thread A.  Thread A calls
GetOverlappedResult, polls the CQ and picks up the 2 other send
completions, processes them, and returns, and more send requests are
issued to the provider.  It is possible for Thread A to remain busy
enough processing send and receive completions from the provider that
Thread B doesn't get to complete running.  The send completion that
Thread B is going to process references the send WR (struct _wr) at
index 0.  Thread A completes WR 1 and 2, issues 12 more requests
uneventfully (up to WR 15), all the while processing completions so
that send_cnt is less than the limit.  The next send is the eventful
one, because it uses WR at index 0 again - the WR that Thread B is
currently processing.  It overwrites the OVERLAPPED pointer in the WR
structure at index 0.  When this send completes, it will report the new
OVERLAPPED value.  If Thread B gets to run before Thread A completes
this send, the new OVERLAPPED will be marked completed though it is
still being transferred by the HW.  When the send completes, the
overlapped will be marked complete again, potentially completing another
send prematurely.  The original OVERLAPPED is lost in this case, and
never marked as complete.
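
The sequence above can be replayed deterministically in a single thread.
The sketch below is a simplified model, not the provider's code: the
ring size of 16 is inferred from "up to WR 15", the struct and function
names are hypothetical, and it assumes a slot becomes reusable before
the completion thread has read the OVERLAPPED pointer out of it.

```c
#include <assert.h>
#include <stddef.h>

#define WR_COUNT 16 /* assumed ring size, inferred from "up to WR 15" */

struct ring {
    void *overlapped[WR_COUNT]; /* one OVERLAPPED pointer per WR slot */
    unsigned send_idx;          /* total sends posted; slot = idx % size */
    unsigned send_cnt;          /* sends currently outstanding */
};

/* Post a send: claim the next slot and record its OVERLAPPED. */
static unsigned post_send(struct ring *r, void *ov)
{
    unsigned slot = r->send_idx % WR_COUNT;

    r->overlapped[slot] = ov;
    r->send_idx++;
    r->send_cnt++;
    return slot;
}

/* First half of completion handling: the send is no longer counted. */
static void begin_complete(struct ring *r)
{
    r->send_cnt--;
}

/* Second half, possibly much later: read the OVERLAPPED to report. */
static void *finish_complete(const struct ring *r, unsigned slot)
{
    return r->overlapped[slot];
}

/* Returns 1 if the stalled completion reports the wrong OVERLAPPED. */
static int replay_reports_wrong_overlapped(void)
{
    struct ring r = {0};
    int ov[WR_COUNT + 1]; /* only the addresses are used, as tokens */
    unsigned slot;
    void *reported;

    for (int i = 0; i < 3; i++)
        post_send(&r, &ov[i]); /* slots 0, 1, 2 */

    begin_complete(&r); /* "Thread B" polls slot 0, then stalls */

    for (unsigned s = 1; s <= 2; s++) { /* "Thread A" completes 1 and 2 */
        begin_complete(&r);
        (void)finish_complete(&r, s);
    }
    for (int i = 3; i < WR_COUNT; i++) { /* slots 3..15, uneventful */
        slot = post_send(&r, &ov[i]);
        begin_complete(&r);
        (void)finish_complete(&r, slot);
    }

    slot = post_send(&r, &ov[WR_COUNT]); /* wraps around to slot 0 */
    reported = finish_complete(&r, 0);   /* "Thread B" resumes */

    return slot == 0 && reported == &ov[WR_COUNT] && reported != &ov[0];
}
```

In this model the stalled completion ends up reporting the OVERLAPPED of
the newest send, and the pointer originally stored in slot 0 is never
reported at all.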

	 

	There are several ways of fixing this, ranging from locking
around complete_wq to redesigning the WR usage to not use a circular
array so that entries aren't reused until they are complete.

	 

	-Fab
