[Openib-windows] A few issues running a program that was doing many connects.

Fabian Tillier ftillier at silverstorm.com
Thu Jul 6 13:12:19 PDT 2006


Hi Tzachi,

On 7/6/06, Tzachi Dar <tzachid at mellanox.co.il> wrote:
>
> Hi Fab,
>
> Over the past couple of days I have been studying a small program. This
> program uses WSD and makes many connects to the remote side (its source
> is at the end of this mail).
>
> The program kept running, but Task Manager showed that its memory,
> thread, and handle counts kept going up.
>
> I investigated the problem, and here are my results:
>
> 1) While the program was connecting, the number of WSD sockets that were
> opened and not closed kept increasing. This looks like one reason for the
> leak. When I stopped the main thread from connecting, the sockets appear
> to have been closed. This is some mitigation, but if there is something
> we can do about it, then we probably should.

I saw the same issues with the sockdie WHQL tests.  I went through a
lot of pain to accelerate the disconnect and socket destruction paths
in the WSD provider, and I think they are as good as they'll get.  All
I can think of with this is that the WSD switch queues socket cleanup
on some internal thread, and this doesn't happen as fast as
connections.

Note that one of the WinsockFunctional tests was failing because of
this - the server would close a socket and the client would do a
TransmitFile call that was succeeding when it should not have.
However, upon investigation it was found that the close socket call
from the server didn't actually reach the WSD provider and close the
socket - it was decoupled inside the WSD switch.

> 2) It seems that the mechanism that allocates a CQ and a thread for it
> every time we reach the maximum size is broken. That is, there might be
> empty CQs while new ones are still being opened. This problem will
> probably be smaller once CQ resize is implemented.

Yes, the CQ thread allocation was designed around CQ resize working.
In my previous tests, a single CQ could support 3000 sockets, so the
total number of CQ threads should be quite small.
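
As an illustration of point 2, here is a minimal sketch in plain C of an
allocation policy that avoids the problem described: scan the existing CQs
for free capacity (including ones that have gone empty) before creating a
new CQ and thread. This is not the WSD provider code. The names cq_pool,
cq_pool_attach and cq_pool_detach, and both limits, are hypothetical and
chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CQS        8  /* hypothetical cap on CQs (one thread each) */
#define SOCKETS_PER_CQ 4  /* hypothetical; tiny so the effect is visible */

typedef struct {
    int sockets;          /* sockets currently bound to this CQ */
} cq_slot;

typedef struct {
    cq_slot cq[MAX_CQS];
    int created;          /* number of CQs (and threads) created so far */
} cq_pool;

/* Bind one socket to a CQ. Returns the CQ index, or -1 if every CQ is
 * full and the cap has been reached. */
static int cq_pool_attach(cq_pool *p)
{
    int i;

    /* First look for an existing CQ with room, empty ones included. */
    for (i = 0; i < p->created; i++) {
        if (p->cq[i].sockets < SOCKETS_PER_CQ) {
            p->cq[i].sockets++;
            return i;
        }
    }

    /* Create a new CQ only when all existing ones are full. */
    if (p->created == MAX_CQS)
        return -1;
    i = p->created++;
    p->cq[i].sockets = 1;
    return i;
}

/* Release one socket's slot on the given CQ. */
static void cq_pool_detach(cq_pool *p, int idx)
{
    assert(idx >= 0 && idx < p->created);
    assert(p->cq[idx].sockets > 0);
    p->cq[idx].sockets--;
}
```

With this policy, closing a socket on CQ 0 and then opening another one
reuses CQ 0 instead of growing the number of CQs and threads.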

> 3) The next problem I saw was that the handle count doesn't come down
> even when I stop connecting from time to time. After some debugging I
> understood that the main reason is that these handles represent
> allocations. When they are freed they return to a pool that can only
> increase in size. This is probably fine.

Yes, there are some pools internal to IBAL.  Some of these, like the
MAD and CEP pools, use NonPagedLookasideLists, which should free the
memory if the system starts to be low on memory.
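
The grow-only behaviour from point 3 can be sketched in user-mode C as a
simple free list: freed items go back on the list rather than to the
allocator, so the total never drops unless something trims the list. This
is only an illustration, not IBAL code and not the kernel lookaside list
API. grow_pool, pool_get, pool_put and pool_trim are hypothetical names,
and pool_trim stands in for whatever releases cached items under memory
pressure.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct pool_item {
    struct pool_item *next;   /* link used only while on the free list */
    char payload[64];         /* hypothetical object body */
} pool_item;

typedef struct {
    pool_item *free_list;     /* cached items ready for reuse */
    size_t total;             /* items currently owned by the pool */
    size_t free_count;        /* items sitting on the free list */
} grow_pool;

/* Take an item from the free list, or allocate a new one if it is empty. */
static pool_item *pool_get(grow_pool *p)
{
    pool_item *it = p->free_list;
    if (it != NULL) {
        p->free_list = it->next;
        p->free_count--;
        return it;
    }
    it = malloc(sizeof(*it));
    if (it != NULL)
        p->total++;
    return it;
}

/* Return an item to the pool. Memory is cached, not freed, so the
 * process footprint does not shrink. */
static void pool_put(grow_pool *p, pool_item *it)
{
    it->next = p->free_list;
    p->free_list = it;
    p->free_count++;
}

/* Release every cached item back to the allocator. */
static void pool_trim(grow_pool *p)
{
    while (p->free_list != NULL) {
        pool_item *it = p->free_list;
        p->free_list = it->next;
        free(it);
        p->free_count--;
        p->total--;
    }
}
```

Without a call like pool_trim, total stays at its high-water mark, which
matches counts that go up during a burst of connects and then stay flat.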

> 4) The last issue was memory demand that was constantly increasing. Using
> gflags and umdh.exe I was not able to find any real leak (the program was
> still leaking about 50 MB a minute). After some investigation, I came to
> the conclusion that the memory that was registered in the cache, and on
> which we later called MmSecureVirtualMemory, was leaking. (I'll probably
> have to investigate more to fully understand this, but this is what I
> currently see.) When I removed this call from the driver, the problem
> seemed to go away (as the program got stuck very soon, it is hard to be
> sure).

I'm confused by what you are saying here.  Do you mean that the call
to MmSecureVirtualMemory is causing memory to be leaked?

> So, my questions are: 1) Have you seen this before? 2) One of the WSD
> goals is to run everything that Ethernet can. It seems that this simple
> program will run forever on Ethernet, but only a few seconds on WSD. Can
> we do anything to fix it?

I would expect some (if not most) of these issues to require
Microsoft's involvement.  I agree that the goal should be to be able
to run everything that runs over TCP sockets on the standard network
stack (within reason of course).  Note that due to buffering in NDIS,
some scenarios that work over NDIS will fail over WSD due to protocol
deadlock once transfers go into zero-copy RDMA mode.

- Fab



