[Openib-windows] Receive Queue

Fabian Tillier ftillier at silverstorm.com
Mon Mar 27 11:11:36 PST 2006


Hi Guy,

On 3/26/06, Guy Corem <guyc at voltaire.com> wrote:
>
> Hi Fabian and all,
>
> While testing the Pallas reduce_scatter over IPoIB, with 3 hosts using the
> following command line:
>
> mpiexec.exe -env MPICH_NETMASK 10.0.0.0/255.255.255.0 -hosts 3 10.0.0.1
> 10.0.0.2 10.0.0.3 PMB_MPI1.exe reduce_scatter
>
> I've discovered that the receive queue (which defaults to 128 packets) is being
> exhausted on the second host (10.0.0.2).
>
> The TCP/IP stack is holding all 128 packets, and the only way to regain
> connectivity with this host (even pings do not work, of course) is to
> kill the smpd application (or wait long enough for the TCP timeouts to expire,
> although I didn't actually test that).
>
> When setting the receive queue to 1024 packets, the problem didn't occur.

I think the default queue depths are definitely suboptimal, and we
should probably turn them up.  However, we really need to solve the
problem even at shallow queue depths.

> All my machines are 2-way SMPs. When running with the /ONECPU boot.ini
> parameter, the problem still occurred, but less frequently.
>
> My questions:
> Have you encountered similar situations?

No, I haven't, at least not knowingly.  The majority of my perf
testing with TCP sockets has been with WSD enabled.  I do know that
TCP performance is pretty low, but don't know if it is related.

> I've noticed the "Receive Pool Growth" parameter, but it doesn't seem to be
> "connected". Why?

The NDIS buffer pools don't have a dynamic version that I know of.
The packet pools do, however.  I started putting this in place, but
then hit this issue and left it to be revisited.  From what I can
tell, NdisAllocateBufferPool returns NULL on x86 and x64 (I haven't
checked on IA64), and NdisAllocateBuffer maps to IoAllocateMdl.  I
don't know whether this is something we can rely on, however, and
haven't had time to pursue it.
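
For reference, the calls in question look roughly like this.  It's just a
sketch; num_bufs, p_recv_buf and RECV_BUF_SIZE are placeholder names, not
what the driver actually uses:

    NDIS_STATUS     status;
    NDIS_HANDLE     h_buf_pool;
    PNDIS_BUFFER    p_ndis_buf;

    /* On x86/x64 the returned pool handle comes back NULL, and
     * NdisAllocateBuffer appears to be a thin wrapper around
     * IoAllocateMdl, so buffer descriptors don't seem to be drawn from
     * a fixed-size pool.  That behaviour is undocumented, though, so
     * it may not be safe to rely on. */
    NdisAllocateBufferPool( &status, &h_buf_pool, num_bufs );

    NdisAllocateBuffer( &status, &p_ndis_buf, h_buf_pool,
        p_recv_buf, RECV_BUF_SIZE );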

> If I wanted to "connect" it (i.e. write the appropriate code to handle
> queue growth), what should be done, and where?

There's some work that should be done on the receive side to make
things proper with respect to DMA, and those changes will complicate
dynamic receive buffer pools somewhat.  Currently, the driver doesn't
allocate a common buffer for the receive buffers, but it should.  That
means the current lookaside list for receive buffers would have to
change.  Beyond that, we need to use NdisAllocatePacketPoolEx to
create a dynamic receive packet pool.
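
Something along these lines should do it for the packet pool (a rough
sketch; p_port and params.rq_depth are placeholders for whatever the
driver actually uses):

    NDIS_STATUS     status;
    NDIS_HANDLE     h_packet_pool;

    /* Create the receive packet pool with room to grow: NDIS can expand
     * the pool by up to the overflow count on demand. */
    NdisAllocatePacketPoolEx(
        &status,
        &h_packet_pool,
        p_port->p_adapter->params.rq_depth,         /* initial descriptors */
        p_port->p_adapter->params.rq_depth * 8,     /* overflow descriptors */
        PROTOCOL_RESERVED_SIZE_IN_PACKET );
    if( status != NDIS_STATUS_SUCCESS )
        return status;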

If we run out of buffer space, we should be able to call
NdisMAllocateSharedMemoryAsync to adapt to load dynamically.
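
For example (again just a sketch; __recv_pool_grow, RECV_BUF_SIZE and
RECV_POOL_GROWTH are made-up names).  The completion comes back through
the miniport's AllocateComplete handler, which is where the new memory
would get carved up and added to the pool:

    /* Request another chunk of common (DMA-able) buffer memory.  NDIS
     * calls our AllocateComplete handler when the allocation finishes. */
    NDIS_STATUS
    __recv_pool_grow(
        IN  ipoib_port_t* const     p_port )
    {
        return NdisMAllocateSharedMemoryAsync(
            p_port->p_adapter->h_adapter,       /* miniport adapter handle */
            RECV_BUF_SIZE * RECV_POOL_GROWTH,   /* bytes to add to the pool */
            TRUE,                               /* cached memory is fine here */
            p_port );                           /* context for the completion */
    }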

> And I really don't know if someone can answer this: Why does the Windows
> TCP/IP stack behave this way? Why doesn't it copy the packets in extreme
> situations like the above?

That's because of me...  Miniport drivers have the option of
indicating packets with NDIS_STATUS_RESOURCES, and the driver should
probably be updated to do so.  As it stands, if a miniport indicates a
packet with NDIS_STATUS_SUCCESS, NDIS gets to keep the buffer until it
feels like returning it.

Adding code to return NDIS_STATUS_RESOURCES shouldn't be too difficult:
you'd need to set a low watermark below which you'd indicate that
status, and keep track of how many packets you indicate it for so they
can be reclaimed.
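
A rough sketch of what the indication path might look like (the
recv_mgr fields and the __recv_repost helper are hypothetical, just to
show the shape of it):

    /* If the number of posted receives has fallen below the low watermark,
     * indicate with NDIS_STATUS_RESOURCES so NDIS copies the data and we
     * can recycle the buffer as soon as the indication returns. */
    if( p_port->recv_mgr.depth < p_port->recv_mgr.low_water )
    {
        NDIS_SET_PACKET_STATUS( p_packet, NDIS_STATUS_RESOURCES );
        p_port->recv_mgr.n_resources++;     /* we reclaim this one ourselves */
    }
    else
    {
        NDIS_SET_PACKET_STATUS( p_packet, NDIS_STATUS_SUCCESS );
    }

    NdisMIndicateReceivePacket( p_port->p_adapter->h_adapter, &p_packet, 1 );

    /* Packets indicated with NDIS_STATUS_RESOURCES are never handed to
     * MiniportReturnPacket, so repost them right away. */
    if( NDIS_GET_PACKET_STATUS( p_packet ) == NDIS_STATUS_RESOURCES )
        __recv_repost( p_port, p_packet );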

There is also the receive pool ratio, which determines how many buffers
to keep in the pool as a ratio of the receive queue depth.  It currently
defaults to 1:1, but it could be set higher.

- Fab


