[Openib-windows] Using of fast mutexes in WinIb

Leonid Keller leonid at mellanox.co.il
Sun Nov 27 03:36:23 PST 2005


I've studied DMA support in Windows and come to the following
conclusions:

1) Control path: there is no simple way to translate a virtual address
into a logical one.
	The only function that performs this job is
AllocateCommonBuffer, and it requires PASSIVE_LEVEL.
	For user space this means that we can't allocate persistent
buffers (like the CQ ring buffer) in user space; instead we will need to
call into the kernel, allocate the buffer there and map it back to user
space, which can be tricky, because the mapping has to be created in the
address space of the originating process.
[This is ugly, but one can live with it. See the sketch below ...]
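
As a rough illustration of that kernel-side path, here is a minimal
sketch only: alloc_and_map_to_user, its parameters and the error
handling are mine, not existing WinIb code, and it assumes it is called
at PASSIVE_LEVEL in the context of the requesting process:

    /* Sketch: allocate a DMA-capable buffer and map it into the
     * current (originating) user process.  Must run at PASSIVE_LEVEL. */
    NTSTATUS
    alloc_and_map_to_user(
        IN  PDMA_ADAPTER        p_dma,
        IN  ULONG               size,
        OUT PHYSICAL_ADDRESS    *p_logical, /* bus address for the HCA */
        OUT PVOID               *pp_kva,    /* kernel VA */
        OUT PVOID               *pp_uva,    /* user VA in the caller */
        OUT PMDL                *pp_mdl )
    {
        PVOID   kva;
        PMDL    p_mdl;

        /* AllocateCommonBuffer returns both a kernel VA and a
         * logical (bus) address - but only at PASSIVE_LEVEL. */
        kva = p_dma->DmaOperations->AllocateCommonBuffer(
            p_dma, size, p_logical, TRUE );
        if( !kva )
            return STATUS_INSUFFICIENT_RESOURCES;

        p_mdl = IoAllocateMdl( kva, size, FALSE, FALSE, NULL );
        if( !p_mdl )
        {
            p_dma->DmaOperations->FreeCommonBuffer(
                p_dma, size, *p_logical, kva, TRUE );
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        MmBuildMdlForNonPagedPool( p_mdl );

        /* Map the same pages into the requesting process; this is
         * why the call must come from that process' context. */
        __try
        {
            *pp_uva = MmMapLockedPagesSpecifyCache( p_mdl,
                UserMode, MmCached, NULL, FALSE, NormalPagePriority );
        }
        __except( EXCEPTION_EXECUTE_HANDLER )
        {
            IoFreeMdl( p_mdl );
            p_dma->DmaOperations->FreeCommonBuffer(
                p_dma, size, *p_logical, kva, TRUE );
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        *pp_kva = kva;
        *pp_mdl = p_mdl;
        return STATUS_SUCCESS;
    }

(The cleanup path - MmUnmapLockedPages, IoFreeMdl, FreeCommonBuffer - is
omitted for brevity.)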

2) Data path: getting logical addresses forces us to use the Windows DMA
work model.
	This complicates the data path algorithms in the kernel and
causes serious performance penalties, coming, first, from the need to
translate their structures into ours and back, and, much more
importantly, from severe limitations on the number of map registers
available for one transfer operation. In a test I wrote, I didn't manage
to get more than one map register from IoGetDmaAdapter, which means that
in order to send a 1MB buffer I would need to perform 256 data
transfers, each with its own interrupt and DPC processing!
	So our performance would depend on the "generosity" of the
IoGetDmaAdapter function, which in turn depends on several unknown
factors (see the sketch after this item).
	For user space we would need to call into the kernel every time
in order to use these functions, which hurts performance even more.
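
For the record, roughly what the map-register experiment looks like.
This is illustrative only; p_pdo and the DEVICE_DESCRIPTION settings
below are assumptions, not the actual test code:

    /* Sketch of the map-register query.  p_pdo is the HCA's physical
     * device object (illustrative). */
    DEVICE_DESCRIPTION  desc;
    ULONG               n_map_regs;
    PDMA_ADAPTER        p_dma;

    RtlZeroMemory( &desc, sizeof(desc) );
    desc.Version = DEVICE_DESCRIPTION_VERSION2;
    desc.Master = TRUE;                 /* the HCA is a bus master */
    desc.ScatterGather = TRUE;
    desc.Dma64BitAddresses = TRUE;      /* 64-bit capable hardware */
    desc.InterfaceType = PCIBus;
    desc.MaximumLength = 1024 * 1024;   /* we would like 1MB per transfer */

    p_dma = IoGetDmaAdapter( p_pdo, &desc, &n_map_regs );

    /* Each map register covers one page, so one transfer is capped at
     * about n_map_regs * PAGE_SIZE bytes.  With n_map_regs == 1, a 1MB
     * buffer takes 1MB / 4KB = 256 transfers, each with its own
     * interrupt and DPC. */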

Unless I'm missing something here, and taking into account that physical
addresses are equal to logical ones (today) on all x86 and x64 machines
and on most Itaniums, I believe we have no choice but to give up on the
DMA model in this (gen1) release.

I suggest putting this issue on hold until gen2, while trying to prod
Microsoft into providing something more appropriate for OS-bypassing
software and hardware that requires high performance.
	

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com]
> Sent: Thursday, November 17, 2005 11:42 PM
> To: 'Leonid Keller'
> Cc: openib-windows at openib.org
> Subject: RE: [Openib-windows] Using of fast mutexes in WinIb
> 
> 
> > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > Sent: Thursday, November 17, 2005 6:32 AM
> > 
> > > From: Fab Tillier [mailto:ftillier at silverstorm.com]
> > > Sent: Wednesday, November 16, 2005 7:24 PM
> > >
> > > Hi Leo,
> > >
> > > > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > > > Sent: Tuesday, November 15, 2005 7:32 AM
> > > >
> > > > Hi Fab,
> > > > I've come across the following problem: the implementation of
> > > > cl_mutex_acquire() via Fast Mutexes causes all the code in the
> > > > critical section to run at APC_LEVEL.
> > > >
> > > > The first case I saw was at the start-up of the IPoIB driver,
> > > > which takes a mutex in __ipoib_pnp_cb and makes all MTHCA driver
> > > > control verbs run at APC_LEVEL, which is troublesome
> > > > (e.g., create_cq calls AllocateCommonBuffer, which requires
> > > > PASSIVE_LEVEL).
> > >
> > > Why does create_cq call AllocateCommonBuffer?  Are you making
> > > multiple page-sized calls instead of a single larger call for the
> > > cases where the memory required spans multiple pages?  Physically
> > > contiguous memory is a scarce resource, so the more you can break
> > > up your requests into page-sized requests, the better.
> > 
> > The allocation algorithm tries to allocate one contiguous buffer.
> > If that fails, it requests smaller buffers; in the worst case it
> > will allocate N buffers of one page each.
> 
> Are there performance advantages to allocating a contiguous buffer,
> or does it not make a difference?  We should try to avoid requiring
> contiguous memory altogether if it doesn't have a performance benefit.
> 
> > I used AllocateCommonBuffer in the first place because it returns
> > bus addresses.
> 
> Ok, I see.  This doesn't help user-mode clients, though - their
> buffers are allocated and must be registered.  I think it would be
> simpler to combine both kernel and user logic so that it is similar.
> The issues of proper DMA mappings for memory registrations need to be
> solved no matter how things are allocated in the kernel, and once we
> have a solution for user-mode, it will also apply to kernel mode.
> 
> > One can implement that so it works at DISPATCH_LEVEL:
> > 	va = MmAllocateContiguousMemorySpecifyCache( ... );
> > 	p_mdl = IoAllocateMdl( va, ... );
> > 	MmBuildMdlForNonPagedPool( p_mdl );
> > 	la = p_adapter->DmaOperations->MapTransfer( p_adapter, p_mdl, ... );
> > 
> > It has 2 little drawbacks as far as I see:
> > 	1) MmAllocateContiguousMemorySpecifyCache always allocates an
> > integer number of pages;
> > 	2) MapTransfer fails when the number of map registers is
> > exceeded. (But maybe AllocateCommonBuffer will also fail in this
> > case.)
> > 
> > What do you think?
> 
> Again, if contiguous memory doesn't have a performance benefit, we can
> just use ExAllocatePoolWithTag, followed by IoAllocateMdl and
> MmBuildMdlForNonPagedPool as in your example above.  For user-mode,
> the only additional step would be a call to MmProbeAndLockPages, since
> the memory would be pagable.
> 
> Since the HCAs are all 64-bit hardware, I don't think we need to use
> MapTransfer.  I think we should use GetScatterGatherList instead, as
> it more closely provides the functionality we need.
> 
> Note that both MapTransfer and GetScatterGatherList (as well as any
> DMA mapping functions) will not work properly if Driver Verifier is
> enabled to check DMA usage.  Driver Verifier breaks any driver that
> depends on sharing the buffer.  In this case, AllocateCommonBuffer is
> the only way that I know of, but that leaves user-mode out of luck.
> 
> So given that we need to support user-mode, we need to find a way to
> properly DMA map long-term memory registrations without Driver
> Verifier breaking things.  I'm looking into this with Microsoft and
> will report back any findings.
> 
> - Fab
> 
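
For reference, a rough sketch of the user-buffer locking step Fab
mentions above (illustrative only - the helper name and error handling
are mine; how to obtain logical addresses for these pages without
upsetting Driver Verifier is still the open question):

    /* Sketch: pin a user buffer for a long-term registration.  Must be
     * called in the owning process' context at IRQL <= APC_LEVEL. */
    NTSTATUS
    lock_user_buffer(
        IN  PVOID   user_va,
        IN  ULONG   length,
        OUT PMDL    *pp_mdl )
    {
        PMDL p_mdl = IoAllocateMdl( user_va, length, FALSE, TRUE, NULL );
        if( !p_mdl )
            return STATUS_INSUFFICIENT_RESOURCES;

        __try
        {
            /* The user memory is pagable, so lock it down. */
            MmProbeAndLockPages( p_mdl, UserMode, IoModifyAccess );
        }
        __except( EXCEPTION_EXECUTE_HANDLER )
        {
            IoFreeMdl( p_mdl );
            return GetExceptionCode();
        }

        /* MmGetMdlPfnArray( p_mdl ) now gives the physical pages; how
         * to turn them into logical addresses without breaking under
         * Driver Verifier is what still needs an answer. */
        *pp_mdl = p_mdl;
        return STATUS_SUCCESS;
    }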


