[Openib-windows] Using of fast mutexes in WinIb

Leonid Keller leonid at mellanox.co.il
Mon Nov 28 11:49:37 PST 2005



> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com]
> Sent: Monday, November 28, 2005 7:47 PM
> To: Leonid Keller
> Cc: openib-windows at openib.org; Erez Cohen
> Subject: RE: [Openib-windows] Using of fast mutexes in WinIb
> 
> 
> Hi Leo,
> 
> > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > Sent: Sunday, November 27, 2005 3:36 AM
> > 
> > I've studied DMA support in Windows and come to the following
> > conclusions:
> > 
> > 1) Control path: There is no simple way to translate virtual address
> > into logical.
> > 	The only function, that performs this job is
> > AllocateCommonBuffer. It works on PASSIVE_LEVEL.
> > 	For user space it means, that we can't allow to allocate
> > persistent buffers (like CQ ring buffer) in user space, but 
> will need to
> > call into kernel, allocate the buffer there and map it back 
> to the user
> > space, which can be tricky, because requires to reside in the
> > process-originator's address space.
> > [this is ugly, but one can live with that. see now below ...]
> 
> AllocateCommonBuffer doesn't work for applications 
> registering memory, so I
> don't think working around it for the CQE and WQE rings is 
> worth it.  We need to
> find a way to perform long-term DMA mappings, but I don't see 
> that happening
> without some help from Microsoft (which I'm working on).
> 
> > 2) Data path: getting logical addresses imposes using of Windows DMA
> > work model.
> > 	It complicates data path algorithms in kernel and causes serious
> > performance penalties, coming, first, from the need in 
> translating their
> > structures into ours and back, and, which is much more 
> important, from
> > severe limitations on number of map registers for one transfer
> > operation. In a test i wrote, i didn't manage to get more 
> than one map
> > register from IoGetDmaAdapter, which means, that in order to send 1MB
> > buffer i'll need to perform 256 data transfers with an 
> interrupt and DPC
> > processing in each one !!!
> 
> Normally, if you're doing 64-bit DMA and the PCI adapter 
> advertises 64-bit
> support, you should not need any map registers at all (which 
> might be why it
> returns a value of 1).
>  
I'm not sure I follow you here.
If we want our code to work on all platforms, we can't presume that we
do not need mapping registers because of this or that. We *must* use
their model, while perhaps hoping that on xYY platforms the DMA
functions will have little overhead. But this hope fails, because they
limit the transfer (in my case) to 1 map register, i.e. to a 4KB buffer
size, which is totally unacceptable.

> Mapping registers are only needed if the hardware can't 
> access addresses above
> the 4GB mark, in which case the OS maps the buffers to the 
> lower address space
> to make them available.  

No, they are also necessary on some architectures where devices have
a genuinely different address space than the CPU.
Examples: some flavours of Itanium. That's why the Linux code uses DMA
functions for getting logical (they call them DMA) addresses.

> Driver Verifier complicates things 
> in that it allocates
> an intermediate buffer that it copies to/from before/after 
> the DMA operation, so
> that the hardware actually performs it's DMA to Verifier's 
> buffer, not the
> user's.  This allows Verifier to check that DMA operations 
> are done correctly,
> but totally breaks the semantics of memory registration that 
> we need for IB.

Driver Verifier is a good tool, but there were times when we lived
without it.
So it's uncomfortable, but not critical, in my opinion.

> 
> Did you try to call GetScatterGatherList to see how much of 
> the buffer it mapped
> when it gives you the SGL?

Yes, I gave it my buffer of 0x40000 bytes and got the error
"Insufficient resources", meaning, as I understand it, that it has only
one mapping register and can map only 4KB!

> 
> > 	For user space we will need to call into kernel every time in
> > order to use these functions, which screws the performance 
> more badly.
> 
> Not to mention that it's not the natural programming model 
> for RDMA, where
> ideally the application registers its buffers up front 
> independently of I/O
> operations.  There may be no local I/O operation if the 
> registered memory is the
> target of an RDMA read or write, and thus no kernel 
> transition to perform
> mappings.
> 

Agreed, a very good point.

> > Unless i'm missing here something and taking into account, 
> that physical
> > addresses are equal to logical ones (today) on all x32 and 
> x64 machines
> > and on most Itaniums, i believe, we have no other choice 
> than to give up
> > using DMA model in this (gen1) release.
> 
> I agree that we can't use the DMA model for user-mode I/O.  
> For kernel mode, we
> should be able to have all buffers properly mapped since 
> there shouldn't be
> mapping registers involved.  

There is no mapping operation for already-allocated buffers; the
mapping is performed only during a transfer.
So we can't use the DMA model in kernel mode either.

>We can also use AllocateCommonBuffer in the kernel
> for things like the CQE and WQE rings, 

It is already in the code, and I think we can leave it for now, because
it doesn't seem to add overhead.

> but I think we can delay that until we
> solve the user-mode issue.
> 

> I believe that mapping registers are a separate issue from 
> CPU/Bus address
> inconsistency

Why? I got a different impression from the description of the DMA model:
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/dma.doc

, and kernel ULPs should do per-I/O DMA 
> operations ideally using
> physical addresses for registrations.  IPoIB and SRP do this 
> currently thanks to
> their respective port drivers - I don't know if SDP does.  

I'm not sure about SRP, but IPoIB doesn't send long messages, so it can
live with such limitations.

> As 
> long as we don't
> enable DMA checking when using driver verifier, we should be OK.
> 
> > I suggest to put this issue on hold till gen2 while trying to prod
> > Microsoft to suggest something more appropriate for OS-bypassing
> > software and hardware, requiring high performance.
> 
> I agree - if nothing else, I'd like to see us come out with a 
> release no later
> than 1Q06, even if it has minimal content (more on that 
> later).  Without some
> input from Microsoft we won't get very far solving these issues.
> 
> - Fab
> 


