[Openib-windows] Using of fast mutexes in WinIb
Leonid Keller
leonid at mellanox.co.il
Mon Nov 28 11:49:37 PST 2005
> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com]
> Sent: Monday, November 28, 2005 7:47 PM
> To: Leonid Keller
> Cc: openib-windows at openib.org; Erez Cohen
> Subject: RE: [Openib-windows] Using of fast mutexes in WinIb
>
>
> Hi Leo,
>
> > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > Sent: Sunday, November 27, 2005 3:36 AM
> >
> > I've studied DMA support in Windows and come to the following
> > conclusions:
> >
> > 1) Control path: There is no simple way to translate virtual address
> > into logical.
> > The only function, that performs this job is
> > AllocateCommonBuffer. It works on PASSIVE_LEVEL.
> > For user space it means, that we can't allow to allocate
> > persistent buffers (like CQ ring buffer) in user space, but
> will need to
> > call into kernel, allocate the buffer there and map it back
> to the user
> > space, which can be tricky, because requires to reside in the
> > process-originator's address space.
> > [this is ugly, but one can live with that. see now below ...]
>
> AllocateCommonBuffer doesn't work for applications
> registering memory, so I
> don't think working around it for the CQE and WQE rings is
> worth it. We need to
> find a way to perform long-term DMA mappings, but I don't see
> that happening
> without some help from Microsoft (which I'm working on).
>
> > 2) Data path: getting logical addresses imposes using of Windows DMA
> > work model.
> > It complicates data path algorithms in kernel and causes serious
> > performance penalties, coming, first, from the need in
> translating their
> > structures into ours and back, and, which is much more
> important, from
> > severe limitations on number of map registers for one transfer
> > operation. In a test i wrote, i didn't manage to get more
> than one map
> register from IoGetDmaAdapter, which means, that in order to send 1MB
> > buffer i'll need to perform 256 data transfers with an
> interrupt and DPC
> > processing in each one !!!
>
> Normally, if you're doing 64-bit DMA and the PCI adapter
> advertises 64-bit
> support, you should not need any map registers at all (which
> might be why it
> returns a value of 1).
>
I'm not sure I follow you here.
If we want our code to work on all platforms, we can't presume that we
do not need map registers because of this or that. We *must* use
their model, while maybe hoping that on xYY platforms the DMA functions
will have little overhead. But this hope fails, because they limit the
transfer (in my case) to 1 map register, i.e. to a 4KB buffer size, which
is totally unacceptable.
> Mapping registers are only needed if the hardware can't
> access addresses above
> the 4GB mark, in which case the OS maps the buffers to the
> lower address space
> to make them available.
No, they are also necessary on some architectures where devices really
have a different address space than the CPU.
Examples: some flavours of Itanium. That's why Linux code uses DMA
functions for getting logical (they call them DMA-) addresses.
> Driver Verifier complicates things
> in that it allocates
> an intermediate buffer that it copies to/from before/after
> the DMA operation, so
> that the hardware actually performs its DMA to Verifier's
> buffer, not the
> user's. This allows Verifier to check that DMA operations
> are done correctly,
> but totally breaks the semantics of memory registration that
> we need for IB.
Driver Verifier is a good tool, but there were times when we lived
without it.
So it's inconvenient, but not critical, in my opinion.
>
> Did you try to call GetScatterGatherList to see how much of
> the buffer it mapped
> when it gives you the SGL?
Yes, I gave it my buffer of size 0x40000 and got the error "Insufficient
resources", meaning, as I understand it, that it has only one map register
and can map only 4KB!
>
> > For user space we will need to call into kernel every time in
> > order to use these functions, which screws the performance
> more badly.
>
> Not to mention that it's not the natural programming model
> for RDMA, where
> ideally the application registers its buffers up front
> independently of I/O
> operations. There may be no local I/O operation if the
> registered memory is the
> target of an RDMA read or write, and thus no kernel
> transition to perform
> mappings.
>
Agreed, a very good point.
> > Unless i'm missing here something and taking into account,
> that physical
> > addresses are equal to logical ones (today) on all x32 and
> x64 machines
> > and on most Itaniums, i believe, we have no other choice
> than to give up
> > using DMA model in this (gen1) release.
>
> I agree that we can't use the DMA model for user-mode I/O.
> For kernel mode, we
> should be able to have all buffers properly mapped since
> there shouldn't be
> mapping registers involved.
There is no mapping operation for already-allocated buffers; the mapping
is performed only during a transfer.
So we can't use the DMA model in kernel mode either.
>We can also use AllocateCommonBuffer in the kernel
> for things like the CQE and WQE rings,
It is already in the code, and I think we can leave it for now, because it
doesn't seem to add overhead.
> but I think we can delay that until we
> solve the user-mode issue.
>
> I believe that mapping registers are a separate issue from
> CPU/Bus address
> inconsistency
Why? I got a different impression from the description of the DMA model:
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/dma.doc
> , and kernel ULPs should do per-I/O DMA
> operations ideally using
> physical addresses for registrations. IPoIB and SRP do this
> currently thanks to
> their respective port drivers - I don't know if SDP does.
I'm not sure about SRP, but IPoIB doesn't send long messages, so it can
live with such limitations.
> As
> long as we don't
> enable DMA checking when using driver verifier, we should be OK.
>
> > I suggest to put this issue on hold till gen2 while trying to prod
> > Microsoft to suggest something more appropriate for OS-bypassing
> > software and hardware, requiring high performance.
>
> I agree - if nothing else, I'd like to see us come out with a
> release no later
> than 1Q06, even if it has minimal content (more on that
> later). Without some
> input from Microsoft we won't get very far solving these issues.
>
> - Fab
>