[Openib-windows] Windows DMA model

Leonid Keller leonid at mellanox.co.il
Sun Jan 22 04:29:06 PST 2006


 

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com] 
> Sent: Thursday, January 19, 2006 8:22 PM
> To: Leonid Keller
> Cc: Jan Bottorff; openib-windows at openib.org
> Subject: RE: [Openib-windows] Windows DMA model
> 
> > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > Sent: Thursday, January 19, 2006 12:57 AM
> > 
> > > -----Original Message-----
> > > From: Fab Tillier [mailto:ftillier at silverstorm.com]
> > > Sent: Wednesday, January 18, 2006 7:57 PM
> > >
> > > DMA support isn't broken at all for clients that perform mappings
> > > on a per-request basis, which is adequate for most existing
> > > hardware.  The existing DMA APIs support scatter/gather and 64-bit
> > > addressing just fine.  The problem with IB is that of memory
> > > registration, which is unique to RDMA hardware.  The DMA APIs were
> > > not designed to allow arbitrary buffers to be converted to common
> > > buffers (which is really what we're looking for here).
> > 
> > I didn't say it's broken, it's just uncomfortable.
> > They make one build sg lists, which one doesn't need; they make one
> > release mapping registers after every transfer, which is not
> > efficient; they do not allow one to start his transfer immediately,
> > but start it themselves at their own discretion ...
> > Yes, I understand that they wanted to dispense the map registers
> > fairly between the consumers, but they didn't do that with any other
> > resource - memory, handles, sync primitives et al.  Why the heck do
> > that here?
> 
> They do this to provide a common HAL interface so that 
> drivers can be written generically. The result is that 
> properly written drivers need only build for the CPU 
> architecture, and can let the HAL take care of hardware 
> differences beyond that.  The API definition provides the 
> maximum amount of flexibility.  It doesn't have to be 
> inefficient either, as the callback function could be invoked 
> from the context of the caller.

It's not that important, because it has no practical consequence, but
just for the sake of discussion ...

I don't understand why a map register, as a resource, needs to be
different from other resources.
The OS doesn't limit a driver in its use of spinlocks, handles, memory,
threads et al.; it doesn't allocate those resources for the driver,
doesn't make it use them only in transfer operations, etc.
By inefficiency I meant the overhead of allocating/releasing the mapping
registers on every transfer operation instead of doing it only once -
when it is appropriate, of course, as in our registration model.
I don't see any connection with general use: all other APIs (memory,
handles, sync primitives ...) are also generic, but do not impose any
such restrictions.
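
Just to illustrate the overhead I mean (this is my own sketch of the
standard per-transfer model, not something from this thread; the helper
names MyStartDmaTransfer and MyDmaCallback are invented), every single
transfer today goes through a map/unmap cycle:

    /* Sketch of the existing per-transfer DMA model: map registers are
     * granted in the callback and must be returned with
     * PutScatterGatherList after every transfer. */
    VOID MyDmaCallback(PDEVICE_OBJECT DevObj, PIRP Irp,
                       PSCATTER_GATHER_LIST Sgl, PVOID Context)
    {
        /* program the hardware with Sgl->Elements[i].Address/.Length;
         * once the transfer completes, give the map registers back:
         * DmaAdapter->DmaOperations->PutScatterGatherList(DmaAdapter,
         *                                                 Sgl, TRUE);
         */
    }

    NTSTATUS MyStartDmaTransfer(PDMA_ADAPTER DmaAdapter,
                                PDEVICE_OBJECT DevObj,
                                PMDL Mdl, PVOID Va, ULONG Length)
    {
        /* flush CPU caches before the device reads the buffer */
        KeFlushIoBuffers(Mdl, FALSE, TRUE);
        return DmaAdapter->DmaOperations->GetScatterGatherList(
            DmaAdapter, DevObj, Mdl, Va, Length,
            MyDmaCallback, NULL, TRUE /* WriteToDevice */);
    }

With a registration model all of this would be done once, at
registration time, instead of on every transfer.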

>  
> > > > I believe it would be enough to have only 5 functions (let's
> > > > call them NewDmaApi):
> > > >
> > > > With such an API and cachable mapping registers we could work
> > > > both in kernel and user mode without changing our model.  It
> > > > will also work in a guest OS.
> > >
> > > I don't think we need that many APIs - two should suffice:
> > 
> > Really, I've also suggested only two new APIs, the same as you.
> > 2 flushing functions exist today: KeFlushIoBuffers and 
> > FlushAdapterBuffers, but the latter one is called implicitly upon 
> > PutScatterGatherList.
> 
> You've also introduced the IoGetDmaAdapterEx, which isn't 
> necessary if you're dealing with common buffers (which is 
> really what you're trying to do - make a buffer accessible by 
> both SW and HW).

It's not a new function, just an improved version of the existing one,
which returns some capability info so that drivers don't have to guess
and experiment, which can lead a driver to a wrong conclusion and, as a
consequence, a BSOD.
It doesn't relate to our problem: it is just a deficiency of the
existing interface.
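
Something along these lines (purely a sketch of the idea - no such
function exists in the DDK, and the capability flag names below are
invented for illustration):

    /* Hypothetical: IoGetDmaAdapter extended with an OUT parameter that
     * reports what the DMA support can actually do, so a driver does not
     * have to probe it experimentally. */
    #define DMA_CAP_64BIT_ADDRESSING    0x00000001
    #define DMA_CAP_NO_DOUBLE_BUFFERING 0x00000002
    #define DMA_CAP_NONCACHED_MAPPING   0x00000004

    PDMA_ADAPTER
    IoGetDmaAdapterEx(
        IN  PDEVICE_OBJECT PhysicalDeviceObject,
        IN  PDEVICE_DESCRIPTION DeviceDescription,
        OUT PULONG NumberOfMapRegisters,
        OUT PULONG CapabilityFlags    /* the new OUT Flags parameter */
        );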

> 
> > I suggested some little things, facilitating our work and not
> > affecting others:
> > 	- an OUT Flags parameter to IoGetDmaAdapter, to prevent guessing
> > experiments on the capabilities of the DMA support;
> > 	- an optional CpuNonCachable flag to make the user buffer
> > uncachable and avoid calling KeFlushIoBuffers before every transfer;
> > 	- an optional DmaNonCachable flag to turn off DMA caching, if
> > possible, and avoid calling FlushAdapterBuffers after every transfer
> > operation.
> > The latter 2 options save us 2 kernel calls on every transfer
> > operation of a user application and therefore enable a real kernel
> > bypass.
> 
> You don't need to call any flushing functions when dealing 
> with common buffers, which is why I suggested an API that 
> comes close to common buffer usage.  It eliminates the need 
> for IoGetDmaAdapterEx.  The idea behind my suggestion was to 
> introduce a way to make any buffer a common buffer.  The 
> existing API is restrictive because it allocates the memory 
> as well as maps it.  I would venture that internally, 
> AllocateCommonBuffer is split into a memory allocation call 
> and a call to make that memory accessible for DMA.
> 
> Common buffers also don't require experiments on double 
> buffering.  The AllocateCommonBuffer API takes a flag to 
> control whether the buffer is cacheable, which should be 
> sufficient for our purposes.
> 

My feeling is that we fully agree both in approach and goals and differ
only in our understanding of the technical details.
In my understanding, a buffer allocated by malloc is cachable, and we
want to MAKE it uncachable both for the CPU and for DMA.  I guess that
in your understanding a Common Buffer is already such a buffer (i.e. -
twice uncachable).  So, to use your language, I wanted a user buffer to
become a Common Buffer, but I thought that, first, a generic API has to
express that desire explicitly, and, second, I was (and am) not sure
that it is always possible.  I mean the DMA-side cache in the case of
HW mapping registers or PCI bridge caches.  If it already is - so much
the better.
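
For reference (my illustration, not something from the thread), the
existing common-buffer API fuses allocation and mapping in one call,
which is exactly why it cannot be applied to a buffer the user already
owns:

    /* The existing model: the system both allocates the memory and makes
     * it DMA-accessible; CacheEnabled = FALSE asks for a non-cached
     * mapping. */
    PHYSICAL_ADDRESS LogicalAddress;
    PVOID Va = DmaAdapter->DmaOperations->AllocateCommonBuffer(
                   DmaAdapter,
                   BufferLength,
                   &LogicalAddress,
                   FALSE /* CacheEnabled */);

The proposal below is, in effect, the second half of this call applied
to an already existing buffer.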


> > > SCATTER_GATHER_LIST IoMapCommonBuffer( DMA_ADAPTER *pAdapter,
> > >     VOID *pBuf, ULONG Length, BOOLEAN CacheEnabled,
> > >     HANDLE *phCommonBuffer );
> > > VOID IoUnmapCommonBuffer( HANDLE hCommonBuffer );
> > >
> > > The pBuf parameter to IoMapCommonBuffer could be changed to be an
> > > MDL list, which might make more sense.  The MDL would be for a
> > > probed and locked buffer, or for non-paged pool.
> > 
> > It's bad for user buffers, because it makes you first map the
> > buffer into kernel space, which is not scalable.
> 
> No, you misunderstood.  The buffer used in IoMapCommonBuffer 
> could be allocated through any memory allocation call.  The 
> buffers must be probed and locked regardless, whether you 
> hide that internally to the function or not.  I'm just 
> leaving it explicit.  The IB memory registration call would 
> internally, for user-mode buffers, lock the pages down and 
> then map them as a common buffer.

Yes, I missed that about the MmProbeAndLockPages function.  If it
doesn't do any mapping, there is no need to include it implicitly in
IoMapCommonBuffer.

> That's still a single kernel transition from the client's perspective.
> MmProbeAndLockPages doesn't map the pages into the kernel's 
> address space - it just makes them resident and locks them 
> down.  If we just pass a virtual address, we need to add to 
> IoMapCommonBuffer all the parameters that MmProbeAndLockPages 
> takes (AccessMode and Operation).  There's no reason to make 
> the function that complicated.
> 
> The sequence of events would be something like this:
> 
> - User calls RegMem with a malloc'd buffer
> - IBAL makes kernel transition
> - kernel proxy builds MDL list for user's buffer
> - kernel proxy calls MmProbeAndLockPages on each MDL in the list
> - kernel proxy calls HCA driver's virtual registration call 
> with MDL list
> - HCA driver calls IoMapCommonBuffer to get SGL for the 
> registered buffer
> - HCA driver uses SGL to program the HCA's translation table
> 

Agreed.  I was moving toward the same sequence.
Whether the return value is an SGL or an array of logical addresses is
not too important.
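
In code, the kernel-proxy part of that sequence would look roughly like
this (my sketch only: IoAllocateMdl and MmProbeAndLockPages are real DDK
calls, while IoMapCommonBuffer is the API *proposed* in this thread - I
assume the MDL variant and a pointer return value here, and the helper
name ProxyRegisterUserBuffer is invented; error handling and the
multi-MDL case are omitted):

    NTSTATUS ProxyRegisterUserBuffer(PDMA_ADAPTER DmaAdapter,
                                     PVOID UserVa, ULONG Length,
                                     PSCATTER_GATHER_LIST *pSgl,
                                     HANDLE *phCommonBuffer)
    {
        /* describe the user buffer */
        PMDL Mdl = IoAllocateMdl(UserVa, Length, FALSE, FALSE, NULL);
        if (!Mdl)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            /* makes the pages resident and pins them;
             * does NOT map them into the kernel address space */
            MmProbeAndLockPages(Mdl, UserMode, IoModifyAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(Mdl);
            return STATUS_ACCESS_VIOLATION;
        }

        /* proposed API: make the locked pages a "common buffer" and get
         * back the logical (DMA) addresses for the HCA driver to program
         * into its translation table */
        *pSgl = IoMapCommonBuffer(DmaAdapter, Mdl, Length,
                                  FALSE /* CacheEnabled */,
                                  phCommonBuffer);
        if (*pSgl == NULL) {
            MmUnlockPages(Mdl);
            IoFreeMdl(Mdl);
            return STATUS_UNSUCCESSFUL;
        }
        return STATUS_SUCCESS;
    }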

> The idea with the above is that by the time the call gets to 
> the HCA driver, the HCA driver doesn't have to care where the 
> buffer came from or how it was allocated.
> 
> > I suggest using usual (pagable) buffers both in kernel and user
> > mode and letting IoMapCommonBuffer lock them and map them into the
> > DMA address space.  Maybe IoMapCommonBuffer will manage to do it
> > without building an MDL or mapping into the kernel.
> 
> An MDL list must be created to map buffers.  Why hide that 
> internally to the IoMapCommonBuffer API?  Look at 
> GetScatterGatherList - it takes an MDL list as input.  
> IoMapCommonBuffer keeps a similar look-and-feel to the existing APIs.
> 
> Remember, there's no restriction on where or how the buffers 
> are allocated.
> 
> Adding the IoMapCommonBuffer (and IoUnmapCommonBuffer too) 
> doesn't require any changes to the existing DMA model, so 
> existing drivers are unchanged.
> 
> Does that make more sense now?  Do you still see problems 
> with my proposal?

My conclusions are as follows:
1) I agree with your API;
2) I think that all drivers would benefit from an OUT Flags parameter
for the IoGetDmaAdapter function, but we can live without it;
3) I see 2 open questions for now:
	- whether *any* buffer can be converted into a Common Buffer in
your sense, i.e. made twice non-cachable;
	- whether MS will agree to do that (because it means that they
will need to allow a static (pre-)allocation of map registers instead
of dynamic allocation, which gives one *application* the possibility to
take all the mapping registers for itself and therefore cause total
system starvation!)

I can see possible answers to the second question, and they are not
connected to our API.
So we need to finalize our API according to our understanding of the
first question and send it to Microsoft to see their reaction.
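
To make the proposal concrete, the declarations we would send could look
roughly like this (my sketch only, assuming the MDL variant discussed
above; neither function exists in the current DDK):

    /* Proposed: make an existing, probed-and-locked buffer (or non-paged
     * pool) DMA-accessible, i.e. turn it into a "common buffer", and
     * return the SGL with its logical addresses. */
    PSCATTER_GATHER_LIST
    IoMapCommonBuffer(
        IN  PDMA_ADAPTER DmaAdapter,
        IN  PMDL Mdl,                /* probed and locked, or non-paged */
        IN  ULONG Length,
        IN  BOOLEAN CacheEnabled,    /* FALSE: non-cached for CPU and DMA */
        OUT HANDLE *phCommonBuffer
        );

    /* Proposed: undo the mapping and release whatever resources (map
     * registers etc.) were committed by IoMapCommonBuffer. */
    VOID
    IoUnmapCommonBuffer(
        IN HANDLE hCommonBuffer
        );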

> 
> Thanks,
> 
> - Fab
> 
> 


