[Openib-windows] Windows DMA model

Leonid Keller leonid at mellanox.co.il
Thu Jan 19 00:57:28 PST 2006


 

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com] 
> Sent: Wednesday, January 18, 2006 7:57 PM
> To: Leonid Keller; 'Jan Bottorff'; openib-windows at openib.org
> Subject: RE: [Openib-windows] Windows DMA model
> 
> > From: Leonid Keller [mailto:leonid at mellanox.co.il]
> > Sent: Wednesday, January 18, 2006 5:08 AM
> > 
> > Some thoughts about DMA API.
> > 
> > I find it very uncomfortable: it imposes a particular way of
> > working, with unrequested implicit actions, instead of providing a
> > minimal set of tools to do things as we'd like.
> > I believe InfiniBand is a *typical* solution in the sense that any
> > interconnect providing large bandwidth and low latency will need to
> > support scatter/gather and 64-bit addressing.
> 
> DMA support isn't broken at all for clients that perform 
> mappings on a per-request basis, which is adequate for most 
> existing hardware.  The existing DMA APIs support 
> scatter/gather and 64-bit addressing just fine.  The problem 
> with IB is that of memory registration, which is unique to 
> RDMA hardware.  The DMA APIs were not designed to allow 
> arbitrary buffers to be converted to common buffers (which is 
> really what we're looking for here).
> 

I didn't say it's broken; it's just uncomfortable.
They force one to build scatter/gather lists one doesn't need, they
force one to release map registers after every transfer, which is
inefficient, and they don't let one start a transfer immediately, but
start it themselves at their own discretion ...
Yes, I understand that they wanted to dispense the map registers fairly
among the consumers, but they don't do that for any other resource -
memory, handles, sync primitives et al. Why the heck do it here?
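
For reference, this is roughly the per-transfer dance the current API
imposes on us. A minimal sketch, assuming an already-built MDL; the
TRANSFER_CONTEXT structure, the HCA posting step and all error handling
are placeholders of mine, only the DMA_OPERATIONS calls are the real OS
interface:

/* per-transfer mapping with the existing WDM DMA API (sketch) */
#include <wdm.h>

typedef struct _TRANSFER_CONTEXT {
    PDMA_ADAPTER         DmaAdapter;
    PMDL                 Mdl;
    PSCATTER_GATHER_LIST SgList;
} TRANSFER_CONTEXT, *PTRANSFER_CONTEXT;

/* the OS calls this back only when it has map registers for us */
VOID XferListControl(PDEVICE_OBJECT DeviceObject, PIRP Irp,
                     PSCATTER_GATHER_LIST SgList, PVOID Context)
{
    PTRANSFER_CONTEXT ctx = (PTRANSFER_CONTEXT)Context;
    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Irp);

    /* flush CPU caches before the device reads the buffer */
    KeFlushIoBuffers(ctx->Mdl, FALSE, TRUE);

    ctx->SgList = SgList;
    /* ... post SgList->Elements[] to the HCA here ... */
}

NTSTATUS StartOneTransfer(PDEVICE_OBJECT DeviceObject,
                          PTRANSFER_CONTEXT ctx, PVOID Va, ULONG Length)
{
    /* the OS builds the SG list and decides when we may start */
    return ctx->DmaAdapter->DmaOperations->GetScatterGatherList(
        ctx->DmaAdapter, DeviceObject, ctx->Mdl, Va, Length,
        XferListControl, ctx, TRUE /* WriteToDevice */);
}

/* completion path: the map registers MUST be given back every time
   (this also performs the implicit FlushAdapterBuffers) */
VOID CompleteOneTransfer(PTRANSFER_CONTEXT ctx)
{
    ctx->DmaAdapter->DmaOperations->PutScatterGatherList(
        ctx->DmaAdapter, ctx->SgList, TRUE);
}

None of this can be driven from user mode, which is exactly the problem
for kernel bypass.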


> > Therefore it is worth requiring an additional/alternative API that
> > will eliminate these deficiencies while also enabling (as far as
> > possible) kernel bypass.
> 
> I don't think we can require anything like this, but we can 
> probably make a request for a future version of the OS.
> 
> > I believe 5 functions would be enough (let's call them NewDmaApi):
> > 
> >    IoGetDmaAdapterEx(*Adapter, *Flags, *NmapRegisters,...)
> >       Returns the Adapter object and capability Flags:
> >       IsDoubleBuffering, IsNewDmaApi, IsDmaCacheble.
> > 
> >    IoMapBufToDmaSpace(Adapter, Mode, PcuNonCachable,
> >                       DmaNonCachable, addr, size, ...);
> >       Locks the memory, allocates map registers and makes them
> >       uncacheable, if requested.
> >       Returns an array of logical addresses and an opaque
> >       DmaBuffer object.
> >       Pay attention:
> >          No need for memory allocation: there are malloc and
> >          ExAllocatePool(Paged).
> >          No need for mapping into kernel space: there is
> >          MmProbeAndLockPages.
> >          A NOT_ENOUGH_REGISTERS error is to be handled like a
> >          failure of memory allocation.
> > 
> >    IoUnmapBufFromDmaSpace(DmaBuffer)
> >       Undoes the previous call.
> > 
> >    KeFlushPcuBuffers(DmaBuffer)
> >       Flushes the CPU cache to the memory buffer.
> >       Used only once, if the buffer was mapped as PcuNonCachable.
> > 
> >    KeFlushDmaBuffers(DmaBuffer)
> >       Flushes the DMA cache to the memory buffer.
> >       Not used at all, if the buffer was successfully mapped as
> >       DmaNonCachable.
> > 
> > With such an API and cacheable map registers we could work both in
> > kernel and user mode without changing our model. It would also work
> > in a guest OS.
> 
> I don't think we need that many APIs - two should suffice:
> 

Actually, I've also suggested only two new APIs, the same as you.
The 2 flushing functions already exist today: KeFlushIoBuffers and
FlushAdapterBuffers, though the latter is called implicitly by
PutScatterGatherList.
I suggested a few small things that would facilitate our work without
affecting others:
	- an OUT Flags parameter to IoGetDmaAdapter, to prevent guessing
experiments about the capabilities of the DMA support;
	- an optional PcuNonCachable flag to make the user buffer
uncacheable and avoid calling KeFlushIoBuffers before every transfer;
	- an optional DmaNonCachable flag to turn off DMA caching, if
possible, and avoid calling FlushAdapterBuffers after every transfer
operation.
The latter 2 options save us 2 kernel calls on every transfer operation
of a user application, thus enabling a real kernel bypass (see the
sketch below).
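
To make the intent concrete, here is how a one-time registration could
look with the calls proposed above. Purely a sketch: IoGetDmaAdapterEx,
IoMapBufToDmaSpace, the opaque DmaBuffer handle and the exact parameter
list are the hypothetical NewDmaApi from my previous mail, not existing
kernel exports:

/* one-time registration of a (pageable) user buffer; afterwards the
   application posts work requests with no kernel calls on the data path */
NTSTATUS RegisterBuffer(PDMA_ADAPTER Adapter,        /* IoGetDmaAdapterEx() */
                        PVOID Va, SIZE_T Size,
                        PHYSICAL_ADDRESS **pLogical, /* out: DMA addresses  */
                        ULONG *pnPages,
                        PVOID *pDmaBuffer)           /* out: opaque handle  */
{
    /* locks the pages, allocates map registers and, if the platform
       allows it, makes both the CPU and the DMA view uncacheable, so
       neither KeFlushIoBuffers nor FlushAdapterBuffers is needed per
       transfer */
    return IoMapBufToDmaSpace(Adapter,
                              0,     /* Mode           */
                              TRUE,  /* PcuNonCachable */
                              TRUE,  /* DmaNonCachable */
                              Va, Size,
                              pLogical, pnPages, pDmaBuffer);
}

/* deregistration undoes everything in one call */
VOID DeregisterBuffer(PVOID DmaBuffer)
{
    IoUnmapBufFromDmaSpace(DmaBuffer);
}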

> SCATTER_GATHER_LIST IoMapCommonBuffer( DMA_ADAPTER *pAdapter,
> 	VOID *pBuf, ULONG Length, BOOLEAN CacheEnabled,
> 	HANDLE *phCommonBuffer );
> VOID IoUnmapCommonBuffer( HANDLE hCommonBuffer );
> 
> The pBuf parameter to IoMapCommonBuffer could be changed to 
> be an MDL list, which might make more sense.  The MDL would 
> be for a probed and locked buffer, or for non-paged pool.
> 

That's bad for user buffers, because it forces you first to map the
buffer into kernel space, which is not scalable. I suggest using
ordinary (pageable) buffers both in kernel and user mode and letting
IoMapCommonBuffer lock them and map them into the DMA address space.
Maybe IoMapCommonBuffer would manage to do it without building an MDL
or mapping into kernel space.
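
In other words, I'd like the registration path to stay as simple as
this. Again just a sketch of the usage I have in mind, taking your
proposed IoMapCommonBuffer/IoUnmapCommonBuffer prototypes and assuming
pBuf may be an ordinary, possibly pageable, user or kernel VA and that
the SG list comes back by pointer:

HANDLE               hCommonBuffer;
SCATTER_GATHER_LIST *pSgl;

/* the call itself probes/locks the pages and maps them for DMA;
   the caller builds no MDL and maps nothing into kernel space */
pSgl = IoMapCommonBuffer(pDmaAdapter,
                         pUserBuf,      /* plain (pageable) VA */
                         Length,
                         FALSE,         /* CacheEnabled        */
                         &hCommonBuffer);

/* ... hand pSgl->Elements[] to the HCA as the layout of the memory
   region; from now on the application posts directly to the HCA
   without entering the kernel ... */

/* deregistration: unlock the pages and release the map registers */
IoUnmapCommonBuffer(hCommonBuffer);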

> > Am I missing something ?
> 
> The only issue is that we don't have access to the OS source, 
> and can't just modify it.  If we come up with a good API, 
> then we might have some success in having it added in a 
> future version of the OS.  It's worth a shot, but we'll 
> probably need to implement with the current DMA APIs until a 
> better API is available.

Recall that the current API doesn't support kernel bypass.
That means we would have to run the whole application data path in the
kernel, which would significantly decrease performance.
Microsoft is going to compete with Linux on servers with IB support, so
they should be interested in improving their API.

> 
> It's worth discussing what we would want this API to look 
> like, and we can bring it up once we have a good idea of how 
> we want it to work.

Agreed.
Come up with your suggestions or comment on the above, and I'll suggest
something in a more elaborate way.

> 
> - Fab
> 
> 


