[Openib-windows] Windows DMA model

Tue Jan 17 03:58:13 PST 2006

First, to inform: I've found the problem with my kernel DMA tesing.
It seems like a bug in Microsoft: when you ask map registers for a 2GB
transfer - which is the max for our cards - IoGetDmaAdapter returns 1
register. For *any* number X, less than that, they return an appropriate
number.
It seems like gives us opportunity to solve all the problems for kernel
work.
As for the userland, I also don't see a solution, but giving up
os-bypassing and performing memory registration during send/recv.

> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com] 
> Sent: Thursday, October 20, 2005 6:18 AM
> To: 'Jan Bottorff'; openib-windows at openib.org
> Subject: RE: [Openib-windows] Windows DMA model
> 
> Hi Jan,
> 
> Thanks for the detailed information!
> 
> > >Note that support for virtualization will require a whole 
> lot of work 
> > >-
> > 
> > > to support kernel bypass in a virtual machine, where the 
> application 
> > > in user-mode in the virtual machine has to bypass both 
> the virtual 
> > > machine kernel as well as the host's kernel.
> > >  It would be great to figure out how to do this in Windows.
> > > I currently don't really have a clue, though.
> > 
> > I've looked at the Intel specs on their hardware virtualization and 
> > the host OS hypervisor traps setting register CR3, which 
> contains the 
> > physical address of the root page directory entries. The hardware 
> > virtulization can then set the actual page directories to 
> point to any 
> > actual physical address it wants (essentially the higher physical 
> > address bits used in translation), and fool the guest OS into 
> > believing it owns 0 to whatever physical memory. With a different 
> > guest OS HAL (or perhaps just a PCI bus filter driver), the 
> hypervisor 
> > can intercept physical address translations done through an adapter 
> > object. It offhand seems like it should be possible for the 
> hypervisor 
> > to allow one guest OS to own a device, without the device 
> drivers changing at all.
> 
> I think I agree with you here - a non-virtualization aware 
> driver should work fine as long as a device is only used by a 
> single OS, whether guest or host.
> 
> > The
> > driver will need to use an adapter object, to get the 
> correct actual 
> > bus address. MmGetPhysicalAddress will return what the OS thinks is 
> > the address, which because of the virtualized CR3 value, will not 
> > actually be the processor physical address. So bus address == 
> > processor physical address != guest OS physical address.
> 
> I agree we definitely want proper address mappings for memory 
> registrations here.  I don't know if there are better 
> functions than the DMA APIs, but those don't quite fit the 
> usage model of RDMA where memory is registered for long term 
> use and multiple I/O operations.
> 
> > > That's correct.  Kernel clients should definitely do all DMA 
> > > operations by the book.  The question is whether 
> registrations (both 
> > > user and kernel) should use the DMA mapping 
> functionality, or just 
> > > use the physical addresses from the MDL after it has been locked 
> > > down.  The former will result in verifier breaking anything that 
> > > uses registered memory, and the latter will result in 
> broken DMA due 
> > > to the assumption that CPU and bus addresses are consistent and 
> > > cache coherent.  I have doubts that kernel bypass could even work 
> > > without cache coherency, though.
> > 
> > I think the problem is assuming you can just register normal cached 
> > memeory, for either kernel or user mode. 
> AllocateCommonBuffer should 
> > "do the right thing" and know if things are cache coherent 
> or not. If 
> > it's memory on a card and is mapped uncached, there are no cache 
> > coherency issues (it's not coherent). Of course processor 
> read/write 
> > performance from uncached memory may not be as fast as from cached 
> > memory, although streaming copies might be pretty fast. 
> Kernel bypass 
> > seems ok provided you use memory from AllocateCommonBuffer 
> and don't 
> > try to change it's cache attributes in a mapping.
> 
> If we can't safely register normal memory then the Winsock 
> Direct infrastructure is fatally flawed - in the WSD model, 
> the WSD switch (part of mswsock.dll) will ask the WSD 
> provider (ibwsd.dll in our case) to register an application's 
> memory.  The application didn't make any special calls to 
> allocate this memory, just a standard malloc or HeapAlloc 
> call, so that memory is definitely not allocated via 
> AllocateCommonBuffer.
> 
> It would be great to find out where Microsoft stands on how 
> user-mode RDMA, and memory registration in general, is 
> supposed to interact with DMA APIs for non-consistent and 
> non-coherent environments.  I don't see how kernel bypass 
> could ever work properly without running on consistent and 
> coherent systems.
> 
> Perhaps having a way of marking registered memory as 
> non-cacheable on non-cache coherent systems, and then finding 
> a way to get the bus address for the physical addresses would 
> solve this.  However, it still doesn't help if memory needs 
> to be flushed out of the DMA controller (or CPU) without the 
> application explicitly flushing buffers.
> 
> If we can find a way to let memory registrations of arbitrary 
> virtual memory regions work properly with respect to DMA 
> mapping and cache coherency, we'll have solved all the issues I think.
> 
> > > For example, for internal ring buffers like those used 
> for CQE and 
> > > WQE rings, performing proper DMA mappings will break the 
> hardware if 
> > > verifier remaps these.
> > > I suppose a way around that is to allocate those buffers 
> one page at 
> > > a time with AllocateCommonBuffer, build up an MDL with the 
> > > underlying CPU physical pages using MmGetPhysicalAddress on the 
> > > returned virtual address, remap it to a contiguous virtual memory 
> > > region using MmMapLockedPagesSpecifyCache, and then use the bus 
> > > physical addresses originally returned by AllocateCommonBuffer to 
> > > program the HCA.  I don't know if this sequence would 
> work properly, 
> > > and it still doesn't solve the issue of an application 
> registering 
> > > its buffers.
> > 
> > The ring buffers should be in common memory allocated with 
> > AllocateCommonBuffer. You can't have the same physical 
> memory mapped 
> > as both cached and uncached, this is why 
> MmMapLockedPagesSpecifyCache 
> > exists.
> 
> For user-mode QPs and CQs, the ring buffers are allocated in 
> the application using malloc or HeapAlloc.  There aren't 
> special calls to the kernel to do the allocation.  Allocating 
> paged memory and pinning it isn't limited by the size of the 
> non-paged pool, either, so things scale a whole lot further.
> 
> > So why do you want to build a virtual address other than what 
> > AllocateCommonBuffer returns?
> 
> So that the application (whether kernel or user-mode) can 
> treat the ring buffer as a virtually contiguous region even 
> if it built it from multiple calls to AllocateCommonBuffer 
> for PAGE_SIZE.  Calling AllocateCommonBuffer at runtime for 
> large areas is likely to fail due to the inability of finding 
> a large contiguous region.
> 
> So for an 8K buffer, I envision two calls to 
> AllocateCommonBuffer for 4K, build an MDL with those physical 
> addresses, and then map that MDL into the virtual space of 
> the user to present a single virtual address.
> 
> > I have to admit the CacheEnabled parameter to 
> AllocateCommonBuffer is 
> > a little unclear but believe it's just a "hint", and if the 
> system is 
> > cache coherent for all I/O, you get cached memory. If the system is 
> > not cache coherent, you get uncached memory.
> > I'll ask MSFT what the real story is.
> > 
> > > Agreed, but any WHQL certifications require Microsoft to define a 
> > > WHQL certification process for InfiniBand devices.
> > 
> > It seems unknown if there will be IB specific WHQL tests. There ARE 
> > storage and network tests, which IB virtual drivers will 
> need to pass 
> > (which may be really hard).
> 
> Some of the tests just can't pass due to how IB presents 
> these devices.  For example, IPoIB reports itself as an 802.3 
> device, but Ethernet headers are never sent on the wire.  An 
> application that rolls its own Ethernet packets won't work 
> properly unless the target has been resolved via ARP first.
> 
> There are other limitations, and from my talks with Microsoft 
> about WHQL for IB, they would define what tests IB devices 
> are exempt from due to not being able to pass.
> 
> > The actual IB fabric driver may have to get certification as an 
> > "other" device.
> 
> I was pretty optimistic about the "other" device WHQL 
> program, but I heard that was being cancelled.
> 
> > Getting data center level WHQL
> > certification for everying may be extreemly hard. On the 
> other hand, 
> > iSCSI and TOE ethernet I do believe will have very official WHQL 
> > certification. My experience is 10 GbE TOE ethernet goes 
> pretty fast 
> > and current chips do RDMA and direct iSCSI and TCP DMA 
> transfers into 
> > appropriate buffers.
> 
> I'm hoping that IB attached storage (SRP or iSER) will fit 
> nicely into the iSCSI WHQL program.  I don't know how well 
> the RDMA and TCP chimney stuff will apply.
> It makes sense that things work properly for TOE devices - 
> the DMA mappings shouldn't be any different than non-TOE 
> devices.  Likewise, RDMA from properly mapped addresses (as 
> done in IPoIB and SRP) will also work fine for kernel 
> drivers.  However, I would expect iWARP device vendors that 
> supply a WSD provider to have the same issues with memory 
> registrations that IB has - that of needing to register 
> arbitrary user-allocated memory for DMA access.
> 
> > > How do you solve cache coherency issues without getting rid of 
> > > kernel bypass?
> > > Making calls to the kernel to flush the CPU or DMA controller 
> > > buffers for every user-mode I/O is going to take away the 
> benefits 
> > > of doing kernel bypass in the first place.  That's not to say we 
> > > won't come to this conclusion, I'm just throwing the 
> questions out 
> > > there.  I'm not expecting you to have the answers - they're just 
> > > questions that I don't know how to answer, and I appreciate the 
> > > discussion.
> > 
> > It's only a problem if you allow arbitrary buffers, if buffers are 
> > allocated in the "proper" way, it's not an issue. Your memory 
> > performance may be less on some systems, although those 
> systems will 
> > tend to be higher powered systems to start with (like 16/32 
> core SMP).
> 
> This gets back the WSD infrastructure issue I raised above.  
> I'm hoping that we're just missing something and that 
> Microsoft has already solved things.
> 
> > This message got rather longer than expected, sorry.
> 
> No worries - lots of good information.
> 
> Thanks!
> 
> - Fab
> 
> 
> _______________________________________________
> openib-windows mailing list
> openib-windows at openib.org
> http://openib.org/mailman/listinfo/openib-windows
>