[Openib-windows] Windows DMA model

Fab Tillier ftillier at silverstorm.com
Wed Oct 19 21:17:42 PDT 2005


Hi Jan,

Thanks for the detailed information!

> > Note that support for virtualization will require a whole lot of work -
> > to support kernel bypass in a virtual machine, where the application
> > in user-mode in the virtual machine has to bypass both the virtual
> > machine kernel as well as the host's kernel.  It would be great to
> > figure out how to do this in Windows.  I currently don't really have
> > a clue, though.
> 
> I've looked at the Intel specs on their hardware virtualization and the
> host OS hypervisor traps setting register CR3, which contains the
> physical address of the root page directory entries. The hardware
> virtualization can then set the actual page directories to point to any
> actual physical address it wants (essentially the higher physical
> address bits used in translation), and fool the guest OS into believing
> it owns 0 to whatever physical memory. With a different guest OS HAL (or
> perhaps just a PCI bus filter driver), the hypervisor can intercept
> physical address translations done through an adapter object. It offhand
> seems like it should be possible for the hypervisor to allow one guest
> OS to own a device, without the device drivers changing at all.

I think I agree with you here - a non-virtualization-aware driver should work
fine as long as a device is only used by a single OS, whether guest or host.

> The
> driver will need to use an adapter object, to get the correct actual bus
> address. MmGetPhysicalAddress will return what the OS thinks is the
> address, which because of the virtualized CR3 value, will not actually
> be the processor physical address. So bus address == processor physical
> address != guest OS physical address.

I agree we definitely want proper address mappings for memory registrations
here.  I don't know if there are better functions than the DMA APIs, but those
don't quite fit the usage model of RDMA, where memory is registered once for
long-term use across multiple I/O operations.
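
For reference, going through an adapter object means something like the sketch
below, rather than calling MmGetPhysicalAddress directly.  The function name
and the DEVICE_DESCRIPTION values are only illustrative, not what any of our
drivers actually do:

    #include <ntddk.h>

    /* Sketch only: bus addresses come from the adapter object (via
     * AllocateCommonBuffer, MapTransfer, etc.), while MmGetPhysicalAddress
     * only reports what the (possibly guest) OS believes the processor
     * physical address to be.
     */
    PDMA_ADAPTER
    get_hca_dma_adapter(
        IN  PDEVICE_OBJECT  pdo )   /* physical device object of the HCA */
    {
        DEVICE_DESCRIPTION  desc;
        ULONG               map_regs;

        RtlZeroMemory( &desc, sizeof(desc) );
        desc.Version = DEVICE_DESCRIPTION_VERSION;
        desc.Master = TRUE;             /* the HCA is a bus master */
        desc.ScatterGather = TRUE;
        desc.Dma64BitAddresses = TRUE;
        desc.InterfaceType = PCIBus;
        desc.MaximumLength = (ULONG)-1;

        return IoGetDmaAdapter( pdo, &desc, &map_regs );
    }

Everything the HCA gets programmed with would then have to come through that
adapter, which is exactly where the per-transfer model of the DMA APIs stops
fitting long-lived registrations.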

> > That's correct.  Kernel clients should definitely do all DMA
> > operations by the book.  The question is whether
> > registrations (both user and kernel) should use the DMA
> > mapping functionality, or just use the physical addresses
> > from the MDL after it has been locked down.  The former will
> > result in verifier breaking anything that uses registered
> > memory, and the latter will result in broken DMA due to the
> > assumption that CPU and bus addresses are consistent and
> > cache coherent.  I have doubts that kernel bypass could even
> > work without cache coherency, though.
> 
> I think the problem is assuming you can just register normal cached
> memory, for either kernel or user mode. AllocateCommonBuffer should "do
> the right thing" and know if things are cache coherent or not. If it's
> memory on a card and is mapped uncached, there are no cache coherency
> issues (it's not coherent). Of course processor read/write performance
> from uncached memory may not be as fast as from cached memory, although
> streaming copies might be pretty fast. Kernel bypass seems ok provided
> you use memory from AllocateCommonBuffer and don't try to change its
> cache attributes in a mapping.

If we can't safely register normal memory, then the Winsock Direct
infrastructure is fatally flawed - in the WSD model, the WSD switch (part of
mswsock.dll) asks the WSD provider (ibwsd.dll in our case) to register an
application's memory.  The application didn't make any special calls to
allocate this memory - just a standard malloc or HeapAlloc call - so that
memory is definitely not allocated via AllocateCommonBuffer.
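
To make the problem concrete, the kernel side of a registration today has to
do roughly the following for a buffer the application got from plain malloc.
The function name is made up; the point is that the HCA ends up programmed
with PFNs taken straight from the MDL, with no bus-address translation and no
flush/sync step anywhere in the I/O path:

    #include <ntddk.h>

    /* Sketch of pinning an arbitrary user buffer for registration. */
    NTSTATUS
    pin_user_buffer(
        IN  PVOID   user_va,
        IN  ULONG   length,
        OUT PMDL    *pp_mdl )
    {
        PMDL    p_mdl;

        p_mdl = IoAllocateMdl( user_va, length, FALSE, FALSE, NULL );
        if( !p_mdl )
            return STATUS_INSUFFICIENT_RESOURCES;

        __try
        {
            /* Lock the pages on behalf of the user-mode caller. */
            MmProbeAndLockPages( p_mdl, UserMode, IoWriteAccess );
        }
        __except( EXCEPTION_EXECUTE_HANDLER )
        {
            IoFreeMdl( p_mdl );
            return STATUS_ACCESS_VIOLATION;
        }

        /* The HCA is then programmed with MmGetMdlPfnArray( p_mdl ), i.e.
         * what the OS thinks are CPU physical pages, not bus addresses
         * obtained through a DMA adapter.
         */
        *pp_mdl = p_mdl;
        return STATUS_SUCCESS;
    }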

It would be great to find out where Microsoft stands on how user-mode RDMA,
and memory registration in general, is supposed to interact with the DMA APIs
in non-consistent, non-cache-coherent environments.  I don't see how kernel
bypass could ever work properly without running on consistent and coherent
systems.

Perhaps having a way of marking registered memory as non-cacheable on
non-cache-coherent systems, and then finding a way to get the bus addresses
for the underlying physical pages, would solve this.  However, that still
doesn't help if data needs to be flushed out of the DMA controller (or the CPU
caches) without the application explicitly flushing its buffers.

If we can find a way to let memory registrations of arbitrary virtual memory
regions work properly with respect to DMA mapping and cache coherency, I think
we'll have solved all the issues.

> > For example, for internal ring buffers like those used for
> > CQE and WQE rings, performing proper DMA mappings will break
> > the hardware if verifier remaps these.
> > I suppose a way around that is to allocate those buffers one
> > page at a time with AllocateCommonBuffer, build up an MDL
> > with the underlying CPU physical pages using
> > MmGetPhysicalAddress on the returned virtual address, remap
> > it to a contiguous virtual memory region using
> > MmMapLockedPagesSpecifyCache, and then use the bus physical
> > addresses originally returned by AllocateCommonBuffer to
> > program the HCA.  I don't know if this sequence would work
> > properly, and it still doesn't solve the issue of an
> > application registering its buffers.
> 
> The ring buffers should be in common memory allocated with
> AllocateCommonBuffer. You can't have the same physical memory mapped as
> both cached and uncached, this is why MmMapLockedPagesSpecifyCache
> exists.

For user-mode QPs and CQs, the ring buffers are allocated in the application
using malloc or HeapAlloc.  There aren't special calls to the kernel to do the
allocation.  Allocating paged memory and pinning it isn't limited by the size of
the non-paged pool, either, so things scale a whole lot further.

> So why do you want to build a virtual address other than what
> AllocateCommonBuffer returns?

So that the client (whether kernel or user-mode) can treat the ring buffer as
a virtually contiguous region even if it was built from multiple PAGE_SIZE
calls to AllocateCommonBuffer.  Calling AllocateCommonBuffer at runtime for
large areas is likely to fail because a sufficiently large physically
contiguous region may not be available.

So for an 8K buffer, I envision two 4K calls to AllocateCommonBuffer, building
an MDL from the resulting physical addresses, and then mapping that MDL into
the user's address space to present a single contiguous virtual address.
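
Something like the sketch below, assuming 'adapter' is the HCA's DMA_ADAPTER.
I'll say up front that filling in an MDL's PFN array by hand is not a
documented technique, so this only illustrates the sequence - it isn't a
blessed recipe:

    #include <ntddk.h>

    #define RING_PAGES  2   /* 8K ring built from two 4K common buffers */

    /* Sketch: allocate the ring one page at a time, then present it as one
     * virtually contiguous region.  The HCA would be programmed with
     * chunk_pa[] (the bus-relative addresses from AllocateCommonBuffer);
     * the MDL is only there to get a single contiguous mapping for the host.
     */
    PVOID
    map_ring_contiguous(
        IN  PDMA_ADAPTER        adapter,
        OUT PHYSICAL_ADDRESS    chunk_pa[RING_PAGES],
        OUT PMDL                *pp_mdl )
    {
        PVOID           chunk_va[RING_PAGES];
        PMDL            p_mdl;
        PPFN_NUMBER     pfn;
        ULONG           i;

        for( i = 0; i < RING_PAGES; i++ )
        {
            chunk_va[i] = adapter->DmaOperations->AllocateCommonBuffer(
                adapter, PAGE_SIZE, &chunk_pa[i],
                TRUE /* CacheEnabled - see the discussion below */ );
            if( !chunk_va[i] )
                return NULL;    /* cleanup omitted for brevity */
        }

        /* MDL covering the whole ring; page 0's VA only seeds the offset. */
        p_mdl = IoAllocateMdl( chunk_va[0], RING_PAGES * PAGE_SIZE,
            FALSE, FALSE, NULL );
        if( !p_mdl )
            return NULL;

        /* Fill in the CPU physical page behind each common-buffer chunk. */
        pfn = MmGetMdlPfnArray( p_mdl );
        for( i = 0; i < RING_PAGES; i++ )
        {
            pfn[i] = (PFN_NUMBER)
                ( MmGetPhysicalAddress( chunk_va[i] ).QuadPart >> PAGE_SHIFT );
        }
        p_mdl->MdlFlags |= MDL_PAGES_LOCKED;

        *pp_mdl = p_mdl;

        /* The caching type must match whatever AllocateCommonBuffer really
         * used - which is part of the open question here.  A user-mode
         * mapping would pass UserMode and wrap the call in __try/__except.
         */
        return MmMapLockedPagesSpecifyCache( p_mdl, KernelMode, MmCached,
            NULL, FALSE, NormalPagePriority );
    }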

> I have to admit the CacheEnabled parameter
> to AllocateCommonBuffer is a little unclear, but I believe it's just a
> "hint", and if the system is cache coherent for all I/O, you get cached
> memory. If the system is not cache coherent, you get uncached memory.
> I'll ask MSFT what the real story is.
> 
> > Agreed, but any WHQL certifications require Microsoft to
> > define a WHQL certification process for InfiniBand devices.
> 
> It seems unknown if there will be IB-specific WHQL tests. There ARE
> storage and network tests, which IB virtual drivers will need to pass
> (which may be really hard).

Some of the tests just can't be passed, given how IB presents these devices.
For example, IPoIB reports itself as an 802.3 device, but Ethernet headers are
never sent on the wire.  An application that rolls its own Ethernet packets
won't work properly unless the target has first been resolved via ARP.

There are other limitations as well, and from my talks with Microsoft about
WHQL for IB, they would define which tests IB devices are exempt from because
they cannot pass them.

> The actual IB fabric driver may have to get
> certification as an "other" device.

I was pretty optimistic about the "other" device WHQL program, but I heard that
was being cancelled.

> Getting data center level WHQL
> certification for everything may be extremely hard. On the other hand,
> iSCSI and TOE ethernet I do believe will have very official WHQL
> certification. My experience is 10 GbE TOE ethernet goes pretty fast and
> current chips do RDMA and direct iSCSI and TCP DMA transfers into
> appropriate buffers.

I'm hoping that IB-attached storage (SRP or iSER) will fit nicely into the
iSCSI WHQL program.  I don't know how well the RDMA and TCP chimney stuff will
apply.  It makes sense that things work properly for TOE devices - the DMA
mappings shouldn't be any different from those of non-TOE devices.  Likewise,
RDMA from properly mapped addresses (as done in IPoIB and SRP) will also work
fine for kernel drivers.  However, I would expect iWARP device vendors that
supply a WSD provider to have the same memory registration issue that IB has -
needing to register arbitrary user-allocated memory for DMA access.

> > How do you solve cache coherency issues without getting rid
> > of kernel bypass?
> > Making calls to the kernel to flush the CPU or DMA controller
> > buffers for every user-mode I/O is going to take away the
> > benefits of doing kernel bypass in the first place.  That's
> > not to say we won't come to this conclusion, I'm just
> > throwing the questions out there.  I'm not expecting you to
> > have the answers - they're just questions that I don't know
> > how to answer, and I appreciate the discussion.
> 
> It's only a problem if you allow arbitrary buffers, if buffers are
> allocated in the "proper" way, it's not an issue. Your memory
> performance may be less on some systems, although those systems will
> tend to be higher powered systems to start with (like 16/32 core SMP).

This gets back to the WSD infrastructure issue I raised above.  I'm hoping
that we're just missing something and that Microsoft has already solved this.

> This message got rather longer than expected, sorry.

No worries - lots of good information.

Thanks!

- Fab




