[Openib-windows] Windows DMA model

Jan Bottorff jbottorff at xsigo.com
Wed Oct 19 18:50:23 PDT 2005


>Note that support for virtualization will require a whole lot of work -

> to support kernel bypass in a virtual machine, where the 
> application in user-mode in the virtual machine has to bypass 
> both the virtual machine kernel as well as the host's kernel. 
>  It would be great to figure out how to do this in Windows.  
> I currently don't really have a clue, though.

I've looked at the Intel specs on their hardware virtualization: the
host OS hypervisor traps setting register CR3, which contains the
physical address of the root page directory entries. The hardware
virtualization can then set the actual page directories to point to any
actual physical address it wants (essentially the higher physical
address bits used in translation) and fool the guest OS into believing
it owns 0 to whatever physical memory. With a different guest OS HAL
(or perhaps just a PCI bus filter driver), the hypervisor can intercept
physical address translations done through an adapter object. Offhand,
it seems like it should be possible for the hypervisor to allow one
guest OS to own a device without the device drivers changing at all.
The driver will need to use an adapter object to get the correct actual
bus address. MmGetPhysicalAddress will return what the OS thinks is the
address, which, because of the virtualized CR3 value, will not actually
be the processor physical address. So bus address == processor physical
address != guest OS physical address.
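
For reference, the "by the book" path is to get bus addresses from the
adapter object. Here's a rough, untested sketch (function names are
made up, error handling is trimmed) of translating a locked-down MDL
with GetScatterGatherList instead of calling MmGetPhysicalAddress:

#include <ntddk.h>

/* Illustrative sketch only -- the routine names are invented. */

/* Called back by the HAL once the scatter/gather list is built.  Every
   Elements[i].Address is a bus-logical address, which is what gets
   programmed into the HCA -- not MmGetPhysicalAddress() of the buffer. */
VOID
SgListReady(
    IN PDEVICE_OBJECT DeviceObject,
    IN PIRP Irp,
    IN PSCATTER_GATHER_LIST SgList,
    IN PVOID Context
    )
{
    ULONG i;

    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Irp);
    UNREFERENCED_PARAMETER(Context);

    for (i = 0; i < SgList->NumberOfElements; i++) {
        /* ... hand Elements[i].Address / Elements[i].Length to the HCA ... */
    }
}

NTSTATUS
MapBufferForDevice(
    IN PDEVICE_OBJECT Pdo,     /* PDO handed up by the PCI bus driver */
    IN PDEVICE_OBJECT DevObj,  /* our functional device object */
    IN PMDL Mdl                /* buffer already locked with MmProbeAndLockPages */
    )
{
    DEVICE_DESCRIPTION desc;
    PDMA_ADAPTER adapter;
    ULONG mapRegs;

    RtlZeroMemory(&desc, sizeof(desc));
    desc.Version = DEVICE_DESCRIPTION_VERSION;
    desc.Master = TRUE;                  /* bus-master DMA */
    desc.ScatterGather = TRUE;
    desc.Dma64BitAddresses = TRUE;
    desc.InterfaceType = PCIBus;
    desc.MaximumLength = MmGetMdlByteCount(Mdl);

    adapter = IoGetDmaAdapter(Pdo, &desc, &mapRegs);
    if (adapter == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Must be called at IRQL == DISPATCH_LEVEL. */
    return adapter->DmaOperations->GetScatterGatherList(
               adapter, DevObj, Mdl,
               MmGetMdlVirtualAddress(Mdl), MmGetMdlByteCount(Mdl),
               SgListReady, NULL, FALSE /* device -> memory */);
}

The HAL hands the callback bus-logical addresses, so the same driver
code keeps working whether or not a hypervisor or IOMMU is remapping
things underneath.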

I think a carefully designed group of drivers based on a virtual bus
and virtual function drivers should be able to share a device between
multiple virtual machines. This would require the virtual bus to run on
the host OS and the virtual function drivers to run on the guest OSes.
The host and guest OS would need to handle the virtualization boundary
between the bus and function drivers. If Microsoft is currently writing
this, it might be pretty easy (a relative term) for Windows guest OSes
to support this.

I see Microsoft has job openings for multiple virtualization positions,
and from reading the descriptions it sounds like they're working pretty
hard on serious virtualization support.

> That's correct.  Kernel clients should definitely do all DMA 
> operations by the book.  The question is whether 
> registrations (both user and kernel) should use the DMA 
> mapping functionality, or just use the physical addresses 
> from the MDL after it has been locked down.  The former will 
> result in verifier breaking anything that uses registered 
> memory, and the latter will result in broken DMA due to the 
> assumption that CPU and bus addresses are consistent and 
> cache coherent.  I have doubts that kernel bypass could even 
> work without cache coherency, though.

I think the problem is assuming you can just register normal cached
memory, for either kernel or user mode. AllocateCommonBuffer should "do
the right thing" and know if things are cache coherent or not. If it's
memory on a card and is mapped uncached, there are no cache coherency
issues (it's not coherent). Of course processor read/write performance
from uncached memory may not be as fast as from cached memory, although
streaming copies might be pretty fast. Kernel bypass seems ok provided
you use memory from AllocateCommonBuffer and don't try to change its
cache attributes in a mapping.
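
Roughly what I have in mind, as an untested sketch (RegisterWithHca is
made up and stands in for whatever verbs call actually does the
registration):

#include <ntddk.h>

typedef struct _BYPASS_REGION {
    PVOID            Va;       /* CPU mapping returned by AllocateCommonBuffer */
    PHYSICAL_ADDRESS BusAddr;  /* bus-logical address to register with the HCA */
    ULONG            Length;
} BYPASS_REGION;

NTSTATUS
AllocBypassRegion(
    IN PDMA_ADAPTER Adapter,   /* from IoGetDmaAdapter */
    IN ULONG Length,
    OUT BYPASS_REGION *Region
    )
{
    /* CacheEnabled is, as far as I can tell, a hint: on a coherent
       machine you get cached memory, on a non-coherent one you get
       uncached memory.  Either way, use this mapping as-is and don't
       remap the pages with different cache attributes. */
    Region->Va = Adapter->DmaOperations->AllocateCommonBuffer(
                     Adapter, Length, &Region->BusAddr, TRUE);
    if (Region->Va == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    Region->Length = Length;

    /* Register Region->BusAddr / Length with the HCA here; the call is
       whatever your verbs layer provides (made up, so left as a comment):
       RegisterWithHca(Region->BusAddr, Region->Length); */
    return STATUS_SUCCESS;
}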

> For example, for internal ring buffers like those used for 
> CQE and WQE rings, performing proper DMA mappings will break 
> the hardware if verifier remaps these.
> I suppose a way around that is to allocate those buffers one 
> page at a time with AllocateCommonBuffer, build up an MDL 
> with the underlying CPU physical pages using 
> MmGetPhysicalAddress on the returned virtual address, remap 
> it to a contiguous virtual memory region using 
> MmMapLockedPagesSpecifyCache, and then use the bus physical 
> addresses originally returned by AllocateCommonBuffer to 
> program the HCA.  I don't know if this sequence would work 
> properly, and it still doesn't solve the issue of an 
> application registering its buffers.

The ring buffers should be in common memory allocated with
AllocateCommonBuffer. You can't have the same physical memory mapped as
both cached and uncached; this is why MmMapLockedPagesSpecifyCache
exists. So why do you want to build a virtual address other than what
AllocateCommonBuffer returns? I have to admit the CacheEnabled
parameter to AllocateCommonBuffer is a little unclear, but I believe
it's just a "hint": if the system is cache coherent for all I/O, you
get cached memory, and if the system is not cache coherent, you get
uncached memory. I'll ask MSFT what the real story is.
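
In other words, something like this untested sketch (the entry size and
count are made-up numbers). One AllocateCommonBuffer call covers the
whole ring, and the buffer comes back contiguous in both virtual and
bus-logical space, so there's no need to allocate a page at a time or
rebuild an MDL:

#include <ntddk.h>

#define RING_ENTRY_SIZE   64      /* hypothetical CQE size */
#define RING_ENTRY_COUNT  1024    /* hypothetical ring depth */

typedef struct _HW_RING {
    PUCHAR           Base;     /* virtual address, used exactly as returned */
    PHYSICAL_ADDRESS BusBase;  /* bus-logical address, programmed into the HCA */
    ULONG            Size;
} HW_RING;

NTSTATUS
CreateRing(
    IN PDMA_ADAPTER Adapter,
    OUT HW_RING *Ring
    )
{
    Ring->Size = RING_ENTRY_SIZE * RING_ENTRY_COUNT;

    /* One allocation for the whole ring: it comes back contiguous in
       both virtual and bus-logical space, so no per-page allocation,
       no rebuilt MDL, no MmMapLockedPagesSpecifyCache. */
    Ring->Base = (PUCHAR) Adapter->DmaOperations->AllocateCommonBuffer(
                     Adapter, Ring->Size, &Ring->BusBase, TRUE);
    if (Ring->Base == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    RtlZeroMemory(Ring->Base, Ring->Size);

    /* Entry i lives at Ring->Base + i * RING_ENTRY_SIZE for the CPU and
       at Ring->BusBase.QuadPart + i * RING_ENTRY_SIZE for the HCA; the
       two stay in lock-step because the buffer is contiguous.  Program
       Ring->BusBase into the HCA's ring base register (hardware
       specific, so not shown). */
    return STATUS_SUCCESS;
}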

> Agreed, but any WHQL certifications require Microsoft to 
> define a WHQL certification process for InfiniBand devices.  

It seems unknown if there will be IB specific WHQL tests. There ARE
storage and network tests, which IB virtual drivers will need to pass
(which may be really hard). The actual IB fabric driver may have to get
certification as an "other" device. Getting data center level WHQL
certification for everything may be extremely hard. On the other hand,
I do believe iSCSI and TOE Ethernet will have very official WHQL
certification. My experience is 10 GbE TOE Ethernet goes pretty fast,
and current chips do RDMA and direct iSCSI and TCP DMA transfers into
appropriate buffers.

> Aren't the AMD processors cache coherent, even in their NUMA 
> architecture?

I think on smaller systems things are always coherent, on bigger
systems maybe not. I also know some (many?) AMD systems have a thing
called the IOMMU, which is basically REAL map registers. I think 4
cores will be the low end (up to a few K$) server configuration in a
year (or less). Up to I think 8 cores (??) uses the built-in SMP
architecture; I believe above that you need HyperTransport crossbar
switch things. At some point, the overhead of making ALL memory
coherent becomes a serious bottleneck and you have coherency domains,
which may take software help to manage. Since IB tends to be targeted
at higher end systems, being incompatible with larger systems seems
like a problem.
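
For what it's worth, packet-based DMA drivers already go through map
registers, which is exactly the thing an IOMMU would back with real
hardware. A rough, untested sketch of that path (names are made up,
cleanup and error handling are omitted):

#include <ntddk.h>

/* Illustrative sketch only -- names and the transfer direction are
   invented. */

typedef struct _XFER_CTX {
    PDMA_ADAPTER Adapter;
    PMDL         Mdl;
} XFER_CTX;

/* Runs once the HAL has map registers available for this transfer. */
IO_ALLOCATION_ACTION
AdapterReady(
    IN PDEVICE_OBJECT DeviceObject,
    IN PIRP Irp,
    IN PVOID MapRegisterBase,
    IN PVOID Context
    )
{
    XFER_CTX *ctx = (XFER_CTX *) Context;
    ULONG length = MmGetMdlByteCount(ctx->Mdl);
    PHYSICAL_ADDRESS busAddr;

    UNREFERENCED_PARAMETER(DeviceObject);
    UNREFERENCED_PARAMETER(Irp);

    /* MapTransfer returns the address as seen through the map
       registers: with an IOMMU that's a genuine hardware translation,
       on a flat machine it may equal the CPU physical address.  Either
       way, only busAddr is handed to the device. */
    busAddr = ctx->Adapter->DmaOperations->MapTransfer(
                  ctx->Adapter, ctx->Mdl, MapRegisterBase,
                  MmGetMdlVirtualAddress(ctx->Mdl), &length,
                  TRUE /* memory -> device */);

    /* ... program busAddr/length into the device and start the DMA ... */

    return DeallocateObjectKeepRegisters;  /* bus-master keeps its registers */
}

NTSTATUS
StartTransfer(
    IN PDEVICE_OBJECT DevObj,
    IN XFER_CTX *Ctx,
    IN ULONG NumMapRegs    /* from IoGetDmaAdapter */
    )
{
    /* Must be called at IRQL == DISPATCH_LEVEL. */
    return Ctx->Adapter->DmaOperations->AllocateAdapterChannel(
               Ctx->Adapter, DevObj, NumMapRegs, AdapterReady, Ctx);
}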

Another area where processor address != bus address is assuring
security of kernel memory. There are laws (the DMCA) in the US that say
computer makers will need to assure copyrighted material (i.e.
music/movies...) does not have its copy protection bypassed. This is a
sticky problem for general purpose computers that allow OS access to
all of memory. The movie companies basically don't want HDTV copies of
their intellectual property showing up on the Internet. One solution is
the OS controls access to "protected" areas of memory, and simply will
not let you DMA data from those areas. This would require some hardware
support (IOMMU?) to enforce. By default, the IOMMU could be set to not
pass ANY addresses from the I/O bus, and ONLY pass addresses sanctioned
by the OS. In this case bus address == processor address, with bounds.
This may also be a reason for virtualization, even if you only have a
single guest OS. If you do this, "protected" memory is simply
inaccessible in kernel code. Only "trusted" organizations may get to
write hypervisors, and the CPU or other hardware can require the
hypervisor to be digitally signed. As a driver developer, it's a pretty
ugly law. Microsoft probably likes it because hardware may evolve such
that Linux will simply not run on common systems, as it's not signed by
the people who verify compliance with the DMCA. Or, a less radical
view: protected content will only be accessible on systems that support
DMCA protection.

Business may also want this virtualization for protection of data from
viruses and such, too.

> How do you solve cache coherency issues without getting rid 
> of kernel bypass?
> Making calls to the kernel to flush the CPU or DMA controller 
> buffers for every user-mode I/O is going to take away the 
> benefits of doing kernel bypass in the first place.  That's 
> not to say we won't come to this conclusion, I'm just 
> throwing the questions out there.  I'm not expecting you to 
> have the answers - they're just questions that I don't know 
> how to answer, and I appreciate the discussion.

It's only a problem if you allow arbitrary buffers; if buffers are
allocated in the "proper" way, it's not an issue. Your memory
performance may be less on some systems, although those systems will
tend to be higher powered systems to start with (like 16/32 core SMP).
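
To make the cost concrete: an arbitrary buffer locked with
MmProbeAndLockPages needs something like the following around every
transfer on a non-coherent machine, and it has to run in the kernel
(sketch only; the exact flush placement depends on the transfer
direction and the HAL):

#include <ntddk.h>

/* Illustrative sketch -- the wrapper name is made up.  For a buffer the
   application handed us, the caches have to be made consistent with the
   device around every transfer, and that has to happen in the kernel.
   KeFlushIoBuffers is a no-op on fully coherent hardware, but on a
   non-coherent machine it's exactly the per-I/O kernel call that
   defeats user-mode bypass. */
VOID
PrepareArbitraryBufferForDma(
    IN PMDL Mdl,
    IN BOOLEAN DeviceWritesMemory   /* TRUE for device -> memory */
    )
{
    KeFlushIoBuffers(Mdl,
                     DeviceWritesMemory /* ReadOperation */,
                     TRUE /* DmaOperation */);
}

Memory from AllocateCommonBuffer already has cache attributes the HAL
knows are safe for the platform, so the bypass path never needs that
call, which is why the "allocate it the proper way" rule keeps kernel
bypass viable.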

This message got rather longer than expected, sorry.

- Jan


