[Openib-windows] Windows DMA model
Leonid Keller
leonid at mellanox.co.il
Tue Jan 17 03:58:13 PST 2006
First, to inform: I've found the cause of the problem with my kernel
DMA testing.
It seems to be a bug in Microsoft's code: when you ask for map
registers for a 2GB transfer - which is the maximum for our cards -
IoGetDmaAdapter returns 1 register. For *any* length less than that,
it returns an appropriate number.
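
For reference, here is roughly the query that shows the behavior (a
sketch, not our exact driver code; the device description values are
illustrative):

NTSTATUS QueryMapRegisters(PDEVICE_OBJECT Pdo)
{
    DEVICE_DESCRIPTION desc;
    PDMA_ADAPTER adapter;
    ULONG nMapRegs = 0;

    RtlZeroMemory(&desc, sizeof(desc));
    desc.Version = DEVICE_DESCRIPTION_VERSION2;
    desc.Master = TRUE;                /* bus-mastering HCA */
    desc.ScatterGather = TRUE;
    desc.Dma64BitAddresses = TRUE;
    desc.InterfaceType = PCIBus;
    desc.MaximumLength = 0x80000000;   /* a 2GB transfer */

    adapter = IoGetDmaAdapter(Pdo, &desc, &nMapRegs);
    if (adapter == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* With MaximumLength = 2GB, nMapRegs comes back as 1; for any
     * smaller MaximumLength it is proportional to the length. */
    adapter->DmaOperations->PutDmaAdapter(adapter);
    return STATUS_SUCCESS;
}
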
This seems to give us an opportunity to solve all the problems for
kernel work.
As for userland, I still don't see a solution other than giving up
OS bypass and performing memory registration during send/recv.
> -----Original Message-----
> From: Fab Tillier [mailto:ftillier at silverstorm.com]
> Sent: Thursday, October 20, 2005 6:18 AM
> To: 'Jan Bottorff'; openib-windows at openib.org
> Subject: RE: [Openib-windows] Windows DMA model
>
> Hi Jan,
>
> Thanks for the detailed information!
>
> > > Note that support for virtualization will require a whole lot
> > > of work - to support kernel bypass in a virtual machine, where
> > > the application in user-mode in the virtual machine has to
> > > bypass both the virtual machine kernel as well as the host's
> > > kernel.
> > > It would be great to figure out how to do this in Windows.
> > > I currently don't really have a clue, though.
> >
> > I've looked at the Intel specs on their hardware virtualization:
> > the host OS hypervisor traps writes to register CR3, which
> > contains the physical address of the root page directory entries.
> > The hardware virtualization can then point the actual page
> > directories at any physical address it wants (essentially the
> > higher physical address bits used in translation) and fool the
> > guest OS into believing it owns physical memory from 0 to
> > whatever. With a different guest OS HAL (or perhaps just a PCI
> > bus filter driver), the hypervisor can intercept physical address
> > translations done through an adapter object. Offhand, it seems
> > like it should be possible for the hypervisor to allow one guest
> > OS to own a device without the device drivers changing at all.
>
> I think I agree with you here - a non-virtualization aware
> driver should work fine as long as a device is only used by a
> single OS, whether guest or host.
>
> > The driver will need to use an adapter object to get the correct
> > actual bus address. MmGetPhysicalAddress will return what the OS
> > thinks is the address, which, because of the virtualized CR3
> > value, will not actually be the processor physical address. So
> > bus address == processor physical address != guest OS physical
> > address.
>
> I agree we definitely want proper address mappings for memory
> registrations here. I don't know if there are better
> functions than the DMA APIs, but those don't quite fit the
> usage model of RDMA where memory is registered for long term
> use and multiple I/O operations.
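>
> For contrast, a by-the-book mapping through the adapter object is
> transfer-scoped and callback-based - something like this sketch
> (XferDone and context are made-up names):
>
> /* Completion callback that receives the mapped list. */
> DRIVER_LIST_CONTROL XferDone;
>
> /* Map a single transfer; the SCATTER_GATHER_LIST handed to
>  * XferDone is only valid until PutScatterGatherList - fine for
>  * packet I/O, a poor fit for a registration that must stay valid
>  * across many I/Os. */
> status = adapter->DmaOperations->GetScatterGatherList(
>     adapter, devObj, mdl,
>     MmGetMdlVirtualAddress(mdl), MmGetMdlByteCount(mdl),
>     XferDone, context, TRUE /* WriteToDevice */);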
>
> > > That's correct. Kernel clients should definitely do all DMA
> > > operations by the book. The question is whether registrations
> > > (both user and kernel) should use the DMA mapping
> > > functionality, or just use the physical addresses from the MDL
> > > after it has been locked down. The former will result in
> > > verifier breaking anything that uses registered memory, and the
> > > latter will result in broken DMA due to the assumption that CPU
> > > and bus addresses are consistent and cache coherent. I have
> > > doubts that kernel bypass could even work without cache
> > > coherency, though.
> >
> > I think the problem is assuming you can just register normal
> > cached memory, for either kernel or user mode.
> > AllocateCommonBuffer should "do the right thing" and know if
> > things are cache coherent or not. If it's memory on a card and is
> > mapped uncached, there are no cache coherency issues (it's not
> > coherent). Of course, processor read/write performance from
> > uncached memory may not be as fast as from cached memory,
> > although streaming copies might be pretty fast. Kernel bypass
> > seems OK provided you use memory from AllocateCommonBuffer and
> > don't try to change its cache attributes in a mapping.
>
> If we can't safely register normal memory then the Winsock
> Direct infrastructure is fatally flawed - in the WSD model,
> the WSD switch (part of mswsock.dll) will ask the WSD
> provider (ibwsd.dll in our case) to register an application's
> memory. The application didn't make any special calls to
> allocate this memory, just a standard malloc or HeapAlloc
> call, so that memory is definitely not allocated via
> AllocateCommonBuffer.
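>
> To make the constraint concrete, the kernel side of a registration
> ends up doing something like this with the application's buffer (a
> sketch of the technique, not the actual ibwsd.dll code; userVa and
> length come from the registration request):
>
> PMDL mdl;
> PPFN_NUMBER pfns;
>
> /* Pin an arbitrary user buffer so the HCA can DMA into it; this
>  * works for plain malloc/HeapAlloc memory. */
> mdl = IoAllocateMdl(userVa, length, FALSE, FALSE, NULL);
> if (mdl == NULL)
>     return STATUS_INSUFFICIENT_RESOURCES;
> __try {
>     MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
> } __except (EXCEPTION_EXECUTE_HANDLER) {
>     IoFreeMdl(mdl);
>     return GetExceptionCode();
> }
>
> /* The CPU physical pages behind the buffer. Programming these
>  * into the HCA is exactly the step that bypasses the DMA mapping
>  * APIs and assumes bus address == CPU physical address. */
> pfns = MmGetMdlPfnArray(mdl);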
>
> It would be great to find out where Microsoft stands on how
> user-mode RDMA, and memory registration in general, is
> supposed to interact with DMA APIs for non-consistent and
> non-coherent environments. I don't see how kernel bypass
> could ever work properly without running on consistent and
> coherent systems.
>
> Perhaps having a way of marking registered memory as
> non-cacheable on non-cache coherent systems, and then finding
> a way to get the bus address for the physical addresses would
> solve this. However, it still doesn't help if memory needs
> to be flushed out of the DMA controller (or CPU) without the
> application explicitly flushing buffers.
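>
> (For contrast, the explicit flush a by-the-book kernel driver
> issues per transfer is the line below; there is no user-mode
> equivalent short of a kernel transition:
>
> /* ReadOperation = TRUE means the device wrote into memory. */
> KeFlushIoBuffers(mdl, TRUE /* ReadOperation */, TRUE /* DmaOperation */);
>
> and on non-coherent platforms FlushAdapterBuffers as well.)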
>
> If we can find a way to let memory registrations of arbitrary
> virtual memory regions work properly with respect to DMA
> mapping and cache coherency, we'll have solved all the issues I think.
>
> > > For example, for internal ring buffers like those used for CQE
> > > and WQE rings, performing proper DMA mappings will break the
> > > hardware if verifier remaps these.
> > > I suppose a way around that is to allocate those buffers one
> > > page at a time with AllocateCommonBuffer, build up an MDL with
> > > the underlying CPU physical pages using MmGetPhysicalAddress on
> > > the returned virtual address, remap it to a contiguous virtual
> > > memory region using MmMapLockedPagesSpecifyCache, and then use
> > > the bus physical addresses originally returned by
> > > AllocateCommonBuffer to program the HCA. I don't know if this
> > > sequence would work properly, and it still doesn't solve the
> > > issue of an application registering its buffers.
> >
> > The ring buffers should be in common memory allocated with
> > AllocateCommonBuffer. You can't have the same physical memory
> > mapped as both cached and uncached; this is why
> > MmMapLockedPagesSpecifyCache exists.
>
> For user-mode QPs and CQs, the ring buffers are allocated in
> the application using malloc or HeapAlloc. There aren't
> special calls to the kernel to do the allocation. Allocating
> paged memory and pinning it isn't limited by the size of the
> non-paged pool, either, so things scale a whole lot further.
>
> > So why do you want to build a virtual address other than what
> > AllocateCommonBuffer returns?
>
> So that the application (whether kernel or user-mode) can treat
> the ring buffer as a virtually contiguous region even if it was
> built from multiple PAGE_SIZE calls to AllocateCommonBuffer.
> Calling AllocateCommonBuffer at runtime for large areas is likely
> to fail because a large physically contiguous region may not be
> available.
>
> So for an 8K buffer, I envision two calls to AllocateCommonBuffer
> for 4K each, building an MDL with those physical addresses, and
> then mapping that MDL into the user's virtual address space to
> present a single virtual address.
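>
> As an untested sketch (error handling omitted; 'adapter' is the
> PDMA_ADAPTER returned by IoGetDmaAdapter):
>
> PHYSICAL_ADDRESS busAddr[2];  /* program these into the HCA */
> PVOID kva[2];
> PMDL mdl;
> PPFN_NUMBER pfns;
> PVOID userVa;
> int i;
>
> /* Two separate 4K common-buffer allocations; each call also
>  * returns the bus-logical address for the hardware. */
> for (i = 0; i < 2; i++) {
>     kva[i] = adapter->DmaOperations->AllocateCommonBuffer(
>         adapter, PAGE_SIZE, &busAddr[i], FALSE /* CacheEnabled */);
> }
>
> /* Size an MDL for two pages; the start VA only affects sizing. */
> mdl = IoAllocateMdl((PVOID)(ULONG_PTR)PAGE_SIZE, 2 * PAGE_SIZE,
>                     FALSE, FALSE, NULL);
>
> /* Fill in the CPU physical pages behind the common buffers. */
> pfns = MmGetMdlPfnArray(mdl);
> for (i = 0; i < 2; i++) {
>     pfns[i] = (PFN_NUMBER)(MmGetPhysicalAddress(kva[i]).QuadPart
>                            >> PAGE_SHIFT);
> }
> mdl->MdlFlags |= MDL_PAGES_LOCKED;
>
> /* One virtually contiguous 8K mapping for the application; the
>  * caching type must match what AllocateCommonBuffer actually
>  * used, or we get the cached/uncached double-mapping problem. In
>  * real code this call needs a try/except for UserMode. */
> userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
>                                       NULL, FALSE,
>                                       NormalPagePriority);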
>
> > I have to admit the CacheEnabled parameter to
> > AllocateCommonBuffer is a little unclear, but I believe it's just
> > a "hint": if the system is cache coherent for all I/O, you get
> > cached memory. If the system is not cache coherent, you get
> > uncached memory.
> > I'll ask MSFT what the real story is.
> >
> > > Agreed, but any WHQL certification requires Microsoft to define
> > > a WHQL certification process for InfiniBand devices.
> >
> > It seems unknown if there will be IB specific WHQL tests. There
> > ARE storage and network tests, which IB virtual drivers will need
> > to pass (which may be really hard).
>
> Some of the tests just can't pass due to how IB presents
> these devices. For example, IPoIB reports itself as an 802.3
> device, but Ethernet headers are never sent on the wire. An
> application that rolls its own Ethernet packets won't work
> properly unless the target has been resolved via ARP first.
>
> There are other limitations, and from my talks with Microsoft
> about WHQL for IB, they would define which tests IB devices are
> exempt from because they cannot pass them.
>
> > The actual IB fabric driver may have to get certification as an
> > "other" device.
>
> I was pretty optimistic about the "other" device WHQL
> program, but I heard that was being cancelled.
>
> > Getting data-center-level WHQL certification for everything may
> > be extremely hard. On the other hand, I do believe iSCSI and TOE
> > Ethernet will have very official WHQL certification. My
> > experience is that 10 GbE TOE Ethernet goes pretty fast, and
> > current chips do RDMA and direct iSCSI and TCP DMA transfers into
> > appropriate buffers.
>
> I'm hoping that IB attached storage (SRP or iSER) will fit
> nicely into the iSCSI WHQL program. I don't know how well
> the RDMA and TCP chimney stuff will apply.
> It makes sense that things work properly for TOE devices -
> the DMA mappings shouldn't be any different than non-TOE
> devices. Likewise, RDMA from properly mapped addresses (as
> done in IPoIB and SRP) will also work fine for kernel
> drivers. However, I would expect iWARP device vendors that
> supply a WSD provider to have the same issues with memory
> registrations that IB has - that of needing to register
> arbitrary user-allocated memory for DMA access.
>
> > > How do you solve cache coherency issues without getting rid of
> > > kernel bypass?
> > > Making calls to the kernel to flush the CPU or DMA controller
> > > buffers for every user-mode I/O is going to take away the
> > > benefits of doing kernel bypass in the first place. That's not
> > > to say we won't come to this conclusion; I'm just throwing the
> > > questions out there. I'm not expecting you to have the answers
> > > - they're just questions that I don't know how to answer, and I
> > > appreciate the discussion.
> >
> > It's only a problem if you allow arbitrary buffers; if buffers
> > are allocated in the "proper" way, it's not an issue. Your memory
> > performance may be less on some systems, although those systems
> > will tend to be higher-powered systems to start with (like 16/32
> > core SMP).
>
> This gets back to the WSD infrastructure issue I raised above.
> I'm hoping that we're just missing something and that
> Microsoft has already solved things.
>
> > This message got rather longer than expected, sorry.
>
> No worries - lots of good information.
>
> Thanks!
>
> - Fab
>
>
> _______________________________________________
> openib-windows mailing list
> openib-windows at openib.org
> http://openib.org/mailman/listinfo/openib-windows
>