<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=US-ASCII">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2654.45">
<TITLE>RE: [Openib-windows] Windows DMA model</TITLE>
</HEAD>
<BODY>
<P><FONT SIZE=2>FYI: The new low-level driver I'm working on is using IoGetDmaAdapter and AllocateCommonBuffer to perform the mappings right ...</FONT></P>
<P><FONT SIZE=2>> -----Original Message-----</FONT>
<BR><FONT SIZE=2>> From: Fab Tillier [<A HREF="mailto:ftillier@silverstorm.com">mailto:ftillier@silverstorm.com</A>]</FONT>
<BR><FONT SIZE=2>> Sent: Wednesday, October 19, 2005 9:15 AM</FONT>
<BR><FONT SIZE=2>> To: 'Jan Bottorff'; openib-windows@openib.org</FONT>
<BR><FONT SIZE=2>> Subject: RE: [Openib-windows] Windows DMA model</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > From: Jan Bottorff [<A HREF="mailto:jbottorff@xsigo.com">mailto:jbottorff@xsigo.com</A>]</FONT>
<BR><FONT SIZE=2>> > Sent: Tuesday, October 18, 2005 10:24 PM</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > Why oh why does my ctrl-v sometimes send my half written email</FONT>
<BR><FONT SIZE=2>> > message... anyway...</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Thanks for resending, and also for bringing up these issues. I really</FONT>
<BR><FONT SIZE=2>> appreciate the feedback.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > I asked Microsoft about just calling MmGetPhysicalAddress for DMA and</FONT>
<BR><FONT SIZE=2>> > they responded:</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > =======================</FONT>
<BR><FONT SIZE=2>> > To summarize, your drivers will break on chipsets that need extra cache</FONT>
<BR><FONT SIZE=2>> > coherence help and on virtualized systems where there is an I/O MMU.</FONT>
<BR><FONT SIZE=2>> > Neither of these is particularly common today, but they'll be much more</FONT>
<BR><FONT SIZE=2>> > common in the near future. The drivers will also break on non-x86</FONT>
<BR><FONT SIZE=2>> > machines where DMA addresses don't equal CPU-relative physical addresses,</FONT>
<BR><FONT SIZE=2>> > but those machines have become very uncommon in the last five years.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> I wholeheartedly agree with this assessment. Note that support for</FONT>
<BR><FONT SIZE=2>> virtualization will require a whole lot of work to support kernel bypass in a</FONT>
<BR><FONT SIZE=2>> virtual machine, where the user-mode application in the virtual machine has</FONT>
<BR><FONT SIZE=2>> to bypass both the virtual machine's kernel and the host's kernel. It</FONT>
<BR><FONT SIZE=2>> would be great to figure out how to do this in Windows. I currently don't</FONT>
<BR><FONT SIZE=2>> really have a clue, though.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> I look forward to any input you might have as we try to find solutions to these</FONT>
<BR><FONT SIZE=2>> current deficiencies.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > >I don't know if you saw my RFC emails about that API or not.</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > I didn't see that, as I'm a very new member of the list. It sounds like</FONT>
<BR><FONT SIZE=2>> > you're saying the current low-level interface API will be</FONT>
<BR><FONT SIZE=2>> > changing in the future?</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Yes, the interface between the access layer and the HCA HW driver will change</FONT>
<BR><FONT SIZE=2>> first, to be followed by the ULP to Access Layer interface. I'll be getting</FONT>
<BR><FONT SIZE=2>> back to that soon I hope, and will be sending out headers for comments.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > >Also, driver verifier will completely break DMA for user-mode as it</FONT>
<BR><FONT SIZE=2>> > >forces double buffering to check that DMA mappings are used properly.</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > If you turn on DMA verification in driver verifier I believe it will</FONT>
<BR><FONT SIZE=2>> > double buffer ALL correctly done DMA, to help find memory boundary</FONT>
<BR><FONT SIZE=2>> > violations. This is also a check of the hardware.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> That's correct. Kernel clients should definitely do all DMA operations by the</FONT>
<BR><FONT SIZE=2>> book. The question is whether registrations (both user and kernel) should use</FONT>
<BR><FONT SIZE=2>> the DMA mapping functionality, or just use the physical addresses from the MDL</FONT>
<BR><FONT SIZE=2>> after it has been locked down. The former will result in verifier breaking</FONT>
<BR><FONT SIZE=2>> anything that uses registered memory, and the latter will result in broken DMA</FONT>
<BR><FONT SIZE=2>> due to the assumption that CPU and bus addresses are consistent and cache</FONT>
<BR><FONT SIZE=2>> coherent. I have doubts that kernel bypass could even work without cache</FONT>
<BR><FONT SIZE=2>> coherency, though.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> For example, for internal ring buffers like those used for CQE and WQE rings,</FONT>
<BR><FONT SIZE=2>> performing proper DMA mappings will break the hardware if verifier remaps these.</FONT>
<BR><FONT SIZE=2>> I suppose a way around that is to allocate those buffers one page at a time with</FONT>
<BR><FONT SIZE=2>> AllocateCommonBuffer, build up an MDL with the underlying CPU physical pages</FONT>
<BR><FONT SIZE=2>> using MmGetPhysicalAddress on the returned virtual address, remap it to a</FONT>
<BR><FONT SIZE=2>> contiguous virtual memory region using MmMapLockedPagesSpecifyCache, and then</FONT>
<BR><FONT SIZE=2>> use the bus physical addresses originally returned by AllocateCommonBuffer to</FONT>
<BR><FONT SIZE=2>> program the HCA. I don't know if this sequence would work properly, and it</FONT>
<BR><FONT SIZE=2>> still doesn't solve the issue of an application registering its buffers.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > Passing tests with driver verifier active will be required to obtain any</FONT>
<BR><FONT SIZE=2>> > kind of WHQL driver certification. Commercial users will absolutely need</FONT>
<BR><FONT SIZE=2>> > these certifications.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Agreed, but any WHQL certifications require Microsoft to define a WHQL</FONT>
<BR><FONT SIZE=2>> certification process for InfiniBand devices. That said, even without an</FONT>
<BR><FONT SIZE=2>> official IB WHQL program, the WHQL tests are a valuable test tool, as is</FONT>
<BR><FONT SIZE=2>> verifier. Once Microsoft has a program for IB, I expect they'll have thought of</FONT>
<BR><FONT SIZE=2>> how to handle kernel bypass and DMA mappings for memory registrations.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > >There is some work that needs to happen to get MAD traffic to do proper</FONT>
<BR><FONT SIZE=2>> > >DMA mappings, but upper level protocols already do the right thing.</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > I can understand how higher level packet traffic from a kernel mode NDIS</FONT>
<BR><FONT SIZE=2>> > based driver or buffers from a STORPORT based storage driver can have</FONT>
<BR><FONT SIZE=2>> > the correct mapping already. It also sounds like other kernel drivers</FONT>
<BR><FONT SIZE=2>> > that use the IBAL interface currently aren't assured DMA is correct.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> It's up to the client to do DMA mappings, since the client posts work requests to</FONT>
<BR><FONT SIZE=2>> its queue pairs. The problem is that it's not clear for which device to get a</FONT>
<BR><FONT SIZE=2>> DMA_ADAPTER, and the new interface will make that much clearer.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> The only part where IBAL is deficient with respect to DMA mappings is for MADs.</FONT>
<BR><FONT SIZE=2>> Anything else is the responsibility of the client. There's no clean way to make</FONT>
<BR><FONT SIZE=2>> IBAL know exactly how to perform DMA mappings for all users automatically.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> > >For the time being, since we're running on platforms where the CPU and</FONT>
<BR><FONT SIZE=2>> > >bus addresses are consistent, it hasn't been an issue.</FONT>
<BR><FONT SIZE=2>> > </FONT>
<BR><FONT SIZE=2>> > Microsoft seems to say it's more than just an address mapping issue;</FONT>
<BR><FONT SIZE=2>> > it's also a cache coherency issue. I'm not surprised that it's desirable</FONT>
<BR><FONT SIZE=2>> > to get software to help with cache coherency as the number of processor</FONT>
<BR><FONT SIZE=2>> > cores grows, especially on AMD processor systems with essentially a NUMA</FONT>
<BR><FONT SIZE=2>> > architecture.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> Aren't the AMD processors cache coherent, even in their NUMA architecture?</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> How do you solve cache coherency issues without getting rid of kernel bypass?</FONT>
<BR><FONT SIZE=2>> Making calls to the kernel to flush the CPU or DMA controller buffers for every</FONT>
<BR><FONT SIZE=2>> user-mode I/O is going to take away the benefits of doing kernel bypass in the</FONT>
<BR><FONT SIZE=2>> first place. That's not to say we won't come to this conclusion; I'm just</FONT>
<BR><FONT SIZE=2>> throwing the questions out there. I'm not expecting you to have the answers -</FONT>
<BR><FONT SIZE=2>> they're just questions that I don't know how to answer, and I appreciate the</FONT>
<BR><FONT SIZE=2>> discussion.</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> - Fab</FONT>
<BR><FONT SIZE=2>> </FONT>
<BR><FONT SIZE=2>> _______________________________________________</FONT>
<BR><FONT SIZE=2>> openib-windows mailing list</FONT>
<BR><FONT SIZE=2>> openib-windows@openib.org</FONT>
<BR><FONT SIZE=2>> <A HREF="http://openib.org/mailman/listinfo/openib-windows" TARGET="_blank">http://openib.org/mailman/listinfo/openib-windows</A></FONT>
<BR><FONT SIZE=2>> </FONT>
</P>
</BODY>
</HTML>