[libfabric-users] [ofiwg] mmap'ed kernel memory in fi_mr_reg

Jörn Schumacher joern.schumacher at cern.ch
Thu Dec 6 09:32:15 PST 2018


On 11/27/2018 09:08 AM, Jörn Schumacher wrote:
> On 11/26/2018 07:44 PM, Jason Gunthorpe wrote:
>> On Mon, Nov 26, 2018 at 04:42:49PM +0100, Jörn Schumacher wrote:
>>> On 11/19/2018 08:42 PM, Hefty, Sean wrote:
>>>>>> The only alternative I can think of is to try a normal registration
>>>>>> call, and if that fails, try again using the physical flag.  Would
>>>>>> this work, or does the normal registration call succeed, but produce
>>>>>> an unusable MR?
>>>>>
>>>>> This would not work because of a subtlety of the physical memory
>>>>> registration: the call actually passes NULL as the address. Check
>>>>> the GitHub link to my patch in the other email; there is a line that
>>>>> replaces the address with NULL.
>>>>>
>>>>> If a user passes an illegal virtual address, the call should fail.
>>>>> But if the libfabric call falls back to the physical address
>>>>> registration, it would actually succeed, because the address is
>>>>> replaced with NULL.
>>>>
>>>> I looked back at the patches and related documentation.  IMO, the 
>>>> verbs physical memory registration interface is just weird.  There 
>>>> is no association between the actual pages and the region AFAICT.
>>>
>>> Indeed this is a rather strange extension.
>>>
>>> I came across a potential solution: adjusting our driver to produce
>>> memory that is compatible with the RDMA stack in the kernel.
>>> Supposedly there is an alternative to remap_pfn_range. In that case we
>>> would not need the physical memory registration in libfabric anymore,
>>> and the overall solution would be cleaner (not dependent on the verbs
>>> provider).
>>
>> Several other people have been interested in this; I think many would
>> appreciate it if you shared your solution on the linux-rdma mailing
>> list.
> 
> Yes, I was thinking of writing this up for the list; I will do that in
> the next couple of days, after running some tests.

I posted the solution we found to the Linux-RDMA mailing list; I am
copying it below for reference on this list.

In the end, I do not think we need support in libfabric for physical
address registration; the other solution we found seems a lot cleaner.

Thanks for the help & cheers,
Jörn

---

Eventually we found a solution that works for our use case. I would like
to share it here in case somebody with a similar problem stumbles upon
this thread.

To summarize the problem once more: we have a driver that manages large
buffers that are used by a PCIe device for DMA writes. We would like to
use these buffers in RDMA calls, but ibv_reg_mr fails because the
mmap'ed memory is not compatible with the RDMA driver stack.
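
For illustration, here is a minimal sketch of the failing path, assuming
the driver exposes its buffer through a device node; /dev/mydev, the
length and the access flags are placeholders, and error handling is
omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

static struct ibv_mr *try_register(struct ibv_pd *pd, size_t len)
{
        int fd = open("/dev/mydev", O_RDWR);
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        /* With the original remap_pfn_range()-based mmap this returns
         * NULL (typically errno == EFAULT): the kernel's ib_umem_get() /
         * get_user_pages() path cannot pin VM_IO/VM_PFNMAP pages. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}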

The driver mentioned above was written by Markus and is not published 
anywhere right now, but the code could be shared (without guarantee of 
support) if it is of interest to anybody.

In fact there are two approaches that work.

Approach 1:
There is a verbs extension that allows the registration of physical
addresses. This extension is not available in the mainline kernel, but
the Mellanox OFED driver, for example, supports it. The concept is
written up in [1]; in a nutshell, it involves calling ibv_exp_reg_mr
with the IBV_EXP_ACCESS_PHYSICAL_ADDR flag. The call is not actually
associated with any memory address, but rather registers the full
physical address space.

The *physical* address can then be used in verb calls. Our driver 
exposes the physical address of managed memory to userspace, so this 
approach works fine.
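
For reference, a rough sketch of the registration, based on the example
in [1] (ibv_exp_reg_mr, the ibv_exp_reg_mr_in structure and the
IBV_EXP_* flags exist only with MLNX OFED, not mainline libibverbs; pd
is a placeholder):

#include <infiniband/verbs_exp.h>   /* MLNX OFED experimental verbs */

static struct ibv_mr *register_physical(struct ibv_pd *pd)
{
        struct ibv_exp_reg_mr_in in = {0};

        in.pd         = pd;
        in.addr       = NULL;  /* no buffer address is associated with the MR */
        in.length     = 0;     /* the MR covers the full physical address space */
        in.exp_access = IBV_EXP_ACCESS_LOCAL_WRITE |
                        IBV_EXP_ACCESS_REMOTE_READ |
                        IBV_EXP_ACCESS_REMOTE_WRITE |
                        IBV_EXP_ACCESS_PHYSICAL_ADDR;
        return ibv_exp_reg_mr(&in);
}

In subsequent work requests, sge.addr (and the remote address for RDMA
operations) is then the physical address exposed by our driver, used
together with the lkey/rkey of this MR.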

To get this to work with libfabric, we had to patch it slightly [2].
However, that patch is unlikely to land in mainline libfabric.


Approach 2:
The other idea is to mmap the device driver's memory into user space in
a way that is compatible with the RDMA drivers. Our original driver
uses remap_pfn_range, which works fine by itself, but the resulting
mapping is not compatible with the get_user_pages call used by the
Linux RDMA drivers. An alternative to remap_pfn_range is to provide an
implementation of the nopage method for the mapping's VMA. This is
described in detail in Rubini's book [3].
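
To make this more concrete, here is a minimal sketch of such a fault-based
mmap for recent kernels, which call the nopage method "fault"; my_dev,
buf_pages and buf_npages are hypothetical names, the mapping is assumed
to start at file offset 0, and our real driver is more involved:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

/* Hypothetical per-device state: the DMA buffer is assumed to be kept
 * as an array of struct page pointers. */
struct my_dev {
        struct page **buf_pages;
        unsigned long buf_npages;
};

/* Hand out one buffer page per fault instead of pre-mapping PFNs with
 * remap_pfn_range(). */
static vm_fault_t my_vma_fault(struct vm_fault *vmf)
{
        struct my_dev *dev = vmf->vma->vm_private_data;

        if (vmf->pgoff >= dev->buf_npages)
                return VM_FAULT_SIGBUS;

        get_page(dev->buf_pages[vmf->pgoff]);  /* reference for the new PTE */
        vmf->page = dev->buf_pages[vmf->pgoff];
        return 0;
}

static const struct vm_operations_struct my_vm_ops = {
        .fault = my_vma_fault,                 /* "nopage" in LDD3 terms */
};

/* mmap file operation: note that remap_pfn_range() is not called at all. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
        vma->vm_ops = &my_vm_ops;
        vma->vm_private_data = filp->private_data;  /* struct my_dev * */
        return 0;
}

The important difference to remap_pfn_range is that the PTEs now point
at ordinary struct-page-backed pages, which is exactly what
get_user_pages needs in order to pin the memory.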

An mmap implemented with the nopage approach produces a mapping that is
compatible with get_user_pages. Hence, the virtual address of such a
mapping can be used directly in any libibverbs or libfabric calls.
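
For completeness, a minimal sketch of the libfabric side; the domain is
assumed to be open already, and the access flags are only an example:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int register_buffer(struct fid_domain *domain, void *buf, size_t len,
                           struct fid_mr **mr)
{
        /* The plain virtual address of the fault-based mapping is accepted. */
        return fi_mr_reg(domain, buf, len,
                         FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE,
                         0 /* offset */, 0 /* requested_key */, 0 /* flags */,
                         mr, NULL /* context */);
}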


We opted for the 2nd approach.


Cheers,
   Markus & Jörn


[1] https://community.mellanox.com/docs/DOC-2480
[2] https://github.com/joerns/libfabric/compare/v1.6.x...joerns:phys_addr_mr
[3] https://lwn.net/Kernel/LDD3/, Chapter 15 "Memory Mapping and DMA"

