[libfabric-users] Two issues while using libfabric

Hefty, Sean sean.hefty at intel.com
Tue Dec 1 10:40:27 PST 2020


> Our team has run into several problems while developing a network system on Mellanox
> ConnectX using libfabric-1.7.0-1.el7.
> 
> We would appreciate your advice on these issues.

Btw, this email list is subscription-based. If you're not subscribed to the list, any email you send to it is placed into a pending queue and must be manually accepted to the list.


> Three nodes, A, B, and C, build links with each other; each node works as both a fabric
> client and a fabric server.
> For example, node A has 2 RDMA links: A-->B and C-->A. We have met two problems
> while using libfabric:
> 
> 1. When multiple nodes build links in pairs, how should memory be registered so that
> fi_write can target different nodes?
> 
> We found that these links have different fi_info and fi_domain, so the shared data must
> be registered on both links, and each link has its own key/desc for the same memory,
> which makes the app much more complicated. This is unacceptable. We believe that links
> on the same RNIC should share the same domain.
> 
> https://github.com/ofiwg/libfabric/issues/6259

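Libfabric is not a hardware-based interface. However, a domain often maps to a single NIC, and you can open multiple endpoints off a single domain. Memory is registered against the domain, so a buffer registered once on that domain can be used with every endpoint opened from it.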


> 2. All the nodes have the same hardware and the same CentOS OS, but on some nodes fi_mr_reg
> fails with error -12, while on others it works OK.
> 
> We found that it fails when libfabric uses the fi_ibv_mr_cache_ops mode and works OK
> when it uses fi_ibv_mr_ops. We don't know how to resolve this.
> 
> Are there any documents or manuals about these modes? How can we confirm which mode
> is in use?

You will want to disable the MR cache for the 1.7 release. You can use the fi_info utility (the -e option, I think) to examine the environment variables related to the cache. I don't remember which setting disables the cache in that release.
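
For reference, the -12 in the report is a negated errno value: libfabric calls return 0 or a negative error code, and FI_ENOMEM is defined as ENOMEM, which is 12 on Linux. The small C sketch below shows one way to turn such a return value into text using only the C library. The helper names describe_ret and is_out_of_memory are hypothetical; an application linked against libfabric could use fi_strerror instead.

```c
#include <errno.h>
#include <string.h>

/* Describe a libfabric-style return value: 0 means success,
 * a negative value is a negated errno code. */
static const char *describe_ret(int ret)
{
    return ret < 0 ? strerror(-ret) : "success";
}

/* True when the return value is the out-of-memory code (-12 on Linux). */
static int is_out_of_memory(int ret)
{
    return ret == -ENOMEM;
}
```

A caller could log describe_ret(ret) after a failed fi_mr_reg and check is_out_of_memory(ret) to recognise this particular failure.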

- Sean

