[libfabric-users] Two issues while using libfabric
Hefty, Sean
sean.hefty at intel.com
Tue Dec 1 10:40:27 PST 2020
> Our team has run into several problems while developing a network system on Mellanox
> ConnectX using libfabric-1.7.0-1.el7.
>
> We would appreciate your advice on these issues.
Btw, this email list is subscription based. If you're not subscribed to the list, any emails sent to it are placed into a pending queue and must be manually accepted to the list.
> Three nodes, A, B, and C, build links with each other; each node acts as both a fabric
> client and a fabric server.
> For example, node A has 2 RDMA links: A-->B and C-->A. We have met two problems
> while using libfabric:
>
> 1. When multiple nodes build links in pairs, how should memory be registered so that
> fi_write can target different nodes?
>
> We found that these links had different fi_info and fi_domain, so the shared data must
> be registered on both links, and each link has its own key/desc for the same memory,
> which makes the app much more complicated. This is unacceptable. We believe that links
> on the same RNIC should share the same domain.
>
> https://github.com/ofiwg/libfabric/issues/6259
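Libfabric is not a hardware-based interface. However, a domain often maps to a single NIC, and you can open multiple endpoints off a single domain. Memory is registered against the domain (fi_mr_reg takes the domain), so endpoints opened on the same domain can share one registration.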
> 2. All the nodes have the same hardware and the same CentOS release, yet fi_mr_reg
> fails with error -12 on some nodes while it works on others.
>
> We found that it fails when libfabric uses the fi_ibv_mr_cache_ops mode and works
> when it uses fi_ibv_mr_ops. We don't know how to resolve it.
>
> Are there any documents or manuals about these modes? How can we confirm which mode
> is in use?
You will want to disable the MR cache for the 1.7 release. You can use the fi_info (-e option I think) utility to examine the different environment variables related to the cache. I don't remember which setting will disable the cache in that release.
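
For reading the reported error, a small sketch may help: libfabric calls return a negated error code, and the common FI_E* codes reuse the system errno values, so -12 on Linux corresponds to ENOMEM (FI_ENOMEM). The helper names below are hypothetical and use only the C library; in a program linked against libfabric, fi_strerror(-ret) gives the equivalent text and also covers libfabric-specific codes that strerror does not know.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): describe a negative
 * return value by treating its magnitude as an errno value. */
static const char *describe_fi_ret(int ret)
{
    return ret < 0 ? strerror(-ret) : "success";
}

/* True when the call reported an out-of-memory condition. */
static int is_out_of_memory(int ret)
{
    return ret == -ENOMEM;
}

/* Example of logging a failed registration; 'ret' would be the value
 * returned by fi_mr_reg in the application. */
static void log_mr_reg_failure(int ret)
{
    fprintf(stderr, "fi_mr_reg failed: %d (%s)%s\n", ret, describe_fi_ret(ret),
            is_out_of_memory(ret) ? " [out of memory]" : "");
}
```

Logging the decoded error on each node makes it easier to compare the nodes that fail with the ones that work.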
- Sean