[libfabric-users] fi_endpoint rejects parameters returned by fi_getinfo (verbs provider)

Hefty, Sean sean.hefty at intel.com
Tue Jun 20 14:34:04 PDT 2023


> 1) Initialize hints:
> ```
> hints->caps = FI_MSG | FI_RMA;
> hints->ep_attr->type = FI_EP_MSG;
> hints->mode = FI_RX_CQ_DATA;
> hints->fabric_attr->prov_name = strdup("verbs");
> hints->domain_attr->name = strdup("mlx5_3");
> hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_ALLOCATED | FI_MR_PROV_KEY |
> FI_MR_VIRT_ADDR;
> hints->domain_attr->caps = FI_LOCAL_COMM | FI_REMOTE_COMM; // loopback
> 
> ```
> 
> 2) Pass these hints to fi_getinfo() with the FI_SOURCE flag and our port number, and
> use the returned info to open the fabric, domain, and passive endpoint, all of which
> are opened successfully.

Can you use fi_tostr() to print the fi_info returned from fi_getinfo() that you use?

> 3) Clear the src_addr field in the first info object, and pass it as hints to
> fi_getinfo(), along with a peer's hostname and port number, to get a new info object
> for a peer connection.

Do you also clear the src_addrlen?

Please print the fi_info again prior to passing it into fi_getinfo().  I want to verify that it did not change. 

> 4) Pass this new info object, along with the domain I opened earlier, to fi_endpoint to
> create an endpoint for the peer connection. This is where I am running into FI_ENODATA
> errors...

The opened domain should have made an internal copy of the fi_info that it was passed.  That copy is what the info you pass to fi_endpoint() is compared against.
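
As a rough illustration only, the check behind the log lines you quote below amounts to something like the following. The struct and function names here are hypothetical stand-ins, not the actual libfabric types or provider code, and I am assuming FI_ENODATA corresponds to the system ENODATA value.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for endpoint attributes; NOT the real struct fi_ep_attr. */
struct ep_attr_sketch {
    size_t max_msg_size;
};

/*
 * Compare the requested attributes against what the domain's saved copy
 * says is supported. Returns 0 on success, or -ENODATA (assumed here to
 * stand in for -FI_ENODATA) when the request exceeds the supported size.
 */
static int check_max_msg_size(const struct ep_attr_sketch *supported,
                              const struct ep_attr_sketch *requested)
{
    if (requested->max_msg_size > supported->max_msg_size) {
        fprintf(stderr, "Max message size too large\n");
        fprintf(stderr, "Supported: %zu\n", supported->max_msg_size);
        fprintf(stderr, "Requested: %zu\n", requested->max_msg_size);
        return -ENODATA;
    }
    return 0;
}
```

If the saved copy somehow holds 0 for the supported size, any nonzero request fails this kind of check, including the default of 1073741824.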

> Looking at the debug output, I saw the problem was that it was saying the max
> message size is not supported:
> ```
> libfabric:9509:verbs:core:ofi_check_ep_attr():691<info> Max message size too large
> libfabric:9509:verbs:core:ofi_check_ep_attr():692<info> Supported: 0
> libfabric:9509:verbs:core:ofi_check_ep_attr():692<info> Requested: 1073741824

I have no idea how the supported value ended up set to 0.  The fi_tostr() output requested above might help show whether it changed somewhere.

> ```
> I did not provide the requested value; 1073741824 is the default max message size
> returned from fi_getinfo(), and it matches the max message size reported by `fi_info -p
> verbs -d mlx5_3 -t FI_EP_MSG`. I don't need messages that large for my application, but
> I do need messages larger than zero!
> 
> So, my question is, why is fi_endpoint() rejecting this parameter if it is the default
> value? Is it getting the wrong provider/domain information somehow? I tried explicitly
> setting the domain and fabric pointers in the info object to the open instances, but it
> did not resolve the error. Any help is appreciated.

Please try a newer version of libfabric.  It's possible you're hitting a problem that has been fixed in a later version.

- Sean

