[libfabric-users] fi_endpoint rejects parameters returned by fi_getinfo (verbs provider)

George Hodgkins chho1459 at colorado.edu
Tue Jun 20 13:05:22 PDT 2023


Hi all,
I am working on porting an RDMA application from libibverbs to libfabric
(version 1.11), but I am running into trouble with basic endpoint
configuration using the verbs provider. These are the steps I am following
to set up the network:

1) Initialize hints:
```
hints->caps = FI_MSG | FI_RMA;
hints->ep_attr->type = FI_EP_MSG;
hints->mode = FI_RX_CQ_DATA;
hints->fabric_attr->prov_name = strdup("verbs");
hints->domain_attr->name = strdup("mlx5_3");
hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_ALLOCATED |
                              FI_MR_PROV_KEY | FI_MR_VIRT_ADDR;
hints->domain_attr->caps = FI_LOCAL_COMM | FI_REMOTE_COMM; // loopback
```

2) Pass these hints to fi_getinfo() with the FI_SOURCE flag and our port
number, and use the returned info to open the fabric, domain, and passive
endpoint, all of which are opened successfully.

3) Clear the src_addr field in the first info object, and pass it as hints
to fi_getinfo(), along with a peer's hostname and port number, to get a new
info object for a peer connection.

4) Pass this new info object, along with the domain I opened earlier, to
fi_endpoint() to create an endpoint for the peer connection. This is where I
am running into FI_ENODATA errors.

Looking at the debug output, I saw that the failure comes from the max
message size check, which reports the requested size as too large:
```
libfabric:9509:verbs:core:ofi_check_ep_attr():691<info> Max message size too large
libfabric:9509:verbs:core:ofi_check_ep_attr():692<info> Supported: 0
libfabric:9509:verbs:core:ofi_check_ep_attr():692<info> Requested: 1073741824
```
I did not provide the requested value; 1073741824 is the default max
message size returned from fi_getinfo(), and it matches the max message
size reported by `fi_info -p verbs -d mlx5_3 -t FI_EP_MSG`. I don't need
messages that large for my application, but I do need messages larger than
zero!
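
To make the failure concrete: judging only from the log output above, the
check behaves as if it compares the requested size against a supported limit
of 0. The following is a minimal standalone model of that comparison, not the
libfabric source. The function name `check_max_msg_size` and the use of plain
`ENODATA` from `<errno.h>` are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the size check implied by the debug log.
 * This is NOT the libfabric implementation; it only mirrors the
 * "Supported" / "Requested" comparison that the log reports.
 *
 * Returns 0 if the requested size fits within the supported limit,
 * or -ENODATA if the requested size exceeds it.
 */
int check_max_msg_size(size_t supported, size_t requested)
{
    if (requested > supported)
        return -ENODATA;
    return 0;
}
```

Under this model, a supported limit of 0 rejects every non-zero request, not
just the 1073741824-byte (1 GiB) default, so lowering the requested size in
the info object would not by itself get past the check.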

So, my question is, why is fi_endpoint() rejecting this parameter if it is
the default value? Is it getting the wrong provider/domain information
somehow? I tried explicitly setting the domain and fabric pointers in the
info object to the open instances, but it did not resolve the error. Any
help is appreciated.

Thanks,
George