[libfabric-users] netdir / fi_endpoint

Hefty, Sean sean.hefty at intel.com
Tue Nov 2 08:49:14 PDT 2021


> So, the first place our code stumbled was with a call to fi_endpoint on the subscribing
> side. It returned FI_OTHER, which was not super helpful. Digging deeper into the netdir
> provider layer of the code, I could see that Network Direct call was returning
> ND_INSUFFICIENT_RESOURCES.

Just as a warning, the ND provider has had very limited testing. AFAIK, only Intel MPI makes use of it.

> I was hoping to find that I could control the values that were requested, but that
> appears not to be the case. I found the code in question,
> prov/netdir/src/netdir_ep.c:ofi_nd_ep_control, was simply passing in the maximum values
> it obtained from an earlier query.

It is bad form for the provider to request maximum values. Other providers select a reasonable default (for queue sizes, usually in the hundreds) and also provide environment variables for configuration.
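
As a rough illustration of that pattern (not code from any existing provider), a provider could start from a modest default, let an environment variable override it, and only use the hardware maximum as an upper bound. The names EX_DEFAULT_QUEUE_SIZE and ex_pick_queue_size, and whatever variable name is passed in, are hypothetical. libfabric has its own facility for provider environment variables; plain getenv is used here only to keep the sketch self-contained.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical default, in line with "usually in the hundreds". */
#define EX_DEFAULT_QUEUE_SIZE 256

/*
 * Pick a queue size: start from the default, let the environment variable
 * named by env_name override it, and clamp the result to hw_max (the
 * maximum reported by the adapter query).
 */
size_t ex_pick_queue_size(const char *env_name, size_t hw_max)
{
    unsigned long long wanted = EX_DEFAULT_QUEUE_SIZE;
    const char *val = env_name ? getenv(env_name) : NULL;

    if (val && *val && *val != '-') {
        char *end = NULL;
        unsigned long long parsed;

        errno = 0;
        parsed = strtoull(val, &end, 10);
        /* Accept the override only if the whole string is a positive number. */
        if (errno == 0 && end != val && *end == '\0' && parsed > 0)
            wanted = parsed;
    }

    /* The maximum is a ceiling, never the value requested by default. */
    if (wanted > hw_max)
        wanted = hw_max;

    return (size_t)wanted;
}
```

With this shape, an unset or malformed variable falls back to the default, and a request larger than the adapter supports is reduced rather than failing outright.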

> It seems to me that since the values being utilized are maximums, that it would be
> reasonable to expect to be able to query them via fi_getopt and then choose something
> other than the default via fi_setopt. I am new to libfabric, so perhaps there is
> already some other mechanism in place to achieve this that I am unaware of.

The expected behavior is for the application to specify its requirements through the fi_info hints, and for the provider to adjust the returned values accordingly.

If you could, please open an issue on GitHub with these details, marking the problem as a bug. If you have a fix, it would be welcome.  :)

- Sean

