[ofiwg] draft of proposed API changes for libfabric 2.0

Hefty, Sean sean.hefty at intel.com
Fri Sep 29 09:45:55 PDT 2023


Based on ofiwg and github discussions, I created an initial draft of changes targeting libfabric 2.0.  All changes are preliminary and need community approval.  Changes are being staged in this pull request:

https://github.com/ofiwg/libfabric/pull/9384

I plan on walking through the changes in the ofiwg starting next week.

Summary:
2.0 targets *ABI* compatibility with 1.x, but NOT *API* compatibility.  In more practical terms, this means that some calls that an app makes against 1.x that work may fail against 2.x (e.g. returns EINVAL or ENOSYS).  For reference, the latest libfabric API is 1.19, while the current ABI is 1.6.  libfabric 2.0 is likely to align with ABI 1.7 or 1.8. 

Included in the changes are updates to the fabric.7 man page which outlines:

- API changes
- Reasons for a change
- Expected impact to applications
- Actions an application would need to take if impacted

The updates are included below for convenience.  (Warning, I haven't proofread the text yet and wrote it while watching (American) football, so consider it a rough draft as well.)

-  Sean

---

# VERSION 2

The version 2 series focuses on streamlining the libfabric API.  It removes
features from the API to enable simplifying and optimizing provider
implementations, plus provide greater design guidance to users.  Version 2 is
ABI compatible with version 1, but not API compatible.  Data structures and
APIs that are common between version 1 and version 2 share the same layouts
and signatures.  Applications, once updated to the version 2 APIs, should be
able to support both libfabric version 1 and 2 with minimal to no differences
to their code bases.   The following section describe the significant changes
between version 1 and 2 which may impact applications.  New features being
introduced in version 2 are not included, as the impact is the same as if
they were added into a newer version 1 release.

## Address Vector Simplified

Support for asynchronous insertion of addresses into an AV is removed.  The
feature was never natively supported by a provider.  Support for FI_AV_MAP
is removed.  Providers are required to support FI_AV_TABLE.  The expected
impact to applications with these changes is minimal.  Applications that use
FI_AV_MAP can use the same logic with FI_AV_TABLE.  By limiting fi_addr_t
values to indices, additional features and optimizations become available
which would otherwise require more significant changes to the API.

## Completion Queue Restriction

Endpoints may only be associated with a single completion queue.  Support
for transmit and receive completions to be written to separate CQs is
removed.  In many cases, providers are unable to completely decouple transmit
operations from receives and must always drive progress across both CQs.
This is required by providers that implement portions of the API in
software, such as tag matching or support for advanced completion levels
(e.g. FI_DELIVERY_COMPLETE).  Limiting an endpoint to a single completion
queue simplifies and optimizes the provider implementation and allows for
a more direct mapping between a libfabric CQ and hardware.

Note that this restriction applies to endpoints that are bound to
completion queues.  This includes receive and transmit contexts that are
part of a scalable endpoint.  When scalable endpoints are involved,
separate receive contexts and transmit contexts may still be associated
with separate completion queues.

The expected impact of this change on applications is none to moderate.
Applications that use a single CQ per endpoint are not impacted.  For
applications that bind an endpoint to separate send and receive CQs,
they must update to use a single CQ.  Completion flags can then be used
to determine if a CQ entry is for a send or receive completion.  See the
Threading Model section for more information on possible impacts to
serialization.

## Completions Simplified

The completion options FI_ORDER_STRICT and FI_ORDER_DATA are removed, along
with FI_ORDER_NONE.  The latter is a flag that was defined as 0, making its
use confusing.  There are no guarantees on completion order, including how
the data is written to target memory.  This gives providers greater
flexibility on implementation and wire protocols.

For applications that use unconnected endpoints, there is no expected impact
with these changes.  Most providers did not support these ordering requirements
over unconnected endpoints.  However, applications that use connected endpoints
might be impacted.  Some providers did implement these options over connected
endpoints.  Applications which are designed around these features will
require updates to their design.

## Endpoint Type Removal

Experimental endpoint types that were added for exploratory purposes
are removed.  These are FI_EP_SOCK_DGRAM and FI_EP_SOCK_STREAM.  There
is no expected impact to applications with this change.

## Header File Cleanup

Definitions which should have been kept internal to libfabric were exposed
directly to applications by being included in the publicly installed
header files.  Such definitions are moved into internal headers.
Applications using these definitions will fail to compile against the
version 2 header files and need to define equivalent values and
macros in their own code bases.  Removed definitions were not documented
in libfabric man pages.

## Feature (Deprecated) Removal

Features and attributes that were marked as deprecated in version 1
releases were removed from the API.  This includes
fi_rx_attr::total_buffered_recv, FI_LOCAL_MR, FI_MR_BASIC, and
FI_MR_SCALABLE.  Most applications are not expected to be impacted by these
changes.

## Feature (Unimplemented) Removal

Features which were never implemented by an upstream provider or known
to be implemented by an out of tree provider are removed.  This includes
FI_VARIABLE_MSG, FI_XPU_TRIGGER, FI_RESTRICTED_COMP, FI_WAIT_MUTEX_COND,
and FI_NOTIFY_FLAGS_ONLY.  There is no expected impact to applications
with this change.

## fi_info Requirements

The APIs now require that all fi_info structures passed to any function
call be allocated by libfabric.  Applications are no longer permitted to
hand-craft an fi_info structure directly.  This enables libfabric to
allocate and access hidden and provider specific fields, which can be passed
between API calls.  For most applications, we expect no impact by this
change.  Most applications use the fi_allocinfo() and fi_dupinfo() calls
to allocate and duplicate fi_info structures prior to modifying them.

## Logging Simplified

The fi_log related calls are updated to replace the enum fi_log_subsys with
int flags.  Subsystem based filtering has been of limit use, while
complicated the logging implementation.  Additionally, provider developers
can easily select the wrong subsystem for a given log message.  This change
mostly impacts providers, but also impacts users displaying logging data.
Applications that intercept logging messages should be inspected to see
if the subsys parameters are interpreted.

Version 2 providers will pass in a flags value of 0 instead, which
corresponded with the FI_LOG_CORE subsystem.  The flags parameter provides
a mechanism by which the logging API can be extended in a compatible manner.

## Memory Registration Simplified

Support for asynchronous memory registration is removed.  The feature was
never natively supported by a provider.  Deprecated memory
options: FI_LOCAL_MR, FI_MR_BASIC, and FI_MR_SCABLE, are also removed.  Most
applications are not expected to be impacted by these changes.  Applications
still using either FI_MR_BASIC or FI_MR_SCALABLE must be updated to use the
modernized mr_mode bits, which can provide equivalent functionality.

## Poll Sets Removal

The poll set (fid_poll) object and poll set APIs are removed.  This includes
the calls fi_poll(), fi_poll_add(), and fi_poll_del().  Poll sets are
implemented as simple iterators used to drive progress over an array of CQs
and counters.  Such iteration is trivial for an application to implement,
but the possibility of needing to support poll sets adds undo complexity
to providers.  Poll sets also introduce inefficiencies by driving progress
without retrieving completions directly.  This results in applications
needing to recheck the CQs and counters, which drives progress an additional
time.

It is believed that poll sets are not widely used by applications.  However,
for applications that do use poll sets, the poll set calls would need to be
replaced equivalent functionality.  It is expected that such functionality
would be straightforward to implement for most applications.

## Progress Model Simplified

The separate fi_domain_attr:control_progress and data_progress options are
combined under a single fi_domain_attr::progress field.  The new field is an
alias for data_progress.  There is minimal expected impact to applications
with this change.  In most situations, control and data progress
operations are closely coupled, such that the data progress option took
preference.

However, applications using FI_EP_MSG may see an impact if they
rely on FI_PROGRESS_AUTO for control operations, such as handling connections,
but FI_PROGRESS_MANUAL for data operations.  Depending on the provider
implementation, manual progress may require more frequent checking for
asynchronous events (i.e. reading an EQ) to prevent stalls or timeouts
handling connection events.

Providers are requested to consider such impacts when migrating to
a single progress model.

## Provider Removal

Several providers that are not actively updated have been removed.
Applications that use these providers will need to use a version 1 library.
Removed providers are: bgq, gni, netdir, rstream, sockets, and usnic.
Applications using the sockets provider may be able to use the tcp as
an alternative.  Windows Network Direct support is available through the
verbs provider.

## Threading Model Simplified

To help align application design with provider implementation for lockless
operation, the threading models align on FI_THREAD_DOMAIN for standard endpoints
and FI_THREAD_COMPLETION for scalable endpoints.  The threading models for
FI_THREAD_FID and FI_THREAD_ENDPOINT are removed.  Multi-threaded applications
using those threading models that desire lockless operation should consider
mapping libfabric resources around the domain or completion structures,
based on their endpoint type.  Requests for FI_THREAD_FID or FI_THREAD_ENDPOINT
will be treated as FI_THREAD_SAFE.

There is minimal to no expected impact to applications with this change, as
most providers did not optimize for the removed threading models.

## Wait Sets Removal

The wait set (fid_wait) object and wait set APIs are removed.  This includes
the fi_wait() call and the FI_WAIT_SET enum fi_wait_obj.  Wait sets add
complexity and inefficiency to the provider implementations by restricting
the wait object to which completion events must be reported.  The removal of
wait sets does not impact support for blocking reads (e.g. fi_cq_sread) or
the use of other wait objects (e.g. FI_WAIT_FD).

It is believed that wait sets are not widely used by applications.  However,
for applications that use wait sets, wait set functionality would need to
be replaced with equivalent functionality.  This can be accomplished by
requesting native operating system wait objects (e.g. FI_WAIT_FD), using
fi_control() operations to retrieve the wait object, and using a native
operating system call (e.g. poll() or epoll()).


More information about the ofiwg mailing list