[ofiwg] variable length message support

Tue Feb 27 16:36:14 PST 2018

As mentioned in today's ofiwg, I've opened a PR that documents what I'm referring to as variable length message support.

https://github.com/ofiwg/libfabric/pull/3876

The commit message (justification) for this feature is:
---
Applications often need to receive messages, but have no idea
how large the message will be until the sender posts it. In
practical terms, the result is that many apps need to
implement a rendezvous protocol using some combination of
tagged messages + RMA, tagged messages, or normal messages + RMA.
This duplicates work that most of the OFI providers already do.
The apps must also guess the best way to do this, which may not
be optimal for a given provider.

This defines a new feature that abstracts the behavior that
applications are wanting -- the ability to receive a message
of variable length, with the size of the message determined
by the sender and given to the receiver as part of a 'pre'
completion event.
---

This feature came out of lengthy discussions examining the needs of applications targeting AI and cloud.  But the solution seems so obvious in hindsight that we believe it is also applicable to HPC and multi-rail use cases.  In practice, I anticipate this feature impacting the underlying protocol.  It would be left to providers to determine the best mechanism for supporting this feature and ensuring that it does not negatively impact other aspects of the API (e.g. throwing off counters).

The main documentation for this feature is listed below for your reading pleasure.  It is modeled after the tagged interface's claim/discard feature.
---
+Variable length messages, or simply variable messages, are transfers
+where the size of the message is unknown to the receiver prior to the
+message being sent.  They are most commonly used when the size of
+message transfers varies greatly, with very large messages
+interspersed with much smaller messages.
+
+Variable messages are associated with a variable message threshold.
+The variable threshold indicates the size above which a transfer
+becomes a variable message.  The completion mechanism of variable
+messages differs from standard receive completions; however,
+completions at the sender remain unchanged.  Messages smaller than the
+threshold are treated as standard messages (or tagged messages if
+using the fi_tagged.3 operations).  That is, they consume posted
+application receive buffers and generate standard completions, including
+generating any possible errors that may arise.  The variable message
+threshold is configurable per endpoint, subject to provider limitations.
+Under most conditions, the threshold limit must be the same at both the
+sending and receiving endpoints, and must be configured prior to
+enabling the endpoint.
+
+When a variable message is ready to be received, a notification is
+generated on the associated receive completion queue.  Such
+completions will have the FI_VARIABLE_MSG flag set as part of the CQ
+entry.  The entry will report the length of the message to the receiver.
+Since variable message notifications are not directly associated with
+an application's posted receive operation, the CQ entry's op_context
+field will point to a struct fi_var_context.
+
+{% highlight c %}
+struct fi_var_context {
+	void *op_context;
+};
+{% endhighlight %}
+
+After being notified that a variable message is ready to be received,
+applications should either claim or discard the message.  To claim a
+message, an application must post a receive operation with the
+FI_CLAIM flag set.  The struct fi_var_context returned as part of the
+notification must be provided as the receive operation's context.  The
+struct fi_var_context contains an op_context field.  Applications may
+modify this field prior to claiming the message.  When the claim
+operation completes, a standard receive completion entry will be
+generated on the completion queue.  The op_context of the associated
+CQ entry will be set to the op_context value passed in through
+the fi_var_context structure.
+
+Applications that do not wish to receive a variable message that they
+were notified of may discard it.  To discard a message, an application
+must post a receive operation with the FI_DISCARD flag set.  The
+receive context should be the struct fi_var_context from the
+notification.  When the FI_DISCARD flag is set, the receive input
+buffer(s) and length parameters are ignored.
+
+The use of the FI_CLAIM and FI_DISCARD operation flags is also
+described with respect to tagged message transfers in fi_tagged.3.
+Variable length tagged messages will include the message tag as part
+of the message notification.
+
+Support for variable messages is indicated through the FI_VARIABLE_MSG
+capability bit.  Additionally, the variable length message threshold
+may be obtained and/or adjusted using an endpoint's
+fi_getopt/fi_setopt operations.
+
+The handling of variable message headers follows all message ordering
+restrictions assigned to an endpoint.  For example, completions
+may indicate the order in which variable messages arrived at the
+receiver.  However, the transfer of variable message data should be
+treated as conceptually occurring out of band.  No ordering within or
+between the data of variable messages is implied.
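
To make the claim/discard flow above concrete, here is a rough receiver-side sketch.  It is not code from the PR and does not call libfabric.  The SKETCH_* flag values, struct sketch_cq_entry, struct sketch_recv, and sketch_handle_entry() are hypothetical names made up for illustration; only struct fi_var_context mirrors the layout in the proposed text.  The size limit (max_len) is an assumed application policy, not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in flag bits for this sketch only (assumption); the real
 * FI_VARIABLE_MSG, FI_CLAIM and FI_DISCARD values come from libfabric. */
#define SKETCH_VARIABLE_MSG (1ULL << 0)
#define SKETCH_CLAIM        (1ULL << 1)
#define SKETCH_DISCARD      (1ULL << 2)

/* Same layout as the struct in the proposed man page text. */
struct fi_var_context {
	void *op_context;
};

/* Hypothetical, trimmed-down CQ entry: only the fields the text mentions. */
struct sketch_cq_entry {
	void *op_context;
	uint64_t flags;
	size_t len;
};

/* Hypothetical description of the receive the application posts next. */
struct sketch_recv {
	uint64_t flags;	/* 0 means nothing to post (standard completion) */
	void *buf;
	size_t len;
	void *context;
};

/* Decide how to answer one CQ entry: claim the variable message if it
 * fits the application's limit and a buffer can be allocated, otherwise
 * discard it.  The caller owns (and must free) the returned buffer. */
static struct sketch_recv
sketch_handle_entry(const struct sketch_cq_entry *entry, size_t max_len,
		    void *app_context)
{
	struct sketch_recv recv = { 0, NULL, 0, NULL };
	struct fi_var_context *var_ctx;

	assert(entry != NULL);

	/* Not a variable message notification: handle as a normal completion. */
	if (!(entry->flags & SKETCH_VARIABLE_MSG))
		return recv;

	/* For notifications, op_context points to a struct fi_var_context,
	 * which must be passed back as the receive operation's context. */
	var_ctx = entry->op_context;
	recv.context = var_ctx;

	if (entry->len <= max_len)
		recv.buf = malloc(entry->len);

	if (!recv.buf) {
		/* Too large or allocation failed: buffer and length are ignored. */
		recv.flags = SKETCH_DISCARD;
		return recv;
	}

	/* This value is what the final receive completion will report. */
	var_ctx->op_context = app_context;
	recv.flags = SKETCH_CLAIM;
	recv.len = entry->len;
	return recv;
}
```

In a real application, the returned description would be turned into an actual receive posted with the matching flag, presumably through a flags-taking call such as fi_recvmsg or fi_trecvmsg.  That part is left out since it depends on how the PR ends up defining things.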

- Sean