[libfabric-users] problems with FI_MULTI_RECV buffer errors in the sockets provider (and perhaps others?)

Mon Apr 22 16:06:28 PDT 2019

> I’ve been debugging my way through the sockets provider for the last few days because I
> am having problems using FI_MULTI_RECV buffers.   The code seems to be doing what the
> man pages say it should do, but I can’t see how to recover from RX completion queue
> errors when they occur.   For this discussion I am just referencing difficulties
> processing a FI_ETRUNC error, I haven’t yet considered other errors.   And I haven’t
> yet tried other providers like TCP+RDM, that’s probably the next step, but in the
> meantime I’d like to have folks be aware of this.
> 
> I know about fi_setopt(…, FI_OPT_MIN_MULTI_RECV,…) command, but that doesn’t completely
> solve my problem, it is only useful for avoiding message truncation if all of your
> messages tend to be the same size, and you can predict the largest message you might
> ever receive, and then set the contents of the function’s  ‘opt_val’ parameter to that
> size.  And then in order to make multi-recv buffers efficient, the buffers should be
> many times that size.   That’s not always possible.
> 
> First mystery:   I have many FI_MULTI_RECV buffers queued on the server’s endpoint, yet
> the sockets provider will truncate a message in order to fit it into the multi-recv
> buffer at the head of the list.   Why deliberately destroy a message when there are
> follow-on buffers large enough to hold that message?   The code could instead create a
> FI_MULTI_RECV completion event for the current buffer, then place the new message in
> the next buffer and avoid any truncation.   Is there something I am missing here?

The intent is to support as many potential implementations as possible.  You could argue that the message should go to the next posted buffer.  But what if there isn't another posted buffer?  Or what if the next posted buffer is smaller than the message?  Should we require the implementation to check all buffers in the list looking for one that fits?  Should it look for the best fit?  Should it queue the message into an unexpected list?  What about the buffers that were skipped in the search?  Does the provider flush them, or keep them for future messages?  This is non-trivial.

Reporting ETRUNC is a valid option, although a provider could instead move to the next buffer and place the message there; I think the API would support that.  Anything beyond that is basically undefined.
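
To make the two behaviours discussed above concrete, here is a minimal sketch in plain C. It is a toy model, not sockets provider code: the names `place_message`, `POLICY_TRUNCATE_AT_HEAD` and `POLICY_ADVANCE_ONE` are hypothetical, and each posted multi-receive buffer is reduced to the number of free bytes it has left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not libfabric code. */
enum placement_policy {
    POLICY_TRUNCATE_AT_HEAD, /* always use the head buffer, truncate if needed */
    POLICY_ADVANCE_ONE       /* if the head is too small, try the next buffer only */
};

struct placement {
    int    buf_index;  /* buffer that receives the message, -1 if none posted */
    size_t copied;     /* bytes actually placed */
    int    truncated;  /* nonzero if copied < msg_len (would be reported as ETRUNC) */
};

static struct placement place_message(const size_t *free_bytes, size_t nbufs,
                                      size_t msg_len, enum placement_policy p)
{
    struct placement r = { -1, 0, 1 };
    size_t idx = 0;

    assert(free_bytes != NULL || nbufs == 0);
    if (nbufs == 0)
        return r;   /* nothing posted: this model does not define the outcome */

    /* ADVANCE_ONE skips the head only when the message does not fit there
     * and a following buffer exists. It never searches further. */
    if (p == POLICY_ADVANCE_ONE && msg_len > free_bytes[0] && nbufs > 1)
        idx = 1;

    r.buf_index = (int)idx;
    r.copied = msg_len <= free_bytes[idx] ? msg_len : free_bytes[idx];
    r.truncated = r.copied < msg_len;
    return r;
}
```

With free space of 32 and 4096 bytes and a 96-byte message, the first policy truncates at the head while the second places the whole message in the next buffer. With free space of 32 and 64 bytes, both policies truncate, which is the "next posted buffer is smaller than the message" case raised above.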

Sockets takes the unforgiving approach.  We need to be careful setting a precedent on what future hardware must do.  (I'm not aware of hardware that supports multi-receive directly, though some Portals-based NICs might.)

You might want to look at variable length messages as an alternative for supporting messages with widely differing sizes.

> The fi_cq_err_entry struct for the error is produced by sock_cq_report_error(), which
> is called by sock_pe_report_rx_error(), which is called from sock_pe_process_rx_send().
> Note that the fi_cq_err_entry structure does not contain a fi_addr_t field, so the
> FI_SOURCE address of the client who sent the truncated message is lost, there is no way
> to identify the affected client and send it a “please-send-msg-XXX-again” message.  I
> don’t see any solution for this.

Hmm... this should be fixed.  I'm not sure how yet.

> There is nothing within the fi_cq_err_entry to identify which message from the client
> was lost either.   Here is an example of an error about to be posted on the socket
> completion queue’s ring buffer by sock_cq_report_error(), both in decimal and hex
> format.   The flags field corresponds to FI_MULTI_RECV|FI_RECV|FI_MSG.   In my case I
> am not using ‘data’ or ‘tag’, many messages have length of 96, there is nothing unique
> here to identify the message.
> 
> (gdb) p errbuf
> $9 = {op_context = 0x7fffe8000b40, flags = 66562, len = 96, buf = 0x0, data = 1365,
>   tag = 0, olen = 4, err = 265, prov_errno = -265, err_data = 0x20000,
>   err_data_size = 0}
> (gdb) p/x errbuf
> $10 = {op_context = 0x7fffe8000b40, flags = 0x10402, len = 0x60, buf = 0x0,
>   data = 0x555, tag = 0x0, olen = 0x4, err = 0x109, prov_errno = 0xfffffef7,
>   err_data = 0x20000, err_data_size = 0x0}
> 
> Notice that the ‘buf’ field is zero, so you can’t even find the data that was copied
> into the buffer up to the point of the truncation.  That ‘buf’ value is coming from
> sock_cq_report_error() in the following piece of code, it is taking the ‘if’ branch:
> 
>         if (entry->type == SOCK_PE_RX)
>                 err_entry.buf = (void *) (uintptr_t) entry->pe.rx.rx_iov[0].iov.addr;
>         else
>                 err_entry.buf = (void *) (uintptr_t) entry->pe.tx.tx_iov[0].src.iov.addr;
> 
> I suspect that this is a bug, because in sock_pe_process_rx_send() I can dump the
> multi-recv buffer and see that bytes were in fact copied into the buffer up to the truncation
> point.   If ‘buf’ was set to the correct value, then at least I might be able to parse
> the beginning of the truncated message to find a message ID, which if I could then
> somehow also get the client’s fi_addr_t, would allow me to send a “please-send-msg-XXX-
> again” message to that client.
> 
> 
> 
> Note that the FI_MULTI_RECV flag is set in the fi_cq_err_entry, but because ‘buf’ is
> zero, there is no way to identify which multi-recv buffer is full!   In my case, I
> happen to use the op_context pointer as a place to record the address of the multi-recv
> buffer in which the message landed, so I know which buffer is full.   But this is still
> awkward, this completion error is scores of completions into the future from where I am
> currently reading with fi_cq_sread(), I haven’t even gotten to any completions which
> are in that buffer yet.   If I recycle the buffer immediately based on the
> FI_MULTI_RECV flag, I will destroy the data for all the upcoming completions that I
> haven’t yet read.

Multi-receive completions should always return the op_context associated with the posting of the multi-receive buffer.  The buf pointer is an offset into that buffer.  But the intent is that the app can identify the multi-receive buffer from op_context, not buf.

I'm not following the issue with previous/future completions.  If the app is actively using the buffer, I would look at maintaining a reference count for when the buffer can be reposted.  I'm not sure what the exact problem is or how libfabric can do anything different.  The use of the buffer is outside of its scope.
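
A reference count of the kind suggested above might look like the following sketch. It is plain C with no libfabric dependency, and every name in it (`struct mrecv_buf`, `mrecv_on_completion`, `mrecv_put`, `mrecv_idle`, `MODEL_MULTI_RECV`) is hypothetical. The assumption is that the application passes the address of its own per-buffer descriptor as op_context when posting, so each completion leads straight back to the descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for FI_MULTI_RECV; real code would use the definition from
 * <rdma/fabric.h> instead of this local constant. */
#define MODEL_MULTI_RECV (1ULL << 16)

/* Hypothetical per-buffer descriptor, one per posted multi-receive buffer. */
struct mrecv_buf {
    unsigned refs;      /* messages in this buffer the app is still using */
    bool     released;  /* a completion with the multi-recv flag was seen */
};

/* Call for every completion (success or error) whose op_context is `b`.
 * `holds_data` is false for completions that leave nothing to process,
 * for example a truncation error the application simply drops. */
static void mrecv_on_completion(struct mrecv_buf *b, uint64_t flags,
                                bool holds_data)
{
    if (holds_data)
        b->refs++;
    if (flags & MODEL_MULTI_RECV)
        b->released = true;
}

/* Call when the application has finished with one message.
 * Returns true when the buffer may be reposted. */
static bool mrecv_put(struct mrecv_buf *b)
{
    assert(b->refs > 0);
    b->refs--;
    return b->released && b->refs == 0;
}

/* Covers the case where the releasing completion held no data and no
 * other message from the buffer is outstanding. */
static bool mrecv_idle(const struct mrecv_buf *b)
{
    return b->released && b->refs == 0;
}
```

The application reposts a buffer only once the releasing completion has been seen and the last outstanding message has been handed back, so it never needs to infer buffer boundaries from the addresses of later completions.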

By the time FI_MULTI_RECV is set on a completion, no additional completions will be generated for that buffer.  At least that is how it should work.  If not, this sounds like a bug in the provider.  Error completions should be reported in order with non-error completions.
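
Whether a given completion or error entry is the one that retires the buffer comes down to testing its flags. As a small sketch, the helper below decodes the flags value from the gdb dump quoted earlier (0x10402). The bit positions are copied from my reading of <rdma/fabric.h> and are redefined under `MODEL_` names only so the example builds without libfabric; `decode_flags` is a hypothetical helper, not a library call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed to match <rdma/fabric.h>; real code should include that header. */
#define MODEL_FI_MSG        (1ULL << 1)
#define MODEL_FI_READ       (1ULL << 8)
#define MODEL_FI_RECV       (1ULL << 10)
#define MODEL_FI_MULTI_RECV (1ULL << 16)

/* Writes the names of the known bits set in `flags` into `out`,
 * separated by '|'. Bits not listed in the table are ignored. */
static void decode_flags(uint64_t flags, char *out, size_t outlen)
{
    static const struct { uint64_t bit; const char *name; } known[] = {
        { MODEL_FI_MULTI_RECV, "FI_MULTI_RECV" },
        { MODEL_FI_RECV,       "FI_RECV" },
        { MODEL_FI_READ,       "FI_READ" },
        { MODEL_FI_MSG,        "FI_MSG" },
    };

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (!(flags & known[i].bit))
            continue;
        size_t used = strlen(out);
        int n = snprintf(out + used, outlen - used, "%s%s",
                         used ? "|" : "", known[i].name);
        if (n < 0 || (size_t)n >= outlen - used)
            break;  /* output buffer too small: stop with a truncated string */
    }
}
```

For 0x10402 this yields `FI_MULTI_RECV|FI_RECV|FI_MSG`, so the error entry in the dump is itself the completion that retires the buffer.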

> So, how is one supposed to process this error?   Assuming ‘buf’ is corrected to point
> to the truncated message, I would first have to use it to deduce the identity of the
> multi-recv buffer that contains the error (if I wasn’t using op_context).  Then I would
> have to read and process completion events until I find an event that is in that same
> buffer.   Then I would continue to process more events until I see an event that is NOT
> in that buffer, proving that I must have therefore finished processing all events that
> were in the buffer.   At that point I could finally recycle the multi-recv  buffer per
> the FI_MULTI_RECV flag.    That assumes that there will be at least one completion
> event available for the buffer that follows the completed multi-recv buffer, but that
> might not be the case.   Or it might be that the message that was truncated was the
> ONLY message to be placed in the completed buffer, in which case I’ll have no way of
> detecting when it is safe to recycle that multi-recv buffer.    Hmmm, I guess if I keep
> a list of all the multi-recv buffers I’ve posted, and the order in which I’ve posted
> them, then I would know the address of the buffer following the completed buffer, and I
> could look for a completion event in that buffer.   But still, this seems really
> complicated.
> 
> 
> 
> If you haven’t already figured it out, I have been leading the conversation: I would
> like to suggest an alternate implementation that makes all these issues go away.   It
> might be considered a slight change to the libfabric rules, so I suppose it needs to be
> more widely debated, but I’d like to at least start the discussion.   Given that it is
> a change, other providers might also be affected, does anyone know if other providers
> implement this differently?
> 
> 
> 
> If you look at the call to sock_pe_report_error(), all of the fields that get reported
> in the fi_cq_err_entry come from the pe_entry for the message.   Rather than posting an
> error immediately via a ring buffer to the CQ, why not post the pe_entry to the CQ as
> a normal completion event in the current time order, but have it be marked internally
> as an error?   The application would then read completions from the CQ as usual, and it
> wouldn’t see any error until it read the completion event for the truncated message
> itself.  So at that point the application’s call to fi_cq_sreadfrom() would fail with
> FI_EAVAIL.   It knows that the fi_cq_err_entry it next reads using fi_cq_readerr will
> apply to that CQ entry that just got the error, so the FI_SOURCE fi_addr_t that was
> returned by fi_cq_sreadfrom() is still available, the app knows the identity of the
> truncated message’s client.   And if ‘buf’ is fixed, it can parse the beginning of the
> truncated message to identify the particular message which it will ask the client to
> replay.   Finally, since the FI_MULTI_RECV bit won’t show up in a completion event
> until the fi_cq_err_entry has been read, the app knows that all the messages in the
> expended buffer have already been processed, it can always immediately recycle the
> buffer whenever it first sees the bit set, with a minimum of code.
> 
> 
> 
> This is the only way that I can see to prevent the client’s FI_SOURCE fi_addr_t from
> being lost during error processing.  For now I guess I will have to implement message
> timeouts in the clients, such that if a message is not acknowledged within some period
> of time, the client assumes it was truncated and discarded and sends it again.   This
> is certainly a less desirable solution.

I thought about this option as well.  I don't like needing to carry state between two calls into the library.  That is problematic for any multi-threaded use case.  The best option is for fi_cq_err_entry to convey the source address somehow, even if we need to extend the structure.
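
To illustrate why carrying the source address in the error entry avoids cross-call state, here is a purely hypothetical sketch. `struct model_err_entry` is not the real fi_cq_err_entry; it borrows a few of its field names and adds an assumed `src_addr` member, and `resend_target` and `MODEL_ADDR_UNKNOWN` are likewise made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETRUNC       265         /* err value seen in the quoted dump */
#define MODEL_ADDR_UNKNOWN UINT64_MAX  /* hypothetical "no source" marker */

/* Hypothetical error entry: a subset of fields in the spirit of
 * fi_cq_err_entry, plus a src_addr member that the structure discussed
 * in this thread does not have. */
struct model_err_entry {
    void    *op_context;
    size_t   len;
    size_t   olen;
    int      err;
    uint64_t src_addr;
};

/* Returns the peer to ask for a resend, or MODEL_ADDR_UNKNOWN if the
 * error is not a truncation. Everything needed is in the entry itself,
 * so no state from an earlier CQ read has to be remembered. */
static uint64_t resend_target(const struct model_err_entry *e)
{
    assert(e != NULL);
    if (e->err != MODEL_ETRUNC)
        return MODEL_ADDR_UNKNOWN;
    return e->src_addr;
}
```

Because the decision depends only on one self-contained entry, any thread that reads the error can act on it without coordinating with whichever thread saw the preceding FI_EAVAIL.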

- Sean