[libfabric-users] problems with FI_MULTI_RECV buffer errors in the sockets provider (and perhaps others?)
Kevan Rehm
krehm at cray.com
Fri Apr 19 14:27:55 PDT 2019
Greetings,
I’ve been debugging my way through the sockets provider for the last few days because I am having problems using FI_MULTI_RECV buffers. The code seems to be doing what the man pages say it should do, but I can’t see how to recover from RX completion queue errors when they occur. For this discussion I am only describing the difficulties of processing an FI_ETRUNC error; I haven’t yet considered other errors. Nor have I tried other providers like TCP+RDM; that’s probably the next step, but in the meantime I’d like folks to be aware of this.
I know about the fi_setopt(…, FI_OPT_MIN_MULTI_RECV, …) call, but that doesn’t completely solve my problem. It only helps avoid message truncation if all of your messages tend to be the same size, you can predict the largest message you might ever receive, and you set the function’s ‘optval’ parameter to that size. And then, to make multi-recv buffers efficient, the buffers should be many times that size. That’s not always possible.
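For reference, here is a minimal sketch of how the option gets set; MAX_EXPECTED_MSG is my placeholder for the application’s best guess at the largest message, not a libfabric constant:

    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>

    /* Application guess at the largest message; not a libfabric constant. */
    #define MAX_EXPECTED_MSG 4096

    static int set_min_multi_recv(struct fid_ep *ep)
    {
        size_t min_recv = MAX_EXPECTED_MSG;

        /* Once fewer than min_recv bytes remain in a multi-recv buffer,
         * the provider retires the buffer rather than packing another
         * (possibly truncated) message into it. */
        return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MIN_MULTI_RECV,
                         &min_recv, sizeof(min_recv));
    }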
First mystery: I have many FI_MULTI_RECV buffers queued on the server’s endpoint, yet the sockets provider will truncate a message in order to fit it into the multi-recv buffer at the head of the list. Why deliberately destroy a message when there are follow-on buffers large enough to hold it? The code could instead generate an FI_MULTI_RECV completion event for the current buffer, then place the new message in the next buffer and avoid any truncation. Is there something I am missing here?
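For context, this is roughly how I post the buffers; a sketch, where ‘desc’ is the memory-registration descriptor if the provider requires one (NULL otherwise), and ‘ctx’ is the per-buffer context I describe further below:

    #include <sys/uio.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>

    /* Post one region as a multi-recv buffer. */
    static int post_multi_recv(struct fid_ep *ep, void *buf, size_t size,
                               void *desc, void *ctx)
    {
        struct iovec iov = {
            .iov_base = buf,
            .iov_len  = size,
        };
        struct fi_msg msg = {
            .msg_iov   = &iov,
            .desc      = &desc,
            .iov_count = 1,
            .addr      = FI_ADDR_UNSPEC,
            .context   = ctx,
            .data      = 0,
        };

        /* FI_MULTI_RECV tells the provider to keep packing incoming
         * messages into this buffer until it is (nearly) consumed. */
        return fi_recvmsg(ep, &msg, FI_MULTI_RECV);
    }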
The fi_cq_err_entry struct for the error is produced by sock_cq_report_error(), which is called by sock_pe_report_rx_error(), which is called from sock_pe_process_rx_send(). Note that the fi_cq_err_entry structure does not contain an fi_addr_t field, so the FI_SOURCE address of the client that sent the truncated message is lost; there is no way to identify the affected client and send it a “please-send-msg-XXX-again” message. I don’t see any solution for this.
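For reference, here is the structure as defined in rdma/fi_eq.h; note that none of its fields can carry the source address:

    struct fi_cq_err_entry {
        void     *op_context;   /* operation context */
        uint64_t flags;         /* completion flags */
        size_t   len;           /* size of received data */
        void     *buf;          /* receive data buffer */
        uint64_t data;          /* completion data */
        uint64_t tag;           /* message tag */
        size_t   olen;          /* overflow length */
        int      err;           /* positive error code */
        int      prov_errno;    /* provider error code */
        void     *err_data;     /* additional error data */
        size_t   err_data_size; /* size of err_data */
    };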
There is nothing within the fi_cq_err_entry to identify which message from the client was lost, either. Here is an example of an error about to be posted to the socket completion queue’s ring buffer by sock_cq_report_error(), in both decimal and hex format. The flags field corresponds to FI_MULTI_RECV|FI_RECV|FI_MSG. In my case I am not using ‘data’ or ‘tag’, and many messages have a length of 96, so there is nothing unique here to identify the message.
(gdb) p errbuf
$9 = {op_context = 0x7fffe8000b40, flags = 66562, len = 96, buf = 0x0, data = 1365, tag = 0, olen = 4, err = 265, prov_errno = -265, err_data = 0x20000, err_data_size = 0}
(gdb) p/x errbuf
$10 = {op_context = 0x7fffe8000b40, flags = 0x10402, len = 0x60, buf = 0x0, data = 0x555, tag = 0x0, olen = 0x4, err = 0x109, prov_errno = 0xfffffef7, err_data = 0x20000, err_data_size = 0x0}
Notice that the ‘buf’ field is zero, so you can’t even find the data that was copied into the buffer up to the point of the truncation. That ‘buf’ value comes from sock_cq_report_error() in the following piece of code; it takes the ‘if’ branch:
if (entry->type == SOCK_PE_RX)
    err_entry.buf = (void *) (uintptr_t) entry->pe.rx.rx_iov[0].iov.addr;
else
    err_entry.buf = (void *) (uintptr_t) entry->pe.tx.tx_iov[0].src.iov.addr;
I suspect that this is a bug, because in sock_pe_process_rx_send() I can dump the multi-recv buffer and see that bytes were in fact copied into the buffer up to the truncation point. If ‘buf’ were set to the correct value, then at least I might be able to parse the beginning of the truncated message to find a message ID, which, if I could then somehow also get the client’s fi_addr_t, would allow me to send a “please-send-msg-XXX-again” message to that client.
Note that the FI_MULTI_RECV flag is set in the fi_cq_err_entry, but because ‘buf’ is zero, there is no way to identify which multi-recv buffer is full! In my case I happen to use the op_context pointer as a place to record the address of the multi-recv buffer in which the message landed, so I know which buffer is full. But this is still awkward: this completion error is scores of completions into the future from where I am currently reading with fi_cq_sread(); I haven’t even gotten to any completions that are in that buffer yet. If I recycle the buffer immediately based on the FI_MULTI_RECV flag, I will destroy the data for all the upcoming completions that I haven’t yet read.
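Concretely, the per-buffer context I pass at post time looks something like this; the layout is mine, not anything libfabric mandates, and the fi_context member is only needed if the provider’s mode bits include FI_CONTEXT:

    #include <stddef.h>
    #include <rdma/fabric.h>

    /* My bookkeeping, passed as 'context' when posting each multi-recv
     * buffer; the op_context in every completion (and in the error entry)
     * then maps back to the buffer the message landed in. */
    struct mrecv_ctx {
        struct fi_context fi_ctx; /* provider scratch space; keep it first */
        void              *buf;   /* start of the multi-recv buffer */
        size_t            size;   /* total size of the buffer */
    };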
So, how is one supposed to process this error? Assuming ‘buf’ is corrected to point to the truncated message, the procedure would have to be something like the following (see the sketch after this list):

1. Use ‘buf’ to deduce the identity of the multi-recv buffer that contains the error (if I weren’t already using op_context for that).
2. Read and process completion events until I find an event that is in that same buffer.
3. Continue processing events until I see an event that is NOT in that buffer, proving that I must have finished processing all the events that were in it.
4. Only then recycle the multi-recv buffer per the FI_MULTI_RECV flag.

That assumes there will be at least one completion event for the buffer that follows the completed multi-recv buffer, which might not be the case. Or the truncated message might have been the ONLY message placed in the completed buffer, in which case I’ll have no way of detecting when it is safe to recycle that multi-recv buffer. Hmmm, I guess if I keep a list of all the multi-recv buffers I’ve posted, in posting order, then I would know the address of the buffer following the completed one, and I could watch for a completion event in that buffer. But still, this seems really complicated.
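To make that bookkeeping concrete, here is a sketch of the recycle test this procedure implies; all of the names are hypothetical, with posted buffers kept in a singly-linked list in posting order:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical bookkeeping: multi-recv buffers in posting order. */
    struct posted_buf {
        void              *buf;        /* start of the buffer */
        size_t            size;        /* total size of the buffer */
        struct posted_buf *next;       /* buffer posted after this one */
        bool              error_seen;  /* FI_ETRUNC reported in this buffer */
    };

    /* Return the posted buffer containing 'addr', or NULL. */
    static struct posted_buf *owning_buf(struct posted_buf *head,
                                         const void *addr)
    {
        for (struct posted_buf *p = head; p; p = p->next) {
            const char *a = addr, *b = p->buf;
            if (a >= b && a < b + p->size)
                return p;
        }
        return NULL;
    }

    /* A buffer with a pending error can only be recycled once a later
     * completion lands in the buffer that FOLLOWS it, proving all of its
     * own completions have been drained; even this fails if the truncated
     * message was the only message placed in the buffer. */
    static bool safe_to_recycle(struct posted_buf *errored,
                                const void *completion_addr,
                                struct posted_buf *head)
    {
        return errored->error_seen &&
               owning_buf(head, completion_addr) == errored->next;
    }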
If you haven’t already figured it out, I have been leading up to this: I would like to suggest an alternate implementation that makes all of these issues go away. It might be considered a slight change to the libfabric rules, so I suppose it needs to be more widely debated, but I’d like to at least start the discussion. Given that it is a change, other providers might also be affected; does anyone know whether other providers implement this differently?
If you look at the call to sock_pe_report_error(), all of the fields that get reported in the fi_cq_err_entry come from the pe_entry for the message. Rather than posting an error immediately to the CQ via a ring buffer, why not post the pe_entry to the CQ as a normal completion event in the current time order, but have it be marked internally as an error? The application would then read completions from the CQ as usual, and it wouldn’t see any error until it read the completion event for the truncated message itself. At that point the application’s call to fi_cq_sreadfrom() would fail with -FI_EAVAIL. It knows that the fi_cq_err_entry it next reads using fi_cq_readerr() applies to the CQ entry that just got the error, so the FI_SOURCE fi_addr_t for that entry is still available; the app knows the identity of the truncated message’s client. And if ‘buf’ is fixed, it can parse the beginning of the truncated message to identify the particular message that it will ask the client to replay. Finally, since the FI_MULTI_RECV bit won’t show up in a completion event until the fi_cq_err_entry has been read, the app knows that all the messages in the expended buffer have already been processed; it can always immediately recycle the buffer the first time it sees the bit set, with a minimum of code.
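To illustrate, under the proposed semantics the server’s progress loop could be as simple as this sketch. parse_msg_id(), request_replay(), process_msg(), and recycle_multi_recv_buf() are hypothetical application helpers, and the sketch assumes both the in-order error delivery proposed above (including the source address still being available for the failed entry) and the ‘buf’ fix:

    #include <rdma/fabric.h>
    #include <rdma/fi_eq.h>

    /* Hypothetical application helpers, for illustration only. */
    extern uint64_t parse_msg_id(const void *buf, size_t len);
    extern void request_replay(fi_addr_t client, uint64_t msg_id);
    extern void process_msg(fi_addr_t client,
                            const struct fi_cq_data_entry *comp);
    extern void recycle_multi_recv_buf(void *op_context);

    static void progress_once(struct fid_cq *cq)
    {
        struct fi_cq_data_entry comp;
        struct fi_cq_err_entry err = { 0 };
        fi_addr_t src = FI_ADDR_UNSPEC;
        ssize_t ret;

        ret = fi_cq_sreadfrom(cq, &comp, 1, &src, NULL, -1);
        if (ret == 1) {
            process_msg(src, &comp);
            /* With in-order error delivery, FI_MULTI_RECV on a successful
             * completion means every message in the buffer has already
             * been read: recycle immediately. */
            if (comp.flags & FI_MULTI_RECV)
                recycle_multi_recv_buf(comp.op_context);
        } else if (ret == -FI_EAVAIL) {
            /* The error entry sits at the current read position, so (per
             * the proposal) 'src' identifies the truncated message's
             * client. */
            if (fi_cq_readerr(cq, &err, 0) == 1 && err.err == FI_ETRUNC) {
                /* With 'buf' fixed, the partial data can be parsed for
                 * a message ID to ask the client to replay. */
                request_replay(src, parse_msg_id(err.buf, err.len));
                if (err.flags & FI_MULTI_RECV)
                    recycle_multi_recv_buf(err.op_context);
            }
        }
    }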
This is the only way I can see to prevent the client’s FI_SOURCE fi_addr_t from being lost during error processing. For now, I guess I will have to implement message timeouts in the clients, such that if a message is not acknowledged within some period of time, the client assumes it was truncated and discarded and sends it again. This is certainly a less desirable solution.
Comments welcome,
Kevan