[ofiwg] send/recv "credits"

Wed Sep 24 10:54:52 PDT 2014

On 09/24/2014 01:35 PM, Hefty, Sean wrote:
> Copying ofiwg
>
>
>
>> Perhaps this is an ongoing discussion, but I don't see any
>> documentation or issues about the max number of pending
>> (non-completed) send or receive operations, or how to track them.
>> I see the issue on "flow control (#69)" - I am guessing that is not
>> this topic, but rather wire flow control?
>
> See https://github.com/ofiwg/libfabric/issues/10 for a related
> discussion that ties back into this.
>
> Also see https://github.com/ofiwg/libfabric/issues/31.
>
>> Anyhow, one of the differences between our hardware and IB hardware
>> is the way our send and receive queue entries work.  My
>> understanding is that each IB queue entry can take an SGL, where
>> each of ours simply takes an address and a length.
>>
>> This difference percolates up into the API to mean that for IB, the
>> max number of pending sends is independent of the number of SGEs in
>> each operation, so it makes sense for verbs to report the number of
>> queue entries available, and credit accounting is that each
>> send/receive consumes one entry.
>>
>> For usnic, the number of queue entries consumed is variable, based
>> on the number of SGEs in the send.  For us to implement the verbs
>> model on top of our HW required us to define a max SGE/send, and
>> divide the number of HW queue entries by that max, and report that
>> reduced number as the number of queue entries available.  This has
>> the effect of artificially lowering both our reported max SGE per
>> operation and also the reported queue depth.
>>
>> When implementing our own API for the hardware, we define each QP
>> (endpoint) to have a fixed number of credits (HW queue entries)
>> associated, and allow each send/recv operation to consume a
>> variable (but well defined) number of credits, which makes much
>> more efficient use of the queue entries.
>>
>> So, I'm hoping that libfabric will use a credit model that supports
>> this "variable # of credits per operation" approach, or something
>> equivalent. What's the current thinking on this?
>
> I agree that something is needed, but I'm not sure what exactly.  The
> association of an endpoint with a specific number of queue entries or
> credits appears to be needed, but also limiting.  Currently the data
> transfer APIs are allowed to return FI_EBUSY or FI_EAGAIN (or
> something like that) to indicate that a request cannot be queued by
> the provider.  But I agree that an app should have access to some
> sort of credit count.
>
> In my last response to Doug on issue 10 above, I suggested defining a
> value such as minimum_credit_count.  An app is guaranteed to be able
> to initiate this many transfers, but may be able to issue more.  For
> the case of usnic, there would be a couple of options.  One would be
> to report a max SGL size of 1, with min_credit_count equal to the
> queue depth.  A second would be to report a max SGL size of N, with
> min_credit_count equal to queue depth / N.  The latter option could
> support the app queuing more requests, with the provider returning
> FI_EAGAIN when the underlying queue was full.
>
> This seems reasonably simple for the app to use, but could leave some
> entries unused for apps that manage their own credits, since the
> provider must be prepared to handle the worst case (i.e. all
> transfers use the max SGL).
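
To put numbers on the two usnic options above, here is a rough sketch.
The names (credit_report, report_credits, idle_entries) are made up for
illustration and are not part of libfabric; it only assumes that the
provider must budget max_sgl_size hardware entries per guaranteed
transfer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: what a provider might advertise for one endpoint. */
struct credit_report {
    size_t max_sgl_size;     /* largest SGL accepted per transfer */
    size_t min_credit_count; /* transfers guaranteed to be queueable */
};

/* Worst case: every transfer uses max_sgl_size hardware entries. */
static struct credit_report report_credits(size_t queue_depth,
                                            size_t max_sgl_size)
{
    assert(max_sgl_size > 0);
    struct credit_report r = { max_sgl_size, queue_depth / max_sgl_size };
    return r;
}

/* Hardware entries left idle if the app stops at min_credit_count
 * transfers that each use only sge_per_op entries. */
static size_t idle_entries(size_t queue_depth, struct credit_report r,
                           size_t sge_per_op)
{
    assert(sge_per_op >= 1 && sge_per_op <= r.max_sgl_size);
    return queue_depth - r.min_credit_count * sge_per_op;
}
```

With a queue depth of 64 (an arbitrary figure), option one gives
report_credits(64, 1) and option two gives report_credits(64, 4).  In
the second case an app that only ever sends single-SGE messages and
stops at min_credit_count leaves most of the queue unused, which is the
gap that FI_EAGAIN on overflow is meant to cover.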

How about this:

An endpoint is created with a certain number of credits.  What a credit
represents depends on the provider: it could be an SGE, as Jeff talks
about, or a WQE, as you get with IB.  We report the maximum number of
credits to the user either at endpoint creation time via the attribute
struct, or later on a call to get_info or something to that effect.
Somewhere in there, we also create an enum for the credit type, be that
FI_CREDIT_SGE or FI_CREDIT_QUEUE_ENTRY, whatever.  The application will
then have most of the information it needs to do its own accounting
properly.  The one other thing it will need, which we discussed in
issue #10, is whether things like RDMA_WRITE_WITH_IMMEDIATE require one
or two queue entries for the library to complete.  A query of the
device's capabilities, combined with a statement in the man pages of
exactly which scenarios cause the library to send a second message to
complete what can't be done with the first, would allow the application
to know for sure how many credits each message will take.
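
Roughly, the application-side accounting could look like the sketch
below.  None of this exists in libfabric today: the enum values are the
ones proposed above, and ep_credits, imm_needs_extra, op_cost, reserve
and release are placeholder names.  It also assumes a second message
costs exactly one extra credit, which a provider would have to confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Proposed credit types from this thread; not an existing API. */
enum fi_credit_type { FI_CREDIT_SGE, FI_CREDIT_QUEUE_ENTRY };

/* Hypothetical per-endpoint state kept by the application. */
struct ep_credits {
    enum fi_credit_type type;
    size_t max_credits;   /* reported by the provider */
    size_t in_use;        /* credits held by pending operations */
    int imm_needs_extra;  /* assumed capability: write-with-immediate
                             needs a second queue entry */
};

/* Credits one operation will consume under the reported model. */
static size_t op_cost(const struct ep_credits *c, size_t sge_count,
                      int with_imm)
{
    assert(sge_count >= 1);
    size_t cost = (c->type == FI_CREDIT_SGE) ? sge_count : 1;
    if (with_imm && c->imm_needs_extra)
        cost += 1;
    return cost;
}

/* Returns 1 and charges the credits if they fit, else returns 0. */
static int reserve(struct ep_credits *c, size_t cost)
{
    if (cost > c->max_credits - c->in_use)
        return 0;
    c->in_use += cost;
    return 1;
}

/* Called when the matching completion is reaped. */
static void release(struct ep_credits *c, size_t cost)
{
    assert(cost <= c->in_use);
    c->in_use -= cost;
}
```

The app would call op_cost() and reserve() before posting, and
release() with the same cost when the completion comes back.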

But, if you want to make things easier, maybe we could enable a flag on
an endpoint, say FI_EP_RETURN_CREDITS, that would cause the provider to
change the return code from all of the various send methods.  Instead
of being 0 on success and -1 on error with errno set, it would be -1 on
error with errno set, and on success the number of credits used to send
the message.  Then the application *could* be fully aware of all the
things above, or it could just opt to let the provider tell it where it
stands on credits with each send (which has the benefit of future
proofing the application against subtle changes to how credits are
counted in any provider).  This implies that we would also want to add
a credits field to completions and let the user know with each
completion how many credits were freed up.  If the user chooses not to
get notifications on all completions, then we need to decide whether we
queue up credits from silent completions and report all of them on the
first non-silent completion, or add a fi_endpoint_query_credits() call
to get the current amount.
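
A toy model of that behavior, just to show the shape of it.  mock_ep,
mock_send and mock_complete are invented for this sketch and stand in
for a provider; it assumes one credit per SGE and takes the "queue up
silent credits" choice rather than the query call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider-side state for one endpoint. */
struct mock_ep {
    size_t credits_free;  /* credits currently available */
    size_t unreported;    /* freed credits not yet reported to the app */
};

/* With the proposed flag set: returns the credits consumed on success,
 * or -1 with errno = EAGAIN when the send does not fit. */
static int mock_send(struct mock_ep *ep, size_t sge_count)
{
    assert(sge_count > 0);
    if (sge_count > ep->credits_free) {
        errno = EAGAIN;
        return -1;
    }
    ep->credits_free -= sge_count;
    return (int)sge_count;
}

/* An operation holding `credits` finishes.  Returns the value for the
 * completion's credits field: 0 for a silent completion, otherwise
 * everything freed since the last reported completion. */
static size_t mock_complete(struct mock_ep *ep, size_t credits, int silent)
{
    ep->credits_free += credits;
    ep->unreported += credits;
    if (silent)
        return 0;
    size_t reported = ep->unreported;
    ep->unreported = 0;
    return reported;
}
```

The app never needs to know how the provider counts: it subtracts
whatever a send returns and adds back whatever a completion reports.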

Anyway, just a thought.