[ofiwg] send/recv "credits"
Reese Faucette (rfaucett)
rfaucett at cisco.com
Wed Sep 24 11:29:31 PDT 2014
> -----Original Message-----
> From: Doug Ledford [mailto:dledford at redhat.com]
> Sent: Wednesday, September 24, 2014 10:55 AM
> To: Hefty, Sean; Reese Faucette (rfaucett); Jeff Squyres (jsquyres)
> Cc: ofiwg at lists.openfabrics.org
> Subject: Re: [ofiwg] send/recv "credits"
> On 09/24/2014 01:35 PM, Hefty, Sean wrote:
> > Copying ofiwg
> >> Perhaps this is an ongoing discussion, but I don't see any
> >> documentation or issues about the max number of pending
> >> (non-completed) send or receive operations, or how to track them.
> >> I see the issue on "flow control (#69)" - I am guessing that is not
> >> this topic, but rather wire flow control?
> > See https://github.com/ofiwg/libfabric/issues/10 for a related
> > discussion that ties back into this.
> > Also see https://github.com/ofiwg/libfabric/issues/31.
> >> Anyhow, one of the differences between our hardware and IB hardware
> >> is the way our send and receive queue entries work. My understanding
> >> is that each IB queue entry can take an SGL, where each of ours
> >> simply takes an address and a length.
> >> This difference percolates up into the API to mean that for IB, the
> >> max number of pending sends is independent of the number of SGEs in
> >> each operation, so it makes sense for verbs to report the number of
> >> queue entries available, and credit accounting is that each
> >> send/receive consumes one entry.
> >> For usnic, the number of queue entries consumed is variable, based on
> >> the number of SGEs in the send. For us to implement the verbs model
> >> on top of our HW required us to define a max SGE/send, and divide the
> >> number of HW queue entries by that max, and report that reduced
> >> number as the number of queue entries available. This has the effect
> >> of artificially lowering both our reported max SGE per operation and
> >> also the reported queue depth.
> >> When implementing our own API for the hardware, we define each QP
> >> (endpoint) to have a fixed number of credits (HW queue entries)
> >> associated, and allow each send/recv operation to consume a variable
> >> (but well defined) number of credits, which makes much more efficient
> >> use of the queue entries.
> >> So, I'm hoping that libfabric will use a credit model that supports
> >> this "variable # of credits per operation" approach, or something
> >> equivalent. What's the current thinking on this?
> > I agree that something is needed, but I'm not sure what exactly. The
> > association of an endpoint with a specific number of queue entries or
> > credits appears to be needed, but also limiting. Currently the data
> > transfer APIs are allowed to return FI_EBUSY or FI_EAGAIN (or
> > something like that) to indicate that a request cannot be queued by
> > the provider. But I agree that an app should have access to some sort
> > of credit count.
> > In my last response to Doug on issue 10 above, I suggested defining a
> > value such as minimum_credit_count. An app is guaranteed to be able
> > to initiate this many transfers, but may be able to issue more. For
> > the case of usnic, there would be a couple of options. One would be
> > to report a max SGL size of 1, with min_credit_count equal to the
> > queue depth. A second would be to report a max SGL size of N, with
> > min_credit_count equal to queue depth / N. The latter option could
> > support the app queuing more requests, with the provider returning
> > FI_EAGAIN when the underlying queue was full.
> > This seems reasonably simple for the app to use, but could leave some
> > entries unused for apps that manage their own credits, since the
> > provider must be prepared to handle the worst case (i.e. all transfers
> > use the max SGL).
> How about this:
> An endpoint is created with a certain number of credits. This depends on
> the provider. The credits could be SGE entries as Jeff talks about, it could be
> WQE entries as you get with IB. We report the maximum number of credits
> to the user either at endpoint creation time via the attribute struct, or later
> on a call to get_info or something to that effect. And somewhere in there,
> we create an enum for the credit type, be that FI_CREDIT_SGE or
> FI_CREDIT_QUEUE_ENTRY, whatever. Then the application will have (most)
> of the information it needs to properly do its own accounting. The one
> other thing that it will need, which we discussed in issue #10, is that the
> application will need to know if things like RDMA_WRITE_WITH_IMMEDIATE
> require one or two queue entries for the library to complete. So a query of
> the device's capabilities, combined with a statement in the man pages of
> under exactly which scenarios the library will send a second message to
> complete what can't be done with the first message, would allow the
> application to know for sure how many credits each message will take.
> But, if you want to make things easier, maybe we could enable a flag on an
> endpoint, so FI_EP_RETURN_CREDITS that would cause the provider to
> update the return code from all of the various send methods so that instead
> of being 0 on success, -1 on err with errno set, it could be -1 on err with
> errno set, else success with the number indicating the number of credits
> used to send the message. Then the application
> *could* be fully aware of all the things above, or it could just opt to let the
> provider tell it where it stands on credits with each send (which has the
> benefit of future proofing the application against subtle changes to how
> credits are counted in any provider). This implies that we would also want
> to add a credits field to completions and let the user know with each
> completion how many credits were freed up. If the user chooses not to get
> notifications on all completions, then we need to decide if we queue up
> credits from silent completions and on the first non-silent completion
> report all of them, or if we add a
> fi_endpoint_query_credits() call to get the current amount.
> Anyway, just a thought.
Providing a mechanism for the app to efficiently query available credits, along with the "max credits" a send/recv can consume, is perhaps the easiest approach. Reserving max_credits per operation is much more palatable than dividing the queue depth by max_credits, and that would work for us. This could even be collapsed into something like "fi_ep_ok_to_send(num_sends)", which returns true if the EP has space to post num_sends maximally sized sends.
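As a rough sketch of the "ok to send" check (the struct and its field names are made up for illustration, and nothing here is existing libfabric API), assuming the EP tracks free HW queue entries and knows the worst-case number of entries one send can consume:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-endpoint credit state; these names are illustrative only. */
struct ep_credits {
    size_t avail_credits;        /* HW queue entries currently free */
    size_t max_credits_per_send; /* worst-case entries one send can consume */
};

/* Returns true if num_sends maximally sized sends are guaranteed to fit. */
static inline bool ep_ok_to_send(const struct ep_credits *ep, size_t num_sends)
{
    if (ep->max_credits_per_send == 0)
        return false; /* misconfigured endpoint: refuse rather than divide by 0 */

    /* Divide instead of multiplying so a large num_sends cannot overflow. */
    return num_sends <= ep->avail_credits / ep->max_credits_per_send;
}
```

With 64 free entries and a worst case of 4 entries per send, this admits up to 16 sends at once, yet a send that actually uses one entry still only costs one entry when posted.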
In your opinion, would the apps that want to do their own credit management be content to let libfabric maintain the credit calculations but make them queryable? The app could then always do "if (fi_ep_get_send_credits() >= min_credits) ...", where min_credits would be an attribute returned when the EP is created. I think this is my current favorite approach. fi_ep_get_send_credits() would likely just be an inline that returns a value from the ep_fid.
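To make the queryable model concrete, here is a minimal sketch. The struct, its fields, and the post/complete helpers are all hypothetical stand-ins for provider internals, and it assumes one credit per SGE, which is how I described our queue entries above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical credit state that would live alongside the ep_fid. */
struct ep_state {
    size_t send_credits; /* queue entries currently free */
    size_t max_credits;  /* total HW queue entries for this EP */
    size_t min_credits;  /* reported at EP creation: covers one max-sized send */
};

/* The cheap inline query: just reads a value kept by the provider. */
static inline size_t ep_get_send_credits(const struct ep_state *ep)
{
    return ep->send_credits;
}

/* Post a send with n_sge entries. Returns 0, or -1 if credits are short. */
static inline int ep_post_send(struct ep_state *ep, size_t n_sge)
{
    if (n_sge > ep->send_credits)
        return -1;
    ep->send_credits -= n_sge; /* variable cost: one credit per SGE (assumed) */
    return 0;
}

/* A completion hands back the credits that the send consumed. */
static inline void ep_complete_send(struct ep_state *ep, size_t n_sge)
{
    assert(ep->send_credits + n_sge <= ep->max_credits);
    ep->send_credits += n_sge;
}
```

The app-side test is then the single comparison "ep_get_send_credits(ep) >= ep->min_credits" before each post, with the provider doing all of the bookkeeping.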
I agree that having post operations return the number of credits consumed, and completions report the credits returned, is also viable; it just seems a bit more complex for both the app and the lib.
Providers like sockets could set credits to 1 and min_credits to 1, and just never reduce current credits, so that the credit test would always succeed.
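For the sockets-style case, a sketch under the same caveats (hypothetical names, not real provider code) would simply never touch the counter:

```c
#include <stddef.h>

/* Hypothetical state for a provider with no real queue-depth limit. */
struct sock_ep {
    size_t send_credits;
    size_t min_credits;
};

static inline void sock_ep_init(struct sock_ep *ep)
{
    ep->send_credits = 1;
    ep->min_credits = 1;
}

static inline size_t sock_ep_get_send_credits(const struct sock_ep *ep)
{
    return ep->send_credits;
}

/* Posting consumes nothing, so the app's credit test can never fail. */
static inline int sock_ep_post_send(struct sock_ep *ep, size_t n_sge)
{
    (void)ep;
    (void)n_sge;
    return 0;
}
```

An app written against the credit check then runs unchanged on such a provider, since "credits >= min_credits" holds no matter how many sends are outstanding.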