[ofiwg] send/recv "credits"

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Mon Sep 29 11:29:05 PDT 2014

On Mon, Sep 29, 2014 at 05:51:51PM +0000, Hefty, Sean wrote:
> > Then you go from
> > 
> > if (have_sent < can_send) {
> >   post()
> > }
> > 
> > To:
> > 
> > if (failed_send) {
> >   if (post(failed_send) == EAGAIN)
> >       return ...
> > 
> > if (post(cur_send) == EAGAIN)
> >   failed_send = cur_send
> EAGAIN behaves similar to what an app would need to do if they were
> using sockets.

I recently put together an N:N messaging protocol scheme using SCTP -
which is a better analog than TCP sockets because it is message based,
has hidden flow control, and it uses EAGAIN and poll.

At the end of the day the complexity in dealing with messaging became
so high (the protocol had an unsolicited message element) that the only
sane solution was to introduce application end-to-end credit and
*explicit* kernel buffer sendq/recvq allocation (which SCTP allows).

This allowed a basic approach where processing a message could safely
generate a new reply message that could be guaranteed to be queued at
the kernel, so send() would never return EAGAIN.
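The credit rule above can be sketched in a few lines of C. This is a minimal sketch, not the actual protocol code: the `credit_state` struct and function names are illustrative, and it assumes credit grants piggy-back on inbound messages, as described below.

```c
#include <stdbool.h>

/* Application end-to-end send credits: one credit == one message the
 * peer has promised to have kernel recv buffer space for. */
struct credit_state {
	int send_credits;	/* replies we may queue without EAGAIN risk */
	int max_credits;	/* negotiated ceiling (peer's recv allocation) */
};

/* Reserve a credit before accepting work that may generate a reply,
 * so the eventual send() is guaranteed to be queued by the kernel. */
static bool try_consume_credit(struct credit_state *cs)
{
	if (cs->send_credits == 0)
		return false;
	cs->send_credits--;
	return true;
}

/* Apply a credit grant carried on an inbound message from the peer. */
static void grant_credits(struct credit_state *cs, int granted)
{
	cs->send_credits += granted;
	if (cs->send_credits > cs->max_credits)
		cs->send_credits = cs->max_credits;
}
```

The point is that the EAGAIN check moves from the send path (where unwinding is painful) to the accept path, where refusing work is cheap.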

There were several very ugly problems with the SCTP kernel API - SCTP
is a multi-ended reliable socket with hidden flow control. It uses a
single unified kernel send buffer that is not partitioned per end point.

A major problem is to avoid saturating the kernel send Q with too many
messages to a stalled end point. This would block communication with
all end points. The kernel provides no way to tell if an end point is
making progress, so the only solution is to manually partition the
unified send Q with explicit user end-to-end credits.
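The manual partitioning amounts to a per-end-point quota check before every post. A minimal sketch, with illustrative names (the real partition sizes would come from the negotiated credit scheme):

```c
#include <stdbool.h>

/* Each end point gets a fixed share of the one unified kernel send Q. */
struct endpoint {
	int inflight;	/* messages currently queued to the kernel for this peer */
	int quota;	/* this peer's slice of the shared send buffer */
};

/* A stalled peer can tie up at most 'quota' entries, so it can never
 * saturate the shared queue and block traffic to the other peers. */
static bool can_post(const struct endpoint *ep)
{
	return ep->inflight < ep->quota;
}
```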

In an IB verbs context this problem is solvable by directly monitoring
the completion Q - one technique I have used before is to queue but
not even start to execute recv completions until I know sends for
those recvs will complete. The recv is taken off the CQ, embedded
remote credit grants processed, but the message is held until the 'if
(have_sent < can_send)' test would succeed. Again this avoids having
to unwind processing, or unboundedly buffer, because a send could not
progress. Ultimately this scheme becomes bounded in memory because RQ
entries sitting on the internal queue do not generate a credit grant
back to the far side, so the system self-limits to its natural
messaging rate.
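The hold-the-recv technique can be sketched as a small queue in C. This is a sketch under stated assumptions, not real verbs code: `recv_msg`, `conn`, and both functions are hypothetical names, and the piggy-backed credit grant is modeled as a plain integer on the message.

```c
#include <stddef.h>

struct recv_msg {
	struct recv_msg *next;
	int credit_grant;	/* send credits piggy-backed by the far side */
};

struct conn {
	int have_sent, can_send;		/* send-side credit window */
	struct recv_msg *held_head, *held_tail;	/* recvs held, not yet run */
};

/* Take a recv off the CQ: absorb its embedded credit grant right away,
 * but hold the message itself until a send slot is guaranteed. */
static void on_recv_completion(struct conn *c, struct recv_msg *m)
{
	c->can_send += m->credit_grant;
	m->next = NULL;
	if (c->held_tail)
		c->held_tail->next = m;
	else
		c->held_head = m;
	c->held_tail = m;
}

/* Release a held message only while the credit test succeeds. Held
 * messages do not generate a credit grant back to the far side, so the
 * system self-limits instead of buffering unboundedly. */
static struct recv_msg *next_processable(struct conn *c)
{
	if (!c->held_head || c->have_sent >= c->can_send)
		return NULL;
	struct recv_msg *m = c->held_head;
	c->held_head = m->next;
	if (!c->held_head)
		c->held_tail = NULL;
	c->have_sent++;	/* reserve the send this recv may generate */
	return m;
}
```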

Bearing in mind, I am talking about async event driven apps, so
blocking at the send is not an option.

This example shows that hiding credits does not necessarily help apps
at all. In my case the app credit scheme is not required for correct
operation of the transport, but is required to bound memory
consumption by the app and to guarantee progress without huge
internal buffering.

EAGAIN works OK with 1:1 stream sockets that don't have a strong RPC
kind of semantic, or have an expectation that inbound messages
synchronously generate outbound replies and the requirement for
forward progress is fairly simple. N:N and more complex messaging
ordering starts to suck in that environment.

> Independent from EAGAIN, Does the op_size / iov_size / op_alignment
> proposal work for apps that want to track send queue usage separate
> from the provider's tracking?

I didn't follow it too closely, sorry.  How does an app adapt a
provider that is telling it to use sge entries to work with a wire
protocol that is defined in terms of wqes?

> > I'm sort of confused here, introducing hidden flow control is a
> > It doesn't seem something like that could exist without a hidden
> > on-the-wire flow control scheme of some sort (like RNR ACK).
> > 
> > We already basically have that requirement for many SRQ and XRC type
> > applications, so it isn't really a new thing.
> I didn't think SRQ or XRC protect against overrunning the remote CQ.

I meant in the sense that you can't realistically deploy any of those schemes
to oversubscribe the RQ without also coupling them with RNR ACK.

The remote CQ doesn't overflow because every SQE and RQE is still
guaranteed by the app to have an available CQE before it is posted. So
you are guaranteed to hit RQ exhaustion before you hit CQ exhaustion.
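The CQ-sizing invariant behind that guarantee reduces to simple arithmetic. A sketch (the function name is illustrative; real code would pass this value to the CQ creation call):

```c
/* Reserve one CQE per WQE the app may post: if the CQ is sized for the
 * full SQ depth plus the full RQ depth, a posted work request always
 * has a CQE waiting, so RQ exhaustion is hit before CQ exhaustion. */
static int required_cqe(int sq_depth, int rq_depth)
{
	return sq_depth + rq_depth;
}
```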

