[libfabric-users] completion at remote end of fi_read(msg)

Wed Apr 1 09:20:57 PDT 2020

> When I want to send N arrays from A to another node B, a small message is sent with N
> RMA keys in it, and the context is set to the object initiating the transfer on A. The
> receiving end says "ooh, N arrays to RMA" and starts doing RMA operations. When all N
> have completed, it sends an ACK to A, and A then marks the arrays as available again so
> the app can assume the buffers can be overwritten. All good. If those N RMA read
> operations initiated by B triggered a completion at A, then I could skip the ACK. The N
> RMA operations could trigger counter events held by the context on A (the object that
> initiated the transfer); once it got N of them, it would know it was done and could
> reuse the buffers.
> 
> This is what I meant by a per-context counter (rather than a per-endpoint counter):
> there might be lots of transfers started by A, but they happen via different objects
> (contexts).

I think you could achieve the desired result if a counter were associated with the memory region(s) backing the arrays.  The API defines such a counter for remote RMA writes to a memory region, but not for remote reads.  I'm not sure any provider implements the write counter support anyway.
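
For reference, the ACK-based flow described in the quoted message can be modelled roughly as follows.  This is a plain Python simulation, not libfabric code: the names Transfer, Receiver, and run_transfer are hypothetical, and it assumes N is at least 1.  It only illustrates that the sender's per-transfer context learns the buffers are free from a single ACK, which the receiver sends after its own N read completions.

```python
from dataclasses import dataclass, field


@dataclass
class Transfer:
    """Hypothetical per-transfer context held by the sender (node A)."""
    keys: list                  # one RMA key per array being exposed
    buffers_free: bool = False  # set once A may overwrite the arrays

    def on_ack(self):
        # In the current scheme, the ACK from B is the only signal A gets.
        self.buffers_free = True


@dataclass
class Receiver:
    """Hypothetical receiver (node B) that pulls the arrays with RMA reads."""
    pending: dict = field(default_factory=dict)  # transfer id -> reads outstanding

    def on_request(self, tid, keys):
        # A real receiver would post one RMA read per key here.
        self.pending[tid] = len(keys)

    def on_read_complete(self, tid):
        # Called once per local read completion.  Returns True when the
        # last read of this transfer has finished and the ACK is due.
        self.pending[tid] -= 1
        if self.pending[tid] == 0:
            del self.pending[tid]
            return True
        return False


def run_transfer(n):
    """Simulate one transfer of n arrays; return (sender context, ACKs sent)."""
    a = Transfer(keys=list(range(n)))
    b = Receiver()
    b.on_request(tid=1, keys=a.keys)
    acks = 0
    for _ in range(n):
        if b.on_read_complete(1):
            a.on_ack()
            acks += 1
    return a, acks
```

A remote-read counter on the memory regions would, in this model, replace the on_ack call with N counter increments observed directly at A.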

One of the issues with counting remotely issued operations is that there can be significant overhead if it is emulated in software.  Hardware support is really needed.  As an example, IB/RoCE/iWARP devices provide no notifications at the target of RDMA read or write operations.  In order to increment a counter at the target, a separate message must be sent.  Libfabric, at the initiating side, has no way of knowing whether counters are being used at the target.  So, in order to support an API where counters *might* be used, it would need to incur this overhead on *all* RMA operations.
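
To put rough numbers on that overhead, here is a small back-of-the-envelope sketch.  It assumes a simplified model in which software emulation costs exactly one extra notification message per RMA operation, as described above; the function name control_messages is hypothetical, and a real provider might behave differently.

```python
def control_messages(n_ops, emulate_target_counter):
    """Count the non-RMA messages needed for one transfer of n_ops arrays.

    Simplified model (an assumption, not measured provider behavior):
    - one request message carries the N RMA keys;
    - with an application-level ACK, one more message ends the transfer;
    - with a software-emulated target counter, every RMA operation needs
      its own notification message so the target can bump the counter.
    """
    request = 1
    if emulate_target_counter:
        return request + n_ops
    return request + 1


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(n, control_messages(n, False), control_messages(n, True))
```

Under this model the application-level ACK stays at two control messages per transfer regardless of N, while the emulated counter grows linearly with N, and the emulation cost would be paid even by applications that never bind a counter.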

- Sean