[libfabric-users] Tagged message question

Thu Jul 30 09:19:22 PDT 2020

> >If the transfer is large, it won't complete immediately and the buffering will occur
> on the send side.  Only the tagged information will be transferred.  Once the receiver
> posts a receive with the correct tag, the data will be retrieved.
> 
> 
> Does this mean that if I send a large tagged message (what is the threshold?), then only
> the tag info is sent initially and

This would be provider specific, but in general, yes, this would be the way to think of it. 

The situation is a little more complex in real life.  For example, the rxm provider implements three different protocols for transferring messages.  Small messages, those that fit entirely within a single receive-side buffer, are sent using an eager protocol.  There is an environment variable to set this size; I think the default is 16k.

Medium-sized messages (by default, up to 128k) use a segmentation and reassembly (SAR) protocol.  Messages between the eager size and the SAR size (again, controlled through environment variables) are broken into eager-size chunks.

Larger messages use rendezvous, and only the headers are sent up front.  The receiver will request the data when it's ready.  I don't remember if the sender includes the first, say, 16k worth of data with the header or not.
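
As a rough sketch of the size-based selection described above (this is not rxm's actual code): the names choose_proto and sar_segments are made up for illustration, and the 16k and 128k limits are just the defaults recalled above, which may differ in your build or provider.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed thresholds, taken from the defaults recalled above.  The real
 * values are provider specific and set through environment variables. */
#define EAGER_LIMIT ((size_t)16 * 1024)   /* assumed eager size */
#define SAR_LIMIT   ((size_t)128 * 1024)  /* assumed SAR size */

enum xfer_proto { PROTO_EAGER, PROTO_SAR, PROTO_RENDEZVOUS };

/* Hypothetical helper: pick a protocol from the message length alone. */
enum xfer_proto choose_proto(size_t len)
{
    if (len <= EAGER_LIMIT)
        return PROTO_EAGER;       /* fits in one receive-side buffer */
    if (len <= SAR_LIMIT)
        return PROTO_SAR;         /* split into eager-size segments */
    return PROTO_RENDEZVOUS;      /* header first, data pulled later */
}

/* Hypothetical helper: number of eager-size segments for a SAR transfer. */
size_t sar_segments(size_t len)
{
    assert(len > 0);
    return (len + EAGER_LIMIT - 1) / EAGER_LIMIT;  /* round up */
}
```

With these assumed limits, a 64k message would go out as four 16k segments, and anything above 128k would fall through to rendezvous.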

> a) if a posted receive buffer with a tag match is already present, the receiving side
> will do an RMA get from the send side buffer to the receive buffer (and then both sides
> get a completion)

Correct.

Note that there is an open pull request that allows the use of RMA writes instead of reads for large message transfers.  That works by having the receiver send back a message to the original sender, who then performs the RMA write.  This is being added to help address situations where RMA write bandwidth outperforms RMA reads, even though it adds latency.  As proposed, RMA writes would replace reads, but I believe there's a chance we'll want both the read and write options available, with the ability to switch from read to write when crossing yet another variable-defined threshold.

> b) if no receive buffer is posted yet, the send side will not post a completion, but
> when the receive side does post a matching buffer, an RMA get will finish the job and
> both sides complete.

Correct.
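
To make cases (a) and (b) concrete, here is a toy, single-process model of read-based rendezvous.  Everything in it is hypothetical (toy_ep, toy_tsend, toy_trecv, and the structs are not libfabric APIs): memcpy stands in for the RMA read, plain bool flags stand in for completions, and tags are matched exactly with no ignore bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING 8

struct rndv_hdr {            /* what the sender transmits up front */
    uint64_t tag;
    const void *src;         /* stands in for the registered send buffer */
    size_t len;
    bool *send_done;         /* sender-side completion flag */
};

struct posted_recv {         /* a receive posted by the application */
    uint64_t tag;
    void *dst;
    size_t cap;
    bool *recv_done;         /* receiver-side completion flag */
};

struct toy_ep {              /* receiver-side state */
    struct rndv_hdr unexpected[MAX_PENDING];
    size_t n_unexpected;
    struct posted_recv posted[MAX_PENDING];
    size_t n_posted;
};

/* The "RMA read": copy the data, then complete both sides.  A short
 * receive buffer simply truncates here; a real provider reports an error. */
static void pull(const struct rndv_hdr *h, const struct posted_recv *r)
{
    size_t n = h->len < r->cap ? h->len : r->cap;
    memcpy(r->dst, h->src, n);
    *h->send_done = true;
    *r->recv_done = true;
}

/* A rendezvous header arrives at the receiver. */
void toy_tsend(struct toy_ep *ep, struct rndv_hdr h)
{
    for (size_t i = 0; i < ep->n_posted; i++) {
        if (ep->posted[i].tag == h.tag) {          /* case (a) */
            pull(&h, &ep->posted[i]);
            memmove(&ep->posted[i], &ep->posted[i + 1],
                    (ep->n_posted - i - 1) * sizeof ep->posted[0]);
            ep->n_posted--;
            return;
        }
    }
    assert(ep->n_unexpected < MAX_PENDING);        /* case (b): park it */
    ep->unexpected[ep->n_unexpected++] = h;
}

/* The application posts a tagged receive. */
void toy_trecv(struct toy_ep *ep, struct posted_recv r)
{
    for (size_t i = 0; i < ep->n_unexpected; i++) {
        if (ep->unexpected[i].tag == r.tag) {      /* finishes case (b) */
            pull(&ep->unexpected[i], &r);
            memmove(&ep->unexpected[i], &ep->unexpected[i + 1],
                    (ep->n_unexpected - i - 1) * sizeof ep->unexpected[0]);
            ep->n_unexpected--;
            return;
        }
    }
    assert(ep->n_posted < MAX_PENDING);
    ep->posted[ep->n_posted++] = r;                /* wait for a header */
}
```

In this model, a toy_tsend with no matching receive leaves the sender's flag unset until a later toy_trecv with the same tag pulls the data, which mirrors case (b).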

> The reason I ask is because I handle larger messages (>threshold) using RMA - but if
> libfabric is doing this internally, I can drop this completely for tagged messages and
> simply use fi_trecv and fi_tsend regardless of size.

Yes, libfabric is doing this internally.

> what I really mean is - Is there any advantage to me sending the memory registration
> info in a small (unexpected) message and then doing an RMA get from the remote side
> myself when the tagged receive is setup?

Honestly, I think there's a disadvantage to implementing this above libfabric because it assumes a specific hardware implementation.

As mentioned, a provider may have multiple mechanisms for handling larger messages, which adjust based on the underlying hardware support.  You would likely want to implement all of the same strategies, but being above libfabric puts your code at a disadvantage because you're more removed from what the underlying hardware is doing.

The mechanisms I mentioned above work well for rxm when layering over RDMA-based hardware.  When layering rxm over tcp streams, the segmentation and reassembly mechanism doesn't make sense, and there's no fundamental difference between RMA read and RMA write rendezvous.  So, we're exploring different optimizations in that case.  It would still have the eager/rendezvous concepts, however.  Other providers, such as rxd and psm2, have similar eager/rendezvous protocols, but implement rendezvous differently.  Psm2 hardware, for example, is designed around tag matching, not RMA.

The real challenge is for the application to deal with messages larger than the max message size supported by the transports.  :)
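
A minimal sketch of the bookkeeping that involves, assuming the application splits one logical message into pieces no larger than the endpoint's limit (which I'd expect to come from max_msg_size in the endpoint attributes returned by fi_getinfo).  The helpers chunk_count and chunk_at are hypothetical, and the sketch ignores how the pieces would be tagged and reassembled on the far side.

```c
#include <assert.h>
#include <stddef.h>

struct chunk {
    size_t offset;   /* byte offset into the application buffer */
    size_t len;      /* bytes to send for this piece */
};

/* Hypothetical helper: how many sends a 'total'-byte message needs. */
size_t chunk_count(size_t total, size_t max_msg)
{
    assert(max_msg > 0);
    return (total + max_msg - 1) / max_msg;   /* round up */
}

/* Hypothetical helper: offset and length of the i-th piece. */
struct chunk chunk_at(size_t total, size_t max_msg, size_t i)
{
    struct chunk c;

    assert(max_msg > 0);
    c.offset = i * max_msg;
    assert(c.offset < total);                 /* i must be in range */
    c.len = total - c.offset < max_msg ? total - c.offset : max_msg;
    return c;
}
```

Each piece would then go out as its own send, with the application responsible for matching the pieces back up on the receive side.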

- Sean