[libfabric-users] Tagged message question

Biddiscombe, John A. biddisco at cscs.ch
Thu Jul 30 12:37:21 PDT 2020


Thanks for the replies, Sean - all of what you have written is fantastic. For the project I'm working on now, which uses tagged messages almost exclusively, this means my code reduces to almost nothing: just some memory registration to handle pinning of buffers, plus the mandatory completion queue polling.
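
As a sanity check on my end, the send path now reduces to roughly the following - a minimal sketch only, assuming an already-opened domain/endpoint and a CQ created with FI_CQ_FORMAT_TAGGED, with error handling mostly omitted (registration is only needed when the provider reports FI_MR_LOCAL):

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_tagged.h>

/* Register, send one tagged message, reap the completion, clean up. */
static int tagged_send(struct fid_domain *domain, struct fid_ep *ep,
                       struct fid_cq *cq, fi_addr_t dest,
                       void *buf, size_t len, uint64_t tag)
{
    struct fid_mr *mr = NULL;
    struct fi_cq_tagged_entry comp;
    ssize_t ret;

    /* Pin the buffer (needed when the provider requires FI_MR_LOCAL). */
    ret = fi_mr_reg(domain, buf, len, FI_SEND, 0, 0, 0, &mr, NULL);
    if (ret)
        return (int) ret;

    /* Post the tagged send; the provider picks eager/SAR/rendezvous itself. */
    ret = fi_tsend(ep, buf, len, fi_mr_desc(mr), dest, tag, NULL);

    /* Mandatory completion queue polling. */
    if (!ret) {
        do {
            ret = fi_cq_read(cq, &comp, 1);
        } while (ret == -FI_EAGAIN);
        ret = (ret == 1) ? 0 : ret;
    }

    fi_close(&mr->fid);
    return (int) ret;
}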


For the HPX project, which uses unexpected messages (think RPC invocations with arbitrarily large arguments that may need to be transferred), I wonder if there is some way I can leverage the mechanism that is in place for tagged messages.


If I want to (for example) migrate an object from one node to another, I send a small header (a rough layout is sketched after this list) containing either

a) a message buffer containing the data contents (if small enough)

b) a list of RMA memory handle/ids that the receiver can GET (might be temp, pressure, velocity, etc array pointers for example)

c) a combination of the above if there are some small and some large objects to be sent
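
To make that concrete, the header I have in mind is something like the struct below - the names, counts and inline threshold are purely illustrative, not anything libfabric defines:

#include <stdint.h>

#define HDR_INLINE_MAX 512             /* case (a): payload small enough to inline  */
#define HDR_MAX_RMA    8               /* case (b): number of RMA regions we allow  */

struct rma_region {
    uint64_t addr;                     /* remote address the receiver will GET from */
    uint64_t len;                      /* length of that region                     */
    uint64_t rkey;                     /* key obtained from fi_mr_key() at the sender */
};

struct migration_header {
    uint32_t num_rma;                  /* 0 => everything is inline (case a)        */
    uint32_t inline_len;               /* bytes valid in inline_data                */
    struct rma_region rma[HDR_MAX_RMA];/* regions to fetch with fi_read() (case b)  */
    uint8_t  inline_data[HDR_INLINE_MAX]; /* small items piggy-backed (case c)      */
};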


I guess that for complex messages with lots of large and small data items I am better off with my hand-rolled protocol - but it would be lovely if I could somehow reuse the libfabric internal protocols for each array. (I guess I could send an unexpected message with the info, generate tags for each large array, and get the receiving end to post the required tagged buffers - I'll ponder this.)
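
Roughly, the receive side of that idea would look like this - just a sketch, assuming the per-array buffers are already registered and that the per-array tags travel in the unexpected header message:

#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Post one tagged receive per large array; libfabric's own rendezvous
 * protocol then moves the bulk data when the matching fi_tsend arrives.
 * Error handling and completion polling omitted. */
static int post_array_receives(struct fid_ep *ep, fi_addr_t src,
                               size_t num_arrays, void **bufs, size_t *lens,
                               void **descs, const uint64_t *tags)
{
    ssize_t ret = 0;

    for (size_t i = 0; i < num_arrays && !ret; i++)
        ret = fi_trecv(ep, bufs[i], lens[i], descs[i],
                       src, tags[i], 0 /* ignore mask */, NULL);
    return (int) ret;
}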


Also - I use 64-bit tags that are semi-random (derived from pointer addresses) - does having very large 64-bit tag values impact performance in any way? Should I try to use simple integer indexes instead? I don't know if you have some kind of map lookup to match tags to buffers or if there is another mechanism.


Many thanks for your ongoing assistance.


JB


________________________________
From: Hefty, Sean <sean.hefty at intel.com>
Sent: 30 July 2020 18:19:22
To: Biddiscombe, John A.; Chris Dolan
Cc: libfabric-users at lists.openfabrics.org
Subject: RE: [libfabric-users] Tagged message question

> > If the transfer is large, it won't complete immediately and the buffering will occur
> > on the send side.  Only the tagged information will be transferred.  Once the receiver
> > posts a receive with the correct tag, the data will be retrieved.
>
>
> Does this mean that if I send a large tagged message (what is the threshold?), then only
> the tag info is sent initially and

This would be provider specific, but in general, yes, this would be the way to think of it.

The situation is a little more complex in real life.  For example, the rxm provider implements 3 different protocols for transferring messages.  Small messages are sent using an eager protocol.  These are for messages that fit entirely within a single receive side buffer.  There is an environment variable to set this size; I think the default is 16k.

Medium-sized messages (by default up to 128k) use a segmentation and reassembly (SAR) protocol.  Messages between the eager size and the SAR size (again, controlled through environment variables) are broken into eager-size chunks.
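
If you want to experiment with those cut-offs, the rxm variables (names as documented in fi_rxm(7), to the best of my recollection) can be set before the first fi_getinfo() call - for example:

#include <stdlib.h>
#include <rdma/fabric.h>

/* Example only: override the rxm eager and SAR thresholds.  The values shown
 * are the (approximate) defaults discussed above; check fi_rxm(7) for the
 * authoritative names and semantics. */
int main(void)
{
    setenv("FI_OFI_RXM_BUFFER_SIZE", "16384", 1);   /* eager limit, ~16k  */
    setenv("FI_OFI_RXM_SAR_LIMIT", "131072", 1);    /* SAR cut-off, ~128k */

    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    /* The environment is read when the provider is queried/loaded. */
    fi_getinfo(FI_VERSION(1, 10), NULL, NULL, 0, hints, &info);

    /* ... open fabric/domain/endpoint as usual ... */

    if (info)
        fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}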

Larger messages use rendezvous, and will only send the headers.  The receiver will request the data when it's ready.  I don't remember if the sender includes the first, say, 16k worth of data with the header or not.

> a) if a posted receive buffer with a tag match is already present, the receiving side
> will do an RMA get from the send side buffer to the receive buffer (and then both sides
> get a completion)

Correct.

Note that there is an open pull request that allows the use of RMA writes instead of reads for large message transfers.  That works by having the receiver send back a message to the original sender, who then performs the RMA write.  This is being added to help address situations where RMA write bandwidth outperforms RMA reads, even though it adds latency.  As proposed, RMA writes would replace reads, but I believe there's a chance we'll want both the read and write options available, with the ability to switch from read to write when crossing yet another variable-defined threshold.

> b) if no receive buffer is posted yet, the send side will not post a completion, but
> when the receive side does post a matching buffer, an RMA get will finish the job and
> both sides complete.

correct

> The reason I ask is because I handle larger messages (>threshold) using RMA - but if
> libfabric is doing this internally, I can drop this completely for tagged messages and
> simply use fi_trecv and fi_tsend regardless of size.

Yes, libfabric is doing this internally.

> what I really mean is - Is there any advantage to me sending the memory registration
> info in a small (unexpected) message and then doing an RMA get from the remote side
> myself when the tagged receive is setup?

Honestly, I think there's a disadvantage to implementing this above libfabric because it assumes a specific hardware implementation.

As mentioned, a provider may have multiple mechanisms for handling larger messages, which it adjusts based on the underlying hardware support.  You would likely want to implement all of the same strategies, but being above libfabric puts your code at a disadvantage because you're more removed from what the underlying hardware is doing.

The mechanisms I mentioned above work well for rxm when layering over RDMA-based hardware.  When layering rxm over tcp streams, the segmentation and reassembly mechanism doesn't make sense, and there's no fundamental difference between RMA read versus RMA write rendezvous.  So, we're exploring different optimizations in that case.  It would still have the eager/rendezvous concepts, however.  Other providers, such as rxd and psm2, have similar eager/rendezvous protocols, but implement rendezvous differently.  The psm2 hardware, for example, is designed around tag matching, not RMA.

The real challenge is for the application to deal with messages larger than the max message size supported by the transports.  :)
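
For what it's worth, that usually ends up as a chunking loop above the transport, something like the sketch below - illustrative only, with the receive side expected to post the same sequence of matching fi_trecv calls, and -FI_EAGAIN/completion handling omitted:

#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Split a payload larger than the endpoint's max_msg_size (taken from
 * fi_info->ep_attr->max_msg_size at setup) into multiple tagged sends.
 * A per-chunk tag (base_tag + seq) keeps the pieces identifiable. */
static int send_chunked(struct fid_ep *ep, fi_addr_t dest, void *desc,
                        const char *buf, size_t len, size_t max_msg_size,
                        uint64_t base_tag)
{
    size_t off = 0;
    uint64_t seq = 0;
    ssize_t ret = 0;

    while (off < len && !ret) {
        size_t chunk = len - off;
        if (chunk > max_msg_size)
            chunk = max_msg_size;
        ret = fi_tsend(ep, buf + off, chunk, desc, dest, base_tag + seq, NULL);
        off += chunk;
        seq++;
    }
    return (int) ret;
}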

- Sean