[libfabric-users] An approach on Two Multi-process Applications Communication using RDM and RMA.

Isaac Nuñez isaacnez at outlook.com
Mon May 11 10:44:05 PDT 2020


The idea is that one multi-node application (process-focused) handles the data; each node operates independently, hence the RMA operations. The other application, which acts as the client, uses MPI. The root process of each node assigned to the MPI job communicates with one process on the other side (the two may be on the same node or on different nodes). I have already reserved ports, which I pass to libfabric through the service argument, and I am sure both sides use the same port (just as fabtests does), but I still get error 61.

So my question is whether that "dynamic" behavior could be the reason it fails. Is there a problem if several processes are creating AVs?


Isaac

________________________________
From: Hefty, Sean <sean.hefty at intel.com>
Sent: Monday, May 11, 2020 18:48
To: Isaac Nuñez <isaacnez at outlook.com>; libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: RE: An approach on Two Multi-process Applications Communication using RDM and RMA.

> I am having trouble establishing communication between two applications. One is MPI-focused
> and the other is process-focused. When I get to the initialization of the AV, the
> other side responds with error 61, but the fabtests examples seem to work (at
> least fi_rdm_rma_simple). What would be the best approach? I am using active endpoints,
> but that does not seem to be the right technique. What are the main considerations I
> must keep in mind while doing this?

There aren't many details on what's happening. You need to ensure that all sides follow the same protocol for inserting addresses into the AV. Fabtests only talks to itself, so both sides always follow the same protocol, which is why it works. Fabtests carries the address in the first packet sent from the client to the server. Many MPI apps rely on an out-of-band setup to initialize their AVs. That's definitely easier, but it requires that the nodes already have some level of communication between them, usually provided by a process manager.

- Sean