[libfabric-users] Not receiving messages from other ranks

Biddiscombe, John A. john.biddiscombe at cscs.ch
Fri Feb 12 13:29:36 PST 2021


Verified: tags are 64 bits on both tcp and gni (0xaaaaaaaaaaaaaaa).


I just tested on GNI and I get exactly the same problem. A snippet of my log looks like this:

<DEB> 0000279468 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761ff800000000
<DEB> 0000279505 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761ff800000003
<DEB> 0000279564 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761ff800000004
<DEB> 0000279589 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761ff800000001
<DEB> 0000279649 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761ff800000000
<DEB> 0000280477 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761ff800000003
<DEB> 0000280531 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761ff800000004
<DEB> 0000280583 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761ff800000001
<DEB> 0000280676 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x81bbe0
<DEB> 0000280713 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761ff800000000 send
<DEB> 0000280795 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x81bcb0
<DEB> 0000280803 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761ff800000003 send
<DEB> 0000280810 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x8242b0
<DEB> 0000280817 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761ff800000004 send
<DEB> 0000280824 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x824370
<DEB> 0000280830 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761ff800000001 send

When I use separate endpoints for send and recv, the 4 sends are matched with the correct tags, but the recvs are not matched. (I only add the address of the receive endpoint to the AV, not that of the send endpoint.)

I am polling both the tx and rx CQs.


If I switch to a single endpoint for both, then everything works as expected with gni, sockets, and tcp:

<DEB> 0000246689 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761f9800000000
<DEB> 0000246721 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761f9800000003
<DEB> 0000246742 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761f9800000004
<DEB> 0000246762 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI recv message buffer  <- 00 tag 0x00761f9800000001
<DEB> 0000246806 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761f9800000000
<DEB> 0000247215 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761f9800000003
<DEB> 0000247251 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761f9800000004
<DEB> 0000247293 0x2aaaaab1d340 cpu --- nid00986(0)   COMMUNI send message buffer  -> 00 tag 0x00761f9800000001
<DEB> 0000247382 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x7e7660
<DEB> 0000247391 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000000 send
<DEB> 0000247399 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           rxcq MSG tagged recv completion 0x7e6340
<DEB> 0000247405 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000000 recv
<DEB> 0000247411 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x7e7730
<DEB> 0000247418 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000003 send
<DEB> 0000247423 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           rxcq MSG tagged recv completion 0x7e6230
<DEB> 0000247430 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000003 recv
<DEB> 0000247435 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x7eb9e0
<DEB> 0000247441 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000004 send
<DEB> 0000247446 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           rxcq MSG tagged recv completion 0x7e6050
<DEB> 0000247452 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000004 recv
<DEB> 0000247457 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           txcq MSG tagged send completion 0x7ebaa0
<DEB> 0000247464 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000001 send
<DEB> 0000247469 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL Completion           rxcq MSG tagged recv completion 0x7bf770
<DEB> 0000247475 0x2aaaaab1d340 cpu --- nid00986(0)   CONTROL set_ready            0x00761f9800000001 recv

Now we have 4 matched sends and 4 matched recvs. This test uses 1 rank only (sending to itself).

I must have made some mistake in setting up the endpoints, but I just don't know what it might be. Is there an example anywhere that uses different tx/rx endpoints?

Many thanks for your patience

JB
________________________________
From: Hefty, Sean <sean.hefty at intel.com>
Sent: 12 February 2021 22:00:53
To: Biddiscombe, John A.; libfabric-users at lists.openfabrics.org
Subject: RE: Not receiving messages from other ranks

> I'm using 64bit tags and I get weird (=I can't explain it well) results if I put some
> info like rank in the tag using bitmasks - then the sockets version stops working (but
> I have no idea why making the tags more different/unique would be an issue (however,
> mismatched tags would explain the data fail when tests are run with sockets - if the
> wrong tag is being matched, we get errors, but if I change the tags, nothing matches -
> should I only use 32bit tags?).

Check the results from fi_getinfo.  There's a tag mask somewhere in the attributes that indicates which tags are valid.  I believe most providers are in the 60+ tag range, depending on other options, but I'm not sure about the full 64-bit range.

- Sean
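
A minimal sketch of the check Sean suggests, under stated assumptions. As far as I understand the fi_endpoint documentation, the tag mask he refers to is the mem_tag_format field of ep_attr in the fi_info returned by fi_getinfo, and the highest set bit in that value gives the number of tag bits the provider supports. The code below takes the value as a plain integer so that it does not depend on libfabric. The helper names and the rank/sequence tag layout are hypothetical and are not taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of usable tag bits implied by a mem_tag_format value:
 * the position of the highest set bit plus one (0 if no bit is set).
 * Assumption: in a real program the argument would be
 * info->ep_attr->mem_tag_format taken from the fi_getinfo result. */
static inline unsigned tag_bits_supported(uint64_t mem_tag_format)
{
    unsigned bits = 0;
    while (mem_tag_format != 0) {
        bits++;
        mem_tag_format >>= 1;
    }
    return bits;
}

/* Mask covering every usable tag bit. */
static inline uint64_t tag_valid_mask(uint64_t mem_tag_format)
{
    unsigned bits = tag_bits_supported(mem_tag_format);
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* True if the tag sets no bit outside the supported range. */
static inline bool tag_fits(uint64_t tag, uint64_t mem_tag_format)
{
    return (tag & ~tag_valid_mask(mem_tag_format)) == 0;
}

/* Hypothetical layout for illustration only:
 * rank in the upper 32 bits, sequence number in the lower 32 bits. */
static inline uint64_t make_tag(uint32_t rank, uint32_t seq)
{
    return ((uint64_t)rank << 32) | seq;
}
```

Calling tag_fits() on every tag before posting a send or recv would catch a tag that spills into bits the provider ignores, which is one way two tags that look different to the application could end up matching each other.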