[libfabric-users] fi_rma_bw error

Niyaz Murshed Niyaz.Murshed at arm.com
Thu Sep 26 08:35:29 PDT 2024


Update:
The issue is resolved. The problem was found to be slow NIC provider code in rdma-core.

From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Niyaz Murshed <Niyaz.Murshed at arm.com>
Date: Monday, September 23, 2024 at 9:24 AM
To: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Cc: nd <nd at arm.com>
Subject: Re: [libfabric-users] fi_rma_bw error
Debugging this further, I see that the server (which accepts the WRITE) sends an RNR NAK.
This indicates that the client is sending WRITE requests faster than the server can accept them.


From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Niyaz Murshed <Niyaz.Murshed at arm.com>
Date: Thursday, September 19, 2024 at 10:33 AM
To: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: [libfabric-users] fi_rma_bw error
Hello,

I am seeing issues with transfer sizes larger than 38000 bytes when running the fi_rma_bw test. Has something changed recently?

root@nvidia-grace-2-1:/# fi_rma_bw -s   192.168.100.200 192.168.100.100  -e msg   -o write -d roceP2p1s0 -S 32000 -p verbs
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
31k     20k     610m        0.44s   1454.91      21.99       0.05
root@nvidia-grace-2-1:/# fi_rma_bw -s   192.168.100.200 192.168.100.100  -e msg   -o write -d roceP2p1s0 -S 36000 -p verbs
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
35k     20k     686m        0.52s   1379.17      26.10       0.04
root@nvidia-grace-2-1:/# fi_rma_bw -s   192.168.100.200 192.168.100.100  -e msg   -o write -d roceP2p1s0 -S 38000 -p verbs
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
37k     20k     724m        0.56s   1366.78      27.80       0.04
root@nvidia-grace-2-1:/# fi_rma_bw -s   192.168.100.200 192.168.100.100  -e msg   -o write -d roceP2p1s0 -S 40000 -p verbs
[error] fabtests:common/shared.c:2995: cq_readerr 5 (Input/output error), provider errno: 2 (local QP operation error)




root@nvidia-grace-2-1:/# fi_rma_bw -s   192.168.100.200 192.168.100.100  -e msg   -o write -d roceP2p1s0 -p verbs
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
64      20k     1.2m        0.02s     66.06       0.97       1.03
256     20k     4.8m        0.02s    327.37       0.78       1.28
1k      20k     19m         0.02s   1306.79       0.78       1.28
4k      20k     78m         0.05s   1525.88       2.68       0.37
[error] fabtests:common/shared.c:2995: cq_readerr 5 (Input/output error), provider errno: 2 (local QP operation error)
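
The runs above place the failure somewhere between 38000 and 40000 bytes. A sweep along the following lines could narrow that down. This is only a sketch: the addresses, device name, and flags are copied from the commands above, while the function names and the DRY_RUN switch are made up for illustration. It assumes that fi_rma_bw exits with a non-zero status on error and that a matching server instance is started on the peer for every run.

```shell
#!/bin/sh
# Sketch: step the fi_rma_bw transfer size to find the first size that fails.
# Addresses, device, and flags are taken from the runs above; DRY_RUN and the
# function names are illustrative and not part of fabtests.

SRC_ADDR=192.168.100.200
DST_ADDR=192.168.100.100
DEVICE=roceP2p1s0

# Print the client command line for one transfer size ($1, in bytes).
rma_bw_cmd() {
    echo "fi_rma_bw -s $SRC_ADDR $DST_ADDR -e msg -o write -d $DEVICE -S $1 -p verbs"
}

# Walk sizes from $1 to $2 in steps of $3.
# With DRY_RUN=1 (the default) the commands are only printed.
sweep_sizes() {
    size=$1
    while [ "$size" -le "$2" ]; do
        cmd=$(rma_bw_cmd "$size")
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        elif ! $cmd; then
            # Assumes a non-zero exit status when the test hits an error.
            echo "first failing size: $size"
            return 1
        fi
        size=$((size + $3))
    done
}

sweep_sizes 38000 40000 500
```

As written, the script only prints the five commands it would run. Setting DRY_RUN=0 in the environment executes them and stops at the first size that fails.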


Any suggestions on where to look for the error?

Regards,
Niyaz
