[libfabric-users] fi_rma_bw error
Niyaz Murshed
Niyaz.Murshed at arm.com
Thu Sep 26 08:35:29 PDT 2024
Update:
The issue is resolved. The cause turned out to be slow NIC provider code in rdma-core.
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Niyaz Murshed <Niyaz.Murshed at arm.com>
Date: Monday, September 23, 2024 at 9:24 AM
To: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Cc: nd <nd at arm.com>
Subject: Re: [libfabric-users] fi_rma_bw error
Debugging this further, I see that the server (the target of the WRITE) sends an RNR NAK.
This indicates that the client is issuing WRITE requests faster than the server can accept them.
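The sender-outpaces-receiver situation described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, the tick-based timing, and the fixed number of receiver slots are all hypothetical, and a rejection here merely stands in for an RNR NAK. It does not model real verbs behaviour such as posted receive work requests, RNR timers, or retry counts.

```python
def count_rejections(num_requests, send_interval, service_time, slots):
    """Toy model: count requests that arrive while every receiver slot is busy.

    All parameters are hypothetical, measured in abstract ticks:
    the sender issues one request every `send_interval` ticks, the receiver
    needs `service_time` ticks per request and can hold `slots` requests.
    """
    busy_until = []   # completion times of requests currently holding a slot
    next_free = 0     # receiver handles requests one at a time
    rejected = 0
    for i in range(num_requests):
        now = i * send_interval
        # Release slots whose requests have completed by now.
        busy_until = [t for t in busy_until if t > now]
        if len(busy_until) >= slots:
            rejected += 1          # stand-in for an RNR NAK
            continue
        start = max(now, next_free)
        next_free = start + service_time
        busy_until.append(next_free)
    return rejected


# Sender slower than receiver: nothing is rejected.
print(count_rejections(100, send_interval=2, service_time=1, slots=4))
# Sender faster than receiver: the slots fill up and rejections appear.
print(count_rejections(100, send_interval=1, service_time=2, slots=4))
```

In this model the rejections only appear once the backlog fills every slot, which is why a short or small-message run can pass while a longer or larger one fails.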
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Niyaz Murshed <Niyaz.Murshed at arm.com>
Date: Thursday, September 19, 2024 at 10:33 AM
To: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: [libfabric-users] fi_rma_bw error
Hello,
I am seeing errors with transfer sizes above 38000 bytes when running the fi_rma_bw test. Has something changed recently?
root@nvidia-grace-2-1:/# fi_rma_bw -s 192.168.100.200 192.168.100.100 -e msg -o write -d roceP2p1s0 -S 32000 -p verbs
bytes   iters   total   time    MB/sec     usec/xfer   Mxfers/sec
31k     20k     610m    0.44s   1454.91    21.99       0.05
root@nvidia-grace-2-1:/# fi_rma_bw -s 192.168.100.200 192.168.100.100 -e msg -o write -d roceP2p1s0 -S 36000 -p verbs
bytes   iters   total   time    MB/sec     usec/xfer   Mxfers/sec
35k     20k     686m    0.52s   1379.17    26.10       0.04
root@nvidia-grace-2-1:/# fi_rma_bw -s 192.168.100.200 192.168.100.100 -e msg -o write -d roceP2p1s0 -S 38000 -p verbs
bytes   iters   total   time    MB/sec     usec/xfer   Mxfers/sec
37k     20k     724m    0.56s   1366.78    27.80       0.04
root@nvidia-grace-2-1:/# fi_rma_bw -s 192.168.100.200 192.168.100.100 -e msg -o write -d roceP2p1s0 -S 40000 -p verbs
[error] fabtests:common/shared.c:2995: cq_readerr 5 (Input/output error), provider errno: 2 (local QP operation error)
root@nvidia-grace-2-1:/# fi_rma_bw -s 192.168.100.200 192.168.100.100 -e msg -o write -d roceP2p1s0 -p verbs
bytes   iters   total   time    MB/sec     usec/xfer   Mxfers/sec
64      20k     1.2m    0.02s   66.06      0.97        1.03
256     20k     4.8m    0.02s   327.37     0.78        1.28
1k      20k     19m     0.02s   1306.79    0.78        1.28
4k      20k     78m     0.05s   1525.88    2.68        0.37
[error] fabtests:common/shared.c:2995: cq_readerr 5 (Input/output error), provider errno: 2 (local QP operation error)
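As a sanity check on how to read the columns in the output above, the three successful -S runs can be recomputed from the size, the iteration count, and the usec/xfer value. This is a sketch resting on assumptions inferred from the numbers rather than from the fabtests source: that the "bytes" and "total" labels are truncated binary units (KiB, MiB), that "20k" means 20000 iterations, and that MB/sec uses 10^6 bytes. The helper name `summarize` is hypothetical.

```python
MIB = 1024 * 1024  # assumption: the size and total labels use binary units


def summarize(size_bytes, iters, usec_per_xfer):
    """Recompute the size label, total label, and MB/sec for one table row."""
    total_bytes = size_bytes * iters
    return {
        "size_kib": size_bytes // 1024,            # truncated KiB, as in "31k"
        "total_mib": total_bytes // MIB,           # truncated MiB, as in "610m"
        "mb_per_sec": size_bytes / usec_per_xfer,  # bytes per usec == 10^6 bytes/s
    }


# (size passed with -S, usec/xfer read from the table); 20000 iterations each
for size, usec in [(32000, 21.99), (36000, 26.10), (38000, 27.80)]:
    row = summarize(size, 20_000, usec)
    print(f"{size}: {row['size_kib']}k {row['total_mib']}m "
          f"{row['mb_per_sec']:.2f} MB/sec")
```

Under these assumptions the recomputed labels match the table, and the recomputed MB/sec values agree with the reported ones to within a fraction of a percent; the small gap is expected because usec/xfer is itself rounded to two decimals.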
Any suggestions on where to look for the cause of this error?
Regards,
Niyaz