[libfabric-users] interesting first results of my benchmark

Andreas webmaster at i-need-change.org
Wed May 15 03:23:52 PDT 2019


Hi everyone,

As some of you may remember, I'm working on an any-to-any benchmark for my bachelor thesis.
I'm now at a stage where I'm getting my first proper results, and I'm a little intrigued, because I don't fully understand them yet.

A little background information:

Hardware:
	The network I'm testing on uses four 36-port switches: one central switch connects the other three via 12 physical links each (see the right side of 'network topology.jpg'), with 4x QDR to PCIe 2.0 x8 (~32 Gb/s theoretical effective throughput per port). Two of those switches branch out to 16-20 nodes each (pn01-40), which are the nodes I'm working with.
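
	(If my arithmetic is right: 4x QDR signals at 4 x 10 Gb/s = 40 Gb/s, and after 8b/10b encoding that leaves ~32 Gb/s, i.e. ~4 GB/s of payload per port, which is also roughly what a PCIe 2.0 x8 slot can feed. That matches the ~4.00 GB/s a single uncontended pair reaches in the ibcongest output below, and the ~0.80 GB/s per stream when five streams share one port.)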

Software:
	The benchmark I wrote uses one central controller node to synchronize the stages of all other nodes (see 'Benchmarking_process_model.png'), which are divided into passive server and active client groups. Between the N clients and M servers, NxM connected endpoints are established.
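
	To illustrate what I mean by "connected endpoints", here is a stripped-down sketch of the kind of setup each client does once per server (not my exact code: fabric, domain, EQ and CQ are assumed to be opened elsewhere, error handling is dropped, and the names are just illustrative):

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cm.h>
#include <rdma/fi_eq.h>

/* Connect one client-side FI_EP_MSG endpoint to one server;
 * called once for each of the M servers. */
static struct fid_ep *connect_to_server(struct fid_domain *domain,
                                        struct fid_eq *eq, struct fid_cq *cq,
                                        const char *node, const char *service)
{
    struct fi_info *hints = fi_allocinfo(), *info;
    struct fi_eq_cm_entry entry;
    struct fid_ep *ep;
    uint32_t event;

    hints->ep_attr->type = FI_EP_MSG;   /* connection-oriented endpoints */
    hints->caps          = FI_RMA;      /* the benchmark only issues fi_write() */
    fi_getinfo(FI_VERSION(1, 7), node, service, 0, hints, &info);

    fi_endpoint(domain, info, &ep, NULL);
    fi_ep_bind(ep, &eq->fid, 0);                      /* CM events */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);  /* write completions */
    fi_enable(ep);

    fi_connect(ep, info->dest_addr, NULL, 0);
    fi_eq_sread(eq, &event, &entry, sizeof entry, -1, 0);  /* wait for FI_CONNECTED */

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return ep;
}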

Benchmarking workflow:
- ´connect´: all N clients connect to all M servers, establishing connected endpoints.
- ´start´ signal: all clients queue a defined number of fi_write() operations per endpoint; whenever a completion arrives, the respective endpoint is looked up and a new fi_write() is enqueued for that endpoint.
- ´checkpoint´ every ´dt´ seconds: clients send a current snapshot of their completion count for each endpoint to the controller.
- ´stop´ after ´t´ seconds: clients and servers disconnect; clients send the final sum of completions for each endpoint to the controller.

From what I can tell, my implementation should run completely asynchronously. Endpoints and nodes work independently, and implementation-wise there should be no waiting on each other. The only critical section is the endpoint lookup and re-queueing of fi_write(), which I made sure is quick enough to serve all endpoints in time. So the only synchronization should happen at the ´connect´, ´start´ and ´stop´/´checkpoint´ signals.
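
To make that concrete, the completion-driven part of each client looks roughly like this (again a simplified sketch, not my exact code; the ep_state bookkeeping and buffer handling are just illustrative):

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_rma.h>

/* Illustrative per-endpoint bookkeeping kept by each client. */
struct ep_state {
    struct fid_ep *ep;
    void          *buf;          /* registered local buffer */
    void          *desc;         /* fi_mr_desc() of that buffer */
    uint64_t       raddr;        /* remote virtual address */
    uint64_t       rkey;         /* remote MR key */
    uint64_t       completions;  /* counter reported at each checkpoint */
};

/* Post one RDMA write; the ep_state pointer doubles as the operation
 * context so the completion can be mapped back to its endpoint.
 * dest_addr is ignored for connected (FI_EP_MSG) endpoints. */
static void post_write(struct ep_state *s, size_t len)
{
    fi_write(s->ep, s->buf, len, s->desc, 0, s->raddr, s->rkey, s);
}

/* One progress pass over a CQ opened with FI_CQ_FORMAT_CONTEXT:
 * for every completion, look up the endpoint via its context and
 * immediately queue the next write for that endpoint. */
static void progress(struct fid_cq *cq, size_t len)
{
    struct fi_cq_entry comp[16];
    ssize_t i, n = fi_cq_read(cq, comp, 16);

    if (n < 0)
        return;  /* -FI_EAGAIN: nothing completed yet; error handling omitted */

    for (i = 0; i < n; i++) {
        struct ep_state *s = comp[i].op_context;
        s->completions++;      /* snapshots of these counters go to the controller */
        post_write(s, len);    /* keep the endpoint busy at all times */
    }
}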

The results of this setup:
- At all times, all endpoints and nodes show extremely similar completion counts, with variations of only about 0.01-0.0001%.
- Good: when routing is optimal, each node performs almost as well as it would in a 1x1 benchmark, maintaining very acceptable throughput.
- Bad: if just two nodes share the same port in a switch's routing table, all of the nodes and endpoints are slowed down.

Example:
Compare heatmap-healthy with heatmap-defektor: the only difference is that pn16 (healthy run) was swapped for pn17 (defective run).
ibdiag and ibcongest -C on the 5x5 setup revealed that pn17 and pn22 seem to share a physical link in the routing tables:
Stage 1, with pn17: pn17 and pn22 appear to have halved bandwidth.
Stage 3, pn16 replaces pn17 -> full bandwidth for everyone.
# SRC, SLID, DST, DLID, BW [GB/s], AGG-BW [GB/s]
STAGE:1
pn04/U1/P1, 10, pn17/mlx4_0/P1, 21, 0.40, 0.40
pn04/U1/P1, 10, pn18/mlx4_0/P1, 20, 0.80, 1.20
pn04/U1/P1, 10, pn19/mlx4_0/P1, 25, 0.80, 2.00
pn04/U1/P1, 10, pn21/mlx4_0/P1, 31, 0.80, 2.80
pn04/U1/P1, 10, pn22/mlx4_0/P1, 33, 0.40, 3.20
pn05/mlx4_0/P1, 14, pn17/mlx4_0/P1, 21, 0.40, 3.60
pn05/mlx4_0/P1, 14, pn18/mlx4_0/P1, 20, 0.80, 4.40
pn05/mlx4_0/P1, 14, pn19/mlx4_0/P1, 25, 0.80, 5.20
pn05/mlx4_0/P1, 14, pn21/mlx4_0/P1, 31, 0.80, 6.00
pn05/mlx4_0/P1, 14, pn22/mlx4_0/P1, 33, 0.40, 6.40
pn06/mlx4_0/P1, 15, pn17/mlx4_0/P1, 21, 0.40, 6.80
pn06/mlx4_0/P1, 15, pn18/mlx4_0/P1, 20, 0.80, 7.60
pn06/mlx4_0/P1, 15, pn19/mlx4_0/P1, 25, 0.80, 8.40
pn06/mlx4_0/P1, 15, pn21/mlx4_0/P1, 31, 0.80, 9.20
pn06/mlx4_0/P1, 15, pn22/mlx4_0/P1, 33, 0.40, 9.60
pn07/mlx4_0/P1, 23, pn17/mlx4_0/P1, 21, 0.40, 10.00
pn07/mlx4_0/P1, 23, pn18/mlx4_0/P1, 20, 0.80, 10.80
pn07/mlx4_0/P1, 23, pn19/mlx4_0/P1, 25, 0.80, 11.60
pn07/mlx4_0/P1, 23, pn21/mlx4_0/P1, 31, 0.80, 12.40
pn07/mlx4_0/P1, 23, pn22/mlx4_0/P1, 33, 0.40, 12.80
pn08/mlx4_0/P1, 17, pn17/mlx4_0/P1, 21, 0.40, 13.20
pn08/mlx4_0/P1, 17, pn18/mlx4_0/P1, 20, 0.80, 14.00
pn08/mlx4_0/P1, 17, pn19/mlx4_0/P1, 25, 0.80, 14.80
pn08/mlx4_0/P1, 17, pn21/mlx4_0/P1, 31, 0.80, 15.60
pn08/mlx4_0/P1, 17, pn22/mlx4_0/P1, 33, 0.40, 16.00
STAGE:2
pn04/U1/P1, 10, pn17/mlx4_0/P1, 21, 4.00, 4.00
STAGE:3
pn04/U1/P1, 10, pn16/mlx4_0/P1, 19, 0.80, 0.80
pn04/U1/P1, 10, pn18/mlx4_0/P1, 20, 0.80, 1.60
pn04/U1/P1, 10, pn19/mlx4_0/P1, 25, 0.80, 2.40
pn04/U1/P1, 10, pn21/mlx4_0/P1, 31, 0.80, 3.20
pn04/U1/P1, 10, pn22/mlx4_0/P1, 33, 0.80, 4.00
pn05/mlx4_0/P1, 14, pn16/mlx4_0/P1, 19, 0.80, 4.80
pn05/mlx4_0/P1, 14, pn18/mlx4_0/P1, 20, 0.80, 5.60
pn05/mlx4_0/P1, 14, pn19/mlx4_0/P1, 25, 0.80, 6.40
pn05/mlx4_0/P1, 14, pn21/mlx4_0/P1, 31, 0.80, 7.20
pn05/mlx4_0/P1, 14, pn22/mlx4_0/P1, 33, 0.80, 8.00
pn06/mlx4_0/P1, 15, pn16/mlx4_0/P1, 19, 0.80, 8.80
pn06/mlx4_0/P1, 15, pn18/mlx4_0/P1, 20, 0.80, 9.60
pn06/mlx4_0/P1, 15, pn19/mlx4_0/P1, 25, 0.80, 10.40
pn06/mlx4_0/P1, 15, pn21/mlx4_0/P1, 31, 0.80, 11.20
pn06/mlx4_0/P1, 15, pn22/mlx4_0/P1, 33, 0.80, 12.00
pn07/mlx4_0/P1, 23, pn16/mlx4_0/P1, 19, 0.80, 12.80
pn07/mlx4_0/P1, 23, pn18/mlx4_0/P1, 20, 0.80, 13.60
pn07/mlx4_0/P1, 23, pn19/mlx4_0/P1, 25, 0.80, 14.40
pn07/mlx4_0/P1, 23, pn21/mlx4_0/P1, 31, 0.80, 15.20
pn07/mlx4_0/P1, 23, pn22/mlx4_0/P1, 33, 0.80, 16.00
pn08/mlx4_0/P1, 17, pn16/mlx4_0/P1, 19, 0.80, 16.80
pn08/mlx4_0/P1, 17, pn18/mlx4_0/P1, 20, 0.80, 17.60
pn08/mlx4_0/P1, 17, pn19/mlx4_0/P1, 25, 0.80, 18.40
pn08/mlx4_0/P1, 17, pn21/mlx4_0/P1, 31, 0.80, 19.20
pn08/mlx4_0/P1, 17, pn22/mlx4_0/P1, 33, 0.80, 20.00

However, ibcongest (without -C) on the same setup:
Stage 1: all nodes appear to have halved bandwidth.
Stage 3: pn16 replaces pn17 -> full bandwidth for everyone.

# SRC, SLID, DST, DLID, BW [GB/s], AGG-BW [GB/s]
STAGE:1
pn04/U1/P1, 10, pn17/mlx4_0/P1, 21, 0.40, 0.40
pn04/U1/P1, 10, pn18/mlx4_0/P1, 20, 0.40, 0.80
pn04/U1/P1, 10, pn19/mlx4_0/P1, 25, 0.40, 1.20
pn04/U1/P1, 10, pn21/mlx4_0/P1, 31, 0.40, 1.60
pn04/U1/P1, 10, pn22/mlx4_0/P1, 33, 0.40, 2.00
pn05/mlx4_0/P1, 14, pn17/mlx4_0/P1, 21, 0.40, 2.40
pn05/mlx4_0/P1, 14, pn18/mlx4_0/P1, 20, 0.40, 2.80
pn05/mlx4_0/P1, 14, pn19/mlx4_0/P1, 25, 0.40, 3.20
pn05/mlx4_0/P1, 14, pn21/mlx4_0/P1, 31, 0.40, 3.60
pn05/mlx4_0/P1, 14, pn22/mlx4_0/P1, 33, 0.40, 4.00
pn06/mlx4_0/P1, 15, pn17/mlx4_0/P1, 21, 0.40, 4.40
pn06/mlx4_0/P1, 15, pn18/mlx4_0/P1, 20, 0.40, 4.80
pn06/mlx4_0/P1, 15, pn19/mlx4_0/P1, 25, 0.40, 5.20
pn06/mlx4_0/P1, 15, pn21/mlx4_0/P1, 31, 0.40, 5.60
pn06/mlx4_0/P1, 15, pn22/mlx4_0/P1, 33, 0.40, 6.00
pn07/mlx4_0/P1, 23, pn17/mlx4_0/P1, 21, 0.40, 6.40
pn07/mlx4_0/P1, 23, pn18/mlx4_0/P1, 20, 0.40, 6.80
pn07/mlx4_0/P1, 23, pn19/mlx4_0/P1, 25, 0.40, 7.20
pn07/mlx4_0/P1, 23, pn21/mlx4_0/P1, 31, 0.40, 7.60
pn07/mlx4_0/P1, 23, pn22/mlx4_0/P1, 33, 0.40, 8.00
pn08/mlx4_0/P1, 17, pn17/mlx4_0/P1, 21, 0.40, 8.40
pn08/mlx4_0/P1, 17, pn18/mlx4_0/P1, 20, 0.40, 8.80
pn08/mlx4_0/P1, 17, pn19/mlx4_0/P1, 25, 0.40, 9.20
pn08/mlx4_0/P1, 17, pn21/mlx4_0/P1, 31, 0.40, 9.60
pn08/mlx4_0/P1, 17, pn22/mlx4_0/P1, 33, 0.40, 10.00
STAGE:2
pn04/U1/P1, 10, pn17/mlx4_0/P1, 21, 4.00, 4.00
STAGE:3
pn04/U1/P1, 10, pn16/mlx4_0/P1, 19, 0.80, 0.80
pn04/U1/P1, 10, pn18/mlx4_0/P1, 20, 0.80, 1.60
pn04/U1/P1, 10, pn19/mlx4_0/P1, 25, 0.80, 2.40
pn04/U1/P1, 10, pn21/mlx4_0/P1, 31, 0.80, 3.20
pn04/U1/P1, 10, pn22/mlx4_0/P1, 33, 0.80, 4.00
pn05/mlx4_0/P1, 14, pn16/mlx4_0/P1, 19, 0.80, 4.80
pn05/mlx4_0/P1, 14, pn18/mlx4_0/P1, 20, 0.80, 5.60
pn05/mlx4_0/P1, 14, pn19/mlx4_0/P1, 25, 0.80, 6.40
pn05/mlx4_0/P1, 14, pn21/mlx4_0/P1, 31, 0.80, 7.20
pn05/mlx4_0/P1, 14, pn22/mlx4_0/P1, 33, 0.80, 8.00
pn06/mlx4_0/P1, 15, pn16/mlx4_0/P1, 19, 0.80, 8.80
pn06/mlx4_0/P1, 15, pn18/mlx4_0/P1, 20, 0.80, 9.60
pn06/mlx4_0/P1, 15, pn19/mlx4_0/P1, 25, 0.80, 10.40
pn06/mlx4_0/P1, 15, pn21/mlx4_0/P1, 31, 0.80, 11.20
pn06/mlx4_0/P1, 15, pn22/mlx4_0/P1, 33, 0.80, 12.00
pn07/mlx4_0/P1, 23, pn16/mlx4_0/P1, 19, 0.80, 12.80
pn07/mlx4_0/P1, 23, pn18/mlx4_0/P1, 20, 0.80, 13.60
pn07/mlx4_0/P1, 23, pn19/mlx4_0/P1, 25, 0.80, 14.40
pn07/mlx4_0/P1, 23, pn21/mlx4_0/P1, 31, 0.80, 15.20
pn07/mlx4_0/P1, 23, pn22/mlx4_0/P1, 33, 0.80, 16.00
pn08/mlx4_0/P1, 17, pn16/mlx4_0/P1, 19, 0.80, 16.80
pn08/mlx4_0/P1, 17, pn18/mlx4_0/P1, 20, 0.80, 17.60
pn08/mlx4_0/P1, 17, pn19/mlx4_0/P1, 25, 0.80, 18.40
pn08/mlx4_0/P1, 17, pn21/mlx4_0/P1, 31, 0.80, 19.20
pn08/mlx4_0/P1, 17, pn22/mlx4_0/P1, 33, 0.80, 20.00

The way I interpret these results is that there must be some form of underlying synchronization going on that I am not aware of. I have no idea whether this is the doing of libfabric or of the InfiniBand protocol itself, but since ibcongest without flow control behaves similarly to the benchmark (although not linearly), I tend to assume it is the InfiniBand protocol. So some questions arise: What is causing this synchronization? Can I turn it off? Can I do so through libfabric? Is it possible to implement manual routing within libfabric? If anyone could share some insight on this issue, I would be very grateful.

Regards,

Andrew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: network_topolgy.jpg
Type: image/jpeg
Size: 83850 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/libfabric-users/attachments/20190515/b552aeae/attachment-0001.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Benchmarking_process_model.png
Type: image/png
Size: 122534 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/libfabric-users/attachments/20190515/b552aeae/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heatmap-defektor.png
Type: image/png
Size: 27102 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/libfabric-users/attachments/20190515/b552aeae/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: heatmap-healthy.png
Type: image/png
Size: 24676 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/libfabric-users/attachments/20190515/b552aeae/attachment-0005.png>
