[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Sun Jun 15 01:17:00 PDT 2008


Hi Al,

Al Chu wrote:
> Hey Sasha,
> 
> This is a conceptually simple option I've developed for updn routing.
> 
> Currently in updn routing, nodes/guids are routed on switches in a
> seemingly-random order, which I believe is due to internal data
> structure organization (i.e. cl_qmap_apply_func is called on
> port_guid_tbl) as well as how the fabric is scanned (it is logically
> scanned from a port perspective, but it may not be logical from a node
> perspective).  I had a hypothesis that this was leading to increased
> contention in the network for MPI.
> 
> For example, suppose we have 12 uplinks from a leaf switch to a spine
> switch.  If we want to send data from this leaf switch to node[13-24],
> the uplinks we send on are pretty random, because:
> 
> A) node[13-24] are individually routed at seemingly-random points based
> on when they are called by cl_qmap_apply_func().
> 
> B) the port chosen for each route is the least-used port at that time.
> 
> C) which port is least used depends on whatever was routed earlier on.
> 
> So I developed this patch series, which supports an option called
> "guid_routing_order_file" which allows the user to input a file with a
> list of port_guids which will indicate the order in which guids are
> routed instead (naturally, those guids not listed are routed last).
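
A toy model of points A) through C) and of the proposed ordering may help
here. This is only a sketch under stated assumptions, not the OpenSM code:
the names routing_order and assign_uplinks are hypothetical, a single leaf
switch with 12 uplinks is modelled, and ties between equally used uplinks
are assumed to go to the lowest port index.

```python
def routing_order(all_guids, listed_guids):
    """Listed GUIDs first, in file order; GUIDs not listed are routed last."""
    known = set(all_guids)
    seen = set()
    ordered = []
    for guid in listed_guids:
        if guid in known and guid not in seen:
            ordered.append(guid)
            seen.add(guid)
    # Everything not named in the file keeps its original (arbitrary) order.
    ordered.extend(g for g in all_guids if g not in seen)
    return ordered


def assign_uplinks(order, num_uplinks):
    """Route destinations in 'order', each on the currently least-used uplink."""
    usage = [0] * num_uplinks
    chosen = {}
    for dest in order:
        # Assumption: ties are broken by the lowest uplink index.
        port = min(range(num_uplinks), key=lambda p: usage[p])
        usage[port] += 1
        chosen[dest] = port
    return chosen


def distinct_uplinks(chosen, first, last):
    """Number of different uplinks used by node<first>..node<last>."""
    return len({chosen[f"node{i}"] for i in range(first, last + 1)})


nodes = [f"node{i}" for i in range(1, 25)]

# An arbitrary scan order: node1, node13, node2, node14, ...
interleaved = [n for i in range(1, 13) for n in (f"node{i}", f"node{i + 12}")]

unordered = assign_uplinks(interleaved, 12)
ordered = assign_uplinks(routing_order(interleaved, nodes), 12)

print("uplinks used for node[13-24], arbitrary order:", distinct_uplinks(unordered, 13, 24))
print("uplinks used for node[13-24], file order:     ", distinct_uplinks(ordered, 13, 24))
```

In this model the interleaved order puts node[13-24] on only 6 of the 12
uplinks, while routing node1..node24 in order spreads them over all 12.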

Great idea!
I understand that this guid_routing_order_file is synchronized with
an MPI rank file, right? If not, then synchronizing them might give
even better results.

Another idea: OpenSM could create such a file automatically (really a
list; it doesn't have to be an actual file), just by checking
topologically adjacent leaf switches and their HCAs.
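
To illustrate the automatic variant, here is a minimal sketch. It assumes
the leaf switches are already available in adjacency order, each with a map
from switch port number to the attached HCA port GUID. The function name,
the input structure, and the hex formatting of the GUIDs are all
assumptions made for the example, not existing OpenSM interfaces.

```python
def guid_order_from_topology(leaf_switches):
    """Build a routing order by walking leaf switches and their HCAs.

    leaf_switches: list of (switch_name, {switch_port: hca_port_guid}),
    assumed to be sorted so that adjacent leaf switches are neighbours.
    """
    order = []
    for _name, hcas_by_port in leaf_switches:
        # Within one leaf switch, follow the switch port numbering.
        for port in sorted(hcas_by_port):
            order.append(hcas_by_port[port])
    return order


def format_order(guids):
    """One port GUID per line (hex formatting is an assumption)."""
    return "".join(f"0x{guid:016x}\n" for guid in guids)


# Hypothetical two-leaf fabric with made-up GUIDs.
fabric = [
    ("leaf1", {2: 0x0002C90200000002, 1: 0x0002C90200000001}),
    ("leaf2", {1: 0x0002C90200000003}),
]
print(format_order(guid_order_from_topology(fabric)), end="")
```

The same list could be kept in memory instead of being written out, and
could also serve as the basis for an MPI rank file.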


> I list the port guids of the nodes of the cluster from node0 to nodeN, one
> per line in the file.  By listing the nodes in this order, I believe we
> could get less contention in the network.  In the example above, sending
> to node[13-24] should use all 12 uplinks, because the ports will be
> equally used, since node[1-12] were routed beforehand in order.
> 
> The results from some tests are pretty impressive when I do this. LMC=0
> average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678 MB/s
> when I use guid_routing_order.

Can you compare this to fat-tree routing?
Conceptually, fat-tree does the same thing: it routes LIDs on nodes in
topological order, so it would be interesting to see the comparison.
Also, fat-tree produces the guid order file automatically, but nobody
has used it yet as input to produce an MPI rank file.

-- Yevgeny

> I also saw a variety of other performance improvements with other
> tests, other MPIs, and other LMCs, if anyone is interested.
> 
> BTW, I developed this patch series before your preserve-base-lid patch
> series.  It will 100% conflict with the preserve-base-lid patch series.
> I will fix this patch series once the preserve-base-lids patch series is
> committed to git.  I'm just looking for comments right now.
> 
> Al
> 



