[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Tue Jun 17 01:17:01 PDT 2008


Hi Al,

Al Chu wrote:
> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
>> Hey Yevgeny,
>>  
>> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
>>> Hi Al,
>>>
>>> Al Chu wrote:
>>>> Hey Sasha,
>>>>
>>>> This is a conceptually simple option I've developed for updn routing.
>>>>
>>>> Currently in updn routing, nodes/guids are routed on switches in a
>>>> seemingly random order.  I believe this is due to internal data
>>>> structure organization (i.e. cl_qmap_apply_func is called on
>>>> port_guid_tbl) as well as how the fabric is scanned (the scan is
>>>> logical from a port perspective, but its order may not be logical
>>>> from a node perspective).  My hypothesis was that this leads to
>>>> increased network contention for MPI.
>>>>
>>>> For example, suppose we have 12 uplinks from a leaf switch to a spine
>>>> switch.  If we want to send data from this leaf switch to node[13-24],
>>>> the uplinks we send on are effectively random (see the toy example
>>>> after this list), because:
>>>>
>>>> A) node[13-24] are individually routed at seemingly random points,
>>>> based on when they are visited by cl_qmap_apply_func().
>>>>
>>>> B) each route picks the currently least-used port.
>>>>
>>>> C) which port is least used depends on whatever was routed earlier.
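>>>>
>>>> Here is a toy demonstration of B and C (standalone C, not OpenSM
>>>> code): the uplink a destination lands on is purely a function of
>>>> what was routed before it, so routing node[1-24] in node order
>>>> spreads node[13-24] evenly over all 12 uplinks.
>>>>
>>>> #include <stdio.h>
>>>>
>>>> #define NUM_UPLINKS 12
>>>>
>>>> static int pick_least_used(const unsigned used[NUM_UPLINKS])
>>>> {
>>>> 	int p, best = 0;
>>>>
>>>> 	for (p = 1; p < NUM_UPLINKS; p++)
>>>> 		if (used[p] < used[best])
>>>> 			best = p;
>>>> 	return best;
>>>> }
>>>>
>>>> int main(void)
>>>> {
>>>> 	unsigned used[NUM_UPLINKS] = { 0 };
>>>> 	int node, port;
>>>>
>>>> 	/* node[1-12] take uplinks 0-11 in turn; node[13-24] then
>>>> 	 * reuse all 12 uplinks evenly.  Feed the nodes in a scrambled
>>>> 	 * order and the node->uplink mapping scrambles with it. */
>>>> 	for (node = 1; node <= 24; node++) {
>>>> 		port = pick_least_used(used);
>>>> 		used[port]++;
>>>> 		printf("node%d -> uplink %d\n", node, port);
>>>> 	}
>>>> 	return 0;
>>>> }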
>>>>
>>>> So I developed this patch series, which adds an option called
>>>> "guid_routing_order_file".  It lets the user supply a file with a
>>>> list of port_guids that dictates the order in which guids are
>>>> routed (naturally, guids not listed are routed last).
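>>>>
>>>> For illustration, the setup might look like this (guids and paths
>>>> are made up; the exact opensm.conf spelling is whatever this patch
>>>> series ends up with):
>>>>
>>>>   # opensm.conf
>>>>   guid_routing_order_file /etc/opensm/guid_routing_order.conf
>>>>
>>>>   # /etc/opensm/guid_routing_order.conf - node0's port guid first:
>>>>   0x0002c90300000001
>>>>   0x0002c90300000002
>>>>   0x0002c90300000003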
>>> Great idea!
>> Thanks.
>>
>>> I understand that this guid_routing_order_file is synchronized with
>>> an MPI rank file, right? If not, then synchronizing them might give
>>> even better results.
>> Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
>> responsible for MPI ranks, so I order the guids in my file according to
>> how slurm is configured for choosing MPI ranks.  I will admit to being a
>> novice at MPI configuration (I blindly accept slurm's MPI rankings).
>> Is there an underlying file that MPI libs use for ranking knowledge?
> 
> I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
> can input a file to tell it how ranks should be assigned to nodes.
> I assume that's what you're talking about?

Yes, that is what I was talking about.
There is a host file where you list all the hosts that MPI should use,
and some MPIs also provide a way to specify the order in which MPI
ranks are assigned to processes (I'm not an MPI expert, so I'm not
sure about the terminology).
I know that MVAPICH uses the host order when assigning ranks, so the
order of the cluster nodes listed in the host file is important.
Not sure about OpenMPI.
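
For example (host names are made up, and the exact syntax varies by
MPI and launcher), an MVAPICH-style host file where line order also
becomes rank order might look like:

  # hostfile: rank 0 -> node001, rank 1 -> node002, ...
  node001
  node002
  node003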

>>> Another idea: OpenSM can create such a file (a list - it doesn't have
>>> to be an actual file) automatically, just by checking topologically
>>> adjacent leaf switches and their HCAs.
>> Definitely a good idea.  This patch set was just a "step one" kind of
>> thing.
>>
>>>> I list the port guids of the cluster's nodes from node0 to nodeN, one
>>>> per line in the file.  By listing the nodes in this order, I believe
>>>> we get less contention in the network.  In the example above, sending
>>>> to node[13-24] should use all 12 uplinks, because the ports will be
>>>> used equally, since node[1-12] were routed beforehand in order.
>>>>
>>>> The results from some tests are pretty impressive when I do this.
>>>> With LMC=0, average bandwidth in mpiGraph goes from 391.374 MB/s to
>>>> 573.678 MB/s when I use guid_routing_order.
>>> Can you compare this to fat-tree routing?  Conceptually, fat-tree
>>> does the same - it routes node LIDs in topological order - so it
>>> would be interesting to see the comparison.
>> Actually I already did :-).  w/ LMC=0.
>>
>> updn default - 391.374 MB/s
>> updn w/ guid_routing_order - 573.678 MB/s
>> ftree - 579.603 MB/s
>>
>> I later discovered that one of the internal ports of the cluster I'm
>> testing on was broken (an sLB of a 288-port switch), and I think that
>> explains some of the slowdown w/ updn w/ guid_routing_order.  ftree
>> (as designed) seemed to be able to work around it properly, while
>> updn (as currently implemented) couldn't.
>>
>> When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were
>> able to do better than ftree on some tests.  One example (I think
>> these numbers are in microseconds; lower is better):
>>
>> Alltoall 16K packets
>> ftree - 415490.6919
>> updn normal (LMC=0) - 495460.5526
>> updn w/ ordered routing (LMC=0) - 416562.7417
>> updn w/ ordered routing (LMC=1) - 453153.7289
>>  - this ^^^ result is quite odd.  Not sure why.
>> updn w/ ordered routing (LMC=2) - 3660132.1530
>>
>> We are regularly debating what will be better overall at the end of the
>> day.
>>
>>> Also, fat-tree produces the guid order file automatically, but nobody
>>> has used it yet as input for producing an MPI rank file.
>> I didn't know about this option.  How do you do this (just skimmed the
>> manpage, didn't see anything)? 

Right, it's missing there. I'll add this info.
The file is /var/log/opensm-ftree-ca-order.dump.
One small correction though - the file contains an ordered list of HCA
LIDs and their host names. It wouldn't be a problem to change it to
include guids as well, but MPI doesn't need guids anyway.
Note that the optimal order might differ depending on the current
topology state and on the location of the management node that runs
OpenSM.
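
Just as an illustration (LIDs and host names are made up, and the
exact column layout may differ), think of the dump as an ordered list
of "<LID> <host name>" lines:

  0x0003 node001
  0x0007 node002
  0x000b node003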

>> I know about the --cn_guid_file.  But since that file doesn't have to
>> be ordered, I created a different option (rather than reuse the
>> cn_guid_file for both ftree and updn).

Right, the cn file doesn't have to be ordered - ftree orders it by
itself, by topologically adjacent leaf switches.
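
For reference, the cn file itself is just a flat list of the compute
nodes' port guids, one per line (made-up values here), in any order:

  0x0002c90300000001
  0x0002c90300000002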

-- Yevgeny

>>
>> Al
>>
>>> -- Yevgeny
>>>
>>>> A variety of other performance improvements showed up when running
>>>> other tests, other MPIs, and other LMCs, if anyone is interested.
>>>>
>>>> BTW, I developed this patch series before your preserve-base-lid
>>>> patch series, and the two will definitely conflict.  I will rebase
>>>> this series once the preserve-base-lid series is committed to git.
>>>> I'm just looking for comments right now.
>>>>
>>>> Al
>>>>



