[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Tue Jun 17 03:59:00 PDT 2008


Yevgeny Kliteynik wrote:
> Hi Al,
> 
> Al Chu wrote:
>> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
>>> Hey Yevgeny,
>>>  
>>> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
>>>> Hi Al,
>>>>
>>>> Al Chu wrote:
>>>>> Hey Sasha,
>>>>>
>>>>> This is a conceptually simple option I've developed for updn routing.
>>>>>
>>>>> Currently in updn routing, nodes/guids are routed on switches in a
>>>>> seemingly-random order, which I believe is due to internal data
>>>>> structure organization (i.e. cl_qmap_apply_func is called on
>>>>> port_guid_tbl) as well as how the fabric is scanned (it is logically
>>>>> scanned from a port perspective, but it may not be logical from a node
>>>>> perspective).  I had a hypothesis that this was leading to increased
>>>>> contention in the network for MPI.
>>>>>
>>>>> For example, suppose we have 12 uplinks from a leaf switch to a spine
>>>>> switch.  If we want to send data from this leaf switch to node[13-24],
>>>>> the up links we will send on are pretty random. It's because:
>>>>>
>>>>> A) node[13-24] are individually routed at seemingly-random points
>>>>> based on when they are called by cl_qmap_apply_func().
>>>>>
>>>>> B) the port chosen for routing is the least-used port.
>>>>>
>>>>> C) which port is least used depends on whatever was routed earlier on.
>>>>>
>>>>> So I developed this patch series, which supports an option called
>>>>> "guid_routing_order_file" which allows the user to input a file with a
>>>>> list of port_guids which will indicate the order in which guids are
>>>>> routed instead (naturally, those guids not listed are routed last).
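
A minimal sketch of the ordering rule described above, written in Python
rather than OpenSM's C. The function name routing_order and the sample guids
are hypothetical. It only illustrates "listed guids first, in file order,
unlisted guids last" and is not the code from the patch series.

```python
def routing_order(fabric_guids, listed_guids):
    """Return guids in the order they would be routed.

    Guids from the order file come first, in file order. Guids that are in
    the fabric but not listed keep their original relative order and go last.
    Listed guids that are not in the fabric are skipped (an assumption).
    """
    present = set(fabric_guids)
    seen = set()
    ordered = []
    for guid in listed_guids:
        if guid in present and guid not in seen:
            ordered.append(guid)
            seen.add(guid)
    # Everything the file did not mention is routed afterwards.
    ordered.extend(g for g in fabric_guids if g not in seen)
    return ordered


# Hypothetical port guids, in the order a fabric scan happened to find them.
fabric = [0x0004, 0x0002, 0x0005, 0x0001, 0x0003]
# Contents of a hypothetical order file: node0, node1, node2.
order_file = [0x0001, 0x0002, 0x0003]
order = routing_order(fabric, order_file)
```

Here the three listed guids are routed first and the two unlisted ones follow
in scan order.
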
>>>> Great idea!
>>> Thanks.
>>>
>>>> I understand that this guid_routing_order_file is synchronized with
>>>> an MPI rank file, right? If not, then synchronizing them might give
>>>> even better results.
>>> Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
>>> responsible for MPI ranks, so I order the guids in my file according to
>>> how slurm is configured for choosing MPI ranks.  I will admit to being a
>>> novice to MPI's configuration (blindly accepting slurm MPI rankings).
>>> Is there an underlying file that MPI libs use for ranking knowledge?
>>
>> I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
>> can input a file to tell it how ranks should be assigned to nodes for
>> MPI.  I assume that's what you're talking about?
> 
> Yes, that is what I was talking about.
> There is a host file, where you list all the hosts that MPI should use,
> and in some MPIs there is also a way to specify the order of MPI ranks
> that would be assigned to processes (I'm not an MPI expert, so I'm not
> sure about the terminology that I use).
> I know that MVAPICH is using the host order when assigning ranks, so
> the order of the cluster nodes listed in host file is important.
> Not sure about OpenMPI.
> 
>>>> Another idea: OpenSM can create such a file (a list, doesn't have to be
>>>> an actual file) automatically, just by checking topologically-adjacent
>>>> leaf switches and their HCAs.
>>> Definitely a good idea.  This patch set was just a "step one" kind of
>>> thing.
>>>
>>>>> I list the port guids of the nodes of the cluster from node0 to
>>>>> nodeN, one per line in the file.  By listing the nodes in this order,
>>>>> I believe we could get less contention in the network.  In the example
>>>>> above, sending to node[13-24] should use all of the 12 uplinks, b/c
>>>>> the ports will be equally used b/c nodes[1-12] were routed beforehand
>>>>> in order.
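
The effect claimed above can be sketched with a toy model. This is a
simplification under stated assumptions, not OpenSM's updn code: a single leaf
switch with 12 uplinks, every destination reachable through any uplink, and a
"least-used port, lowest index on ties" rule. The names assign_uplinks and
distinct_uplinks are made up for the illustration, and the shuffle is only a
stand-in for the seemingly-random order.

```python
import random

UPLINKS = 12  # uplinks from the leaf switch to the spine, as in the example


def assign_uplinks(route_order, n_uplinks=UPLINKS):
    """Give each destination the currently least-used uplink (ties: lowest index)."""
    usage = [0] * n_uplinks
    chosen = {}
    for dest in route_order:
        port = usage.index(min(usage))
        usage[port] += 1
        chosen[dest] = port
    return chosen


def distinct_uplinks(chosen, dests):
    """Count how many different uplinks carry traffic to the given destinations."""
    return len({chosen[d] for d in dests})


nodes = list(range(1, 49))  # node1..node48, all reached through the uplinks
target = range(13, 25)      # node[13-24]

# Routing in node order: node[1-12] fill the uplinks evenly first.
in_order = assign_uplinks(nodes)

# Routing in a shuffled order: node[13-24] land wherever they happen to fall.
shuffled = nodes[:]
random.Random(0).shuffle(shuffled)
scrambled = assign_uplinks(shuffled)

print("ordered :", distinct_uplinks(in_order, target), "distinct uplinks")
print("shuffled:", distinct_uplinks(scrambled, target), "distinct uplinks")
```

In this model the ordered case always spreads node[13-24] over all 12
uplinks. The shuffled case can put several of them on the same uplink, which
is the contention the option is meant to avoid.
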
>>>>>
>>>>> The results from some tests are pretty impressive when I do this.
>>>>> LMC=0 average bandwidth in mpiGraph goes from 391.374 MB/s to
>>>>> 573.678 MB/s when I use guid_routing_order.
>>>> Can you compare this to the fat-tree routing?  Conceptually, fat-tree
>>>> is doing the same - it routes LIDs on nodes in a topological order, so
>>>> it would be interesting to see the comparison.
>>> Actually I already did :-).  w/ LMC=0.
>>>
>>> updn default - 391.374 MB/s
>>> updn w/ guid_routing_order - 573.678 MB/s
>>> ftree - 579.603 MB/s
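
For reference, the relative gains implied by these three numbers can be worked
out with a few lines of Python. The figures are copied from the list above and
nothing else is assumed. The names results and gain are just for this
illustration.

```python
# Average mpiGraph bandwidth in MB/s, LMC=0, copied from the results above.
results = {
    "updn default": 391.374,
    "updn w/ guid_routing_order": 573.678,
    "ftree": 579.603,
}

baseline = results["updn default"]
# Percentage change of each configuration relative to default updn.
gain = {name: 100.0 * (bw / baseline - 1.0) for name, bw in results.items()}

for name, pct in gain.items():
    print(f"{name:28s} {results[name]:8.3f} MB/s  {pct:+6.1f}% vs updn default")
```

That is roughly a 47% gain for updn w/ guid_routing_order and roughly 48% for
ftree, both relative to default updn.
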
>>>
>>> I later discovered that one of the internal ports of the cluster I'm
>>> testing on was broken (sLB of a 288 port), and think that is the cause
>>> of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
>>> designed) seemed to be able to work around it properly, while updn (as
>>> currently implemented) couldn't.
>>>
>>> When we turn on LMC > 0, mpi libraries that are LMC > 0 aware were able
>>> to do better on some tests than ftree.  One example (I think these
>>> numbers are in microseconds.  Lower is better):
>>>
>>> Alltoall 16K packets
>>> ftree - 415490.6919
>>> updn normal (LMC=0) - 495460.5526
>>> updn w/ ordered routing (LMC=0) - 416562.7417
>>> updn w/ ordered routing (LMC=1) - 453153.7289
>>>  - this ^^^ result is quite odd.  Not sure why.
>>> updn w/ ordered routing (LMC=2) - 366013.2153
>>>
>>> We are regularly debating what will be better overall at the end of the
>>> day.
>>>
>>>> Also, fat-tree produces the guid order file automatically, but nobody
>>>> has used it yet as input to produce an MPI rank file.
>>> I didn't know about this option.  How do you do this (just skimmed the
>>> manpage, didn't see anything)? 
> 
> Right, it's missing there. I'll add this info.

Nope, it's there:

  "The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
   in the same directory where the OpenSM log resides. This ordering file provides
   the CN order that may be used to create efficient communication pattern, that
   will match the routing tables."

-- Yevgeny


> The file is /var/log/opensm-ftree-ca-order.dump.
> Small correction though - the file contains an ordered list of HCA LIDs
> and their host names. It's not a problem to change it to have guids
> as well, but MPI doesn't need guids anyway.
> Note that the optimal order might be different depending on the current
> topology state and the location of the management node that runs OpenSM.
> 
>>> I know about the --cn_guid_file.  But
>>> since that file doesn't have to be ordered, that's why I created a
>>> different option (rather than have the cn_guid_file for both ftree and
>>> updn).
> 
> Right, the cn file doesn't have to be ordered - ftree will order it
> by itself. The ordering is by topology-adjacent leaf switches.
> 
> -- Yevgeny
> 
>>>
>>> Al
>>>
>>>> -- Yevgeny
>>>>
>>>>> A variety of other positive performance increases were found when
>>>>> doing other tests, other MPIs, and other LMCs if anyone is interested.
>>>>>
>>>>> BTW, I developed this patch series before your preserve-base-lid patch
>>>>> series.  It will 100% conflict with the preserve-base-lid patch series.
>>>>> I will fix this patch series once the preserve-base-lids patch series
>>>>> is committed to git.  I'm just looking for comments right now.
>>>>>
>>>>> Al
>>>>>
> 