[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Tue Jun 17 03:59:00 PDT 2008
Yevgeny Kliteynik wrote:
> Hi Al,
>
> Al Chu wrote:
>> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
>>> Hey Yevgeny,
>>>
>>> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
>>>> Hi Al,
>>>>
>>>> Al Chu wrote:
>>>>> Hey Sasha,
>>>>>
>>>>> This is a conceptually simple option I've developed for updn routing.
>>>>>
>>>>> Currently in updn routing, nodes/guids are routed on switches in a
>>>>> seemingly-random order, which I believe is due to internal data
>>>>> structure organization (i.e. cl_qmap_apply_func is called on
>>>>> port_guid_tbl) as well as how the fabric is scanned (it is logically
>>>>> scanned from a port perspective, but it may not be logical from a node
>>>>> perspective). I had a hypothesis that this was leading to increased
>>>>> contention in the network for MPI.
>>>>>
>>>>> For example, suppose we have 12 uplinks from a leaf switch to a spine
>>>>> switch. If we want to send data from this leaf switch to node[13-24],
>>>>> the uplinks we send on are pretty random. That's because:
>>>>>
>>>>> A) node[13-24] are individually routed at seemingly-random points,
>>>>> based on when they are called by cl_qmap_apply_func().
>>>>>
>>>>> B) the port chosen for each route is the least-used port at that time.
>>>>>
>>>>> C) which port is least used depends on whatever was routed earlier on.
>>>>>
>>>>> So I developed this patch series, which supports an option called
>>>>> "guid_routing_order_file" which allows the user to input a file with a
>>>>> list of port_guids which will indicate the order in which guids are
>>>>> routed instead (naturally, those guids not listed are routed last).
>>>> Great idea!
>>> Thanks.
>>>
>>>> I understand that this guid_routing_order_file is synchronized with
>>>> an MPI rank file, right? If not, then synchronizing them might give
>>>> even better results.
>>> Not quite sure what you mean by an MPI rank file. At LLNL, slurm is
>>> responsible for MPI ranks, so I order the guids in my file according to
>>> how slurm is configured for choosing MPI ranks. I will admit to being a
>>> novice to MPI's configuration (blindly accepting slurm MPI rankings).
>>> Is there an underlying file that MPI libs use for ranking knowledge?
>>
>> I spoke to one of our MPI guys. I wasn't aware that in some MPIs you
>> can input a file to tell it how ranks should be assigned to nodes for
>> MPI. I assume that's what you're talking about?
>
> Yes, that is what I was talking about.
> There is a host file, where you list all the hosts that MPI should use,
> and in some MPIs there is also a way to specify the order of MPI ranks
> that would be assigned to processes (I'm not an MPI expert, so I'm not
> sure about the terminology that I use).
> I know that MVAPICH is using the host order when assigning ranks, so
> the order of the cluster nodes listed in host file is important.
> Not sure about OpenMPI.
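
To illustrate the synchronization idea, here is a minimal Python sketch. It is
not part of OpenSM or any MPI: the function names, the node naming scheme and
the guid values are all hypothetical, and the 0x-prefixed hex guid format and
the one-hostname-per-line host file are assumptions. It derives both lists
from a single node ordering, so the routing order and the rank order agree.

```python
import re

def node_sort_key(hostname):
    """Sort by name prefix, then trailing number (node2 before node10)."""
    match = re.search(r"(\d+)$", hostname)
    if match:
        return (hostname[:match.start()], int(match.group(1)))
    return (hostname, -1)

def build_order_files(host_to_guid):
    """Return (guid_lines, host_lines), both in the same node order."""
    hosts = sorted(host_to_guid, key=node_sort_key)
    # Assumed format: one 0x-prefixed 64-bit hex port guid per line.
    guid_lines = ["0x%016x" % host_to_guid[h] for h in hosts]
    return guid_lines, hosts

# Hypothetical cluster: hostnames and port guids are made up.
example = {
    "node10": 0x0002C9020000000A,
    "node2": 0x0002C90200000002,
    "node1": 0x0002C90200000001,
}
guid_lines, host_lines = build_order_files(example)
print("\n".join(guid_lines))   # candidate content for the guid order file
print("\n".join(host_lines))   # candidate content for an MPI host file
```

Whether a given MPI or resource manager actually assigns ranks in host file
order has to be checked per installation.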
>
>>>> Another idea: OpenSM can create such a file (a list, it doesn't have to
>>>> be an actual file) automatically, just by checking topologically-adjacent
>>>> leaf switches and their HCAs.
>>> Definitely a good idea. This patch set was just a "step one" kind of
>>> thing.
>>>
>>>>> I list the port guids of the nodes of the cluster from node0 to nodeN,
>>>>> one per line in the file. By listing the nodes in this order, I believe
>>>>> we could get less contention in the network. In the example above,
>>>>> sending to node[13-24] should use all of the 12 uplinks, b/c the ports
>>>>> will be equally used b/c nodes[1-12] were routed beforehand in order.
>>>>>
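
The effect described above can be reproduced with a toy model. This is a
sketch only, not OpenSM code: it assumes a single leaf switch with 12 uplinks,
24 hypothetical destinations that are all reached through those uplinks, and a
"least-used port, lowest index on ties" rule. It counts how many distinct
uplinks carry traffic to node[13-24] when destinations are routed in node
order versus in a shuffled order.

```python
import random

def route(order, num_uplinks=12):
    """Assign each destination, in the given order, to the least-used uplink."""
    usage = [0] * num_uplinks
    assignment = {}
    for dest in order:
        port = usage.index(min(usage))  # least-used port, lowest index on ties
        usage[port] += 1
        assignment[dest] = port
    return assignment

def uplinks_used(assignment, group):
    """Number of distinct uplinks that carry traffic to the given group."""
    return len({assignment[dest] for dest in group})

nodes = list(range(1, 25))      # node1..node24 (hypothetical)
group = list(range(13, 25))     # node13..node24

ordered = route(nodes)

shuffled = nodes[:]
random.Random(0).shuffle(shuffled)   # stand-in for the seemingly-random order
scrambled = route(shuffled)

print("ordered routing :", uplinks_used(ordered, group), "uplinks")
print("shuffled routing:", uplinks_used(scrambled, group), "uplinks")
```

In this model the ordered case always spreads node[13-24] over all 12
uplinks, while a shuffled order may put several of them on the same uplink.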
>>>>> The results from some tests are pretty impressive when I do this. LMC=0
>>>>> average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678 MB/s
>>>>> when I use guid_routing_order.
>>>> Can you compare this to the fat-tree routing? Conceptually, fat-tree
>>>> is doing the same - it routes LIDs on nodes in a topological order, so
>>>> it would be interesting to see the comparison.
>>> Actually I already did :-). w/ LMC=0.
>>>
>>> updn default - 391.374 MB/s
>>> updn w/ guid_routing_order - 573.678 MB/s
>>> ftree - 579.603 MB/s
>>>
>>> I later discovered that one of the internal ports of the cluster I'm
>>> testing on was broken (sLB of a 288 port), and think that is the cause
>>> of some of the slowdown w/ updn w/ guid_routing_order. So ftree (as
>>> designed) seemed to be able to work around it properly, while updn (as
>>> currently implemented) couldn't.
>>>
>>> When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
>>> to do better on some tests than ftree. One example (I think these
>>> numbers are in microseconds; lower is better):
>>>
>>> Alltoall 16K packets
>>> ftree - 415490.6919
>>> updn normal (LMC=0) - 495460.5526
>>> updn w/ ordered routing (LMC=0) - 416562.7417
>>> updn w/ ordered routing (LMC=1) - 453153.7289
>>> - this ^^^ result is quite odd. Not sure why.
>>> updn w/ ordered routing (LMC=2) - 3660132.1530
>>>
>>> We are regularly debating what will be better overall at the end of the
>>> day.
>>>
>>>> Also, fat-tree produces the guid order file automatically, but nobody
>>>> has used it yet as input to produce an MPI rank file.
>>> I didn't know about this option. How do you do this (just skimmed the
>>> manpage, didn't see anything)?
>
> Right, it's missing there. I'll add this info.
Nope, it's there:
"The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
in the same directory where the OpenSM log resides. This ordering file provides
the CN order that may be used to create efficient communication pattern, that
will match the routing tables."
-- Yevgeny
> The file is /var/log/opensm-ftree-ca-order.dump.
> Small correction though - the file contains an ordered list of HCA LIDs
> and their host names. It's not a problem to change it to have guids
> as well, but MPI doesn't need guids anyway.
> Note that the optimal order might be different depending on the current
> topology state and the location of the management node that runs OpenSM.
>
>>> I know about the --cn_guid_file. But since that file doesn't have to
>>> be ordered, I created a different option (rather than use the
>>> cn_guid_file for both ftree and updn).
>
> Right, the cn file doesn't have to be ordered - ftree will order it
> by itself. The ordering is by topology-adjacent leaf switches.
>
> -- Yevgeny
>
>>>
>>> Al
>>>
>>>> -- Yevgeny
>>>>
>>>>> A variety of other positive performance increases were found when
>>>>> doing other tests, other MPIs, and other LMCs if anyone is interested.
>>>>>
>>>>> BTW, I developed this patch series before your preserve-base-lid patch
>>>>> series. It will 100% conflict with the preserve-base-lid patch series.
>>>>> I will fix this patch series once the preserve-base-lid patch series
>>>>> is committed to git. I'm just looking for comments right now.
>>>>>
>>>>> Al
>>>>>
>