[ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Tue Jun 17 01:45:16 PDT 2008


Hi Yiftah,

Yiftah Shahar wrote:
> Al, Yevgeny,
> 
>>>> I understand that this guid_routing_order_file is synchronized
>>>> with an MPI rank file, right? If not, then synchronizing them
>>>> might give even better results.
>>> Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
>>> responsible for MPI ranks, so I order the guids in my file according
>>> to how slurm is configured for choosing MPI ranks.  I will admit to
>>> being a novice to MPI's configuration (blindly accepting slurm MPI
>>> rankings).  Is there an underlying file that MPI libs use for
>>> ranking knowledge?
>> I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
>> can input a file to tell it how ranks should be assigned to nodes for
>> MPI.  I assume that's what you're talking about?
>>
>> Al
> The upcoming Open MPI 1.3 will support such rank placement on a
> specific node and a specific CPU; we will also have some settings
> that decide how to communicate with the different HCAs in a
> multi-HCA node (we have also had these capabilities in VLT-MPI for
> more than 2 years now, but it is entering its EOL stage...).
> 
> I think that even more important than rank placement is the
> communication pattern (i.e. some ranks communicate a lot and some do
> not send a single message), and this is far more complicated to
> handle.

Both are important.
In routing we are dealing with the congestion that occurs when some
communication involves many nodes. However, the communication
is usually not random - it has a pattern, and this pattern is
affected by the rank placement.

In some patterns (such as "shift") all the nodes send something at
every stage of the pattern; in others (such as "recursive doubling")
some nodes send all the time, and others rarely.
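
To make this concrete, here is a minimal sketch (hypothetical C, not
taken from any particular MPI implementation) of how each rank picks
its peer at stage 's' in these two patterns:

    #include <stdio.h>

    /* "shift": at stage s every rank sends to (rank + s) % n,
     * so every stage keeps all the nodes busy. */
    static int shift_peer(int rank, int stage, int nranks)
    {
            return (rank + stage) % nranks;
    }

    /* "recursive doubling": at stage s rank exchanges with
     * rank XOR 2^s; in tree-like variants (e.g. binomial reduce)
     * some ranks go idle after the first stages, so the links
     * are used unevenly. */
    static int rd_peer(int rank, int stage)
    {
            return rank ^ (1 << stage);
    }

    int main(void)
    {
            int n = 8, s;

            for (s = 1; s < n; s++)
                    printf("shift stage %d: rank 0 -> %d\n",
                           s, shift_peer(0, s, n));
            for (s = 0; (1 << s) < n; s++)
                    printf("rd    stage %d: rank 0 <-> %d\n",
                           s, rd_peer(0, s));
            return 0;
    }

Each stage activates a different set of node pairs; which wires those
pairs land on is decided entirely by the routing.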
In addition to that, there are optimizations that reduce IB
communication even more by having MPI processes on the same host
communicate over shared memory, with a single "representative"
process doing the IB communication for all of them. I think that
MVAPICH1 and Open MPI do this.

However, no matter how optimized the pattern is, in the end it has
to transmit something on the wire, so if OpenSM doesn't produce
balanced routing, you might get a single congested wire that delays
each and every stage of the MPI communication pattern.

Theoretically, the best result could be achieved if OpenSM and MPI
worked together: OpenSM would produce some kind of list describing
the topological order of the nodes, and MPI would use this
information when assigning ranks.
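
For example (a sketch only - the GUID values below are made up), a
single topologically-ordered list, one port GUID per line with node0
first (the same format as the guid_routing_order_file in Al's
patches), could drive both sides:

    0x0002c90200001111
    0x0002c90200002222
    0x0002c90200003333

OpenSM would route the guids in this order, and a scheduler or an
MPI rank file generator could assign rank i to the node that owns
the i-th GUID, so that routing and rank placement agree.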

-- Yevgeny

> Yiftah
> 
> 
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Al Chu
>> Sent: Monday, June 16, 2008 23:09
>> To: kliteyn at dev.mellanox.co.il
>> Cc: OpenIB
>> Subject: Re: [ofa-general] [OPENSM PATCH 0/5]: New
>> "guid-routing-order" option for updn routing
>>
>> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
>>> Hey Yevgeny,
>>>
>>> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
>>>> Hi Al,
>>>>
>>>> Al Chu wrote:
>>>>> Hey Sasha,
>>>>>
>>>>> This is a conceptually simple option I've developed for updn
>>>>> routing.  Currently in updn routing, nodes/guids are routed on
>>>>> switches in a seemingly-random order, which I believe is due to
>>>>> internal data structure organization (i.e. cl_qmap_apply_func is
>>>>> called on port_guid_tbl) as well as how the fabric is scanned (it
>>>>> is logically scanned from a port perspective, but it may not be
>>>>> logical from a node perspective).  I had a hypothesis that this
>>>>> was leading to increased contention in the network for MPI.
>>>>>
>>>>> For example, suppose we have 12 uplinks from a leaf switch to a
>>>>> spine switch.  If we want to send data from this leaf switch to
>>>>> node[13-24], the uplinks we will send on are pretty random.  It's
>>>>> because:
>>>>>
>>>>> A) node[13-24] are individually routed at seemingly-random points
>>>>> based on when they are called by cl_qmap_apply_func().
>>>>>
>>>>> B) the ports chosen for routing are based on least-used port
>>>>> usage.
>>>>>
>>>>> C) least-used port usage is based on whatever was routed earlier
>>>>> on.
>>>>>
>>>>> So I developed this patch series, which supports an option called
>>>>> "guid_routing_order_file" that allows the user to input a file
>>>>> with a list of port_guids indicating the order in which guids are
>>>>> routed instead (naturally, those guids not listed are routed
>>>>> last).
>>>> Great idea!
>>> Thanks.
>>>
>>>> I understand that this guid_routing_order_file is synchronized
>>>> with an MPI rank file, right? If not, then synchronizing them
>>>> might give even better results.
>>> Not quite sure what you mean by an MPI rank file.  At LLNL, slurm
>>> is responsible for MPI ranks, so I order the guids in my file
>>> according to how slurm is configured for choosing MPI ranks.  I
>>> will admit to being a novice to MPI's configuration (blindly
>>> accepting slurm MPI rankings).  Is there an underlying file that
>>> MPI libs use for ranking knowledge?
>> I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
>> can input a file to tell it how ranks should be assigned to nodes for
>> MPI.  I assume that's what you're talking about?
>>
>> Al
>>
>>>> Another idea: OpenSM can create such a file (a list - it doesn't
>>>> have to be an actual file) automatically, just by checking
>>>> topologically-adjacent leaf switches and their HCAs.
>>> Definitely a good idea.  This patch set was just a "step one" kind
>>> of thing.
>>>
>>>>> I list the port guids of the nodes of the cluster from node0 to
>>>>> nodeN, one per line in the file.  By listing the nodes in this
>>>>> order, I believe we could get less contention in the network.  In
>>>>> the example above, sending to node[13-24] should use all of the 12
>>>>> uplinks, b/c the ports will be equally used b/c nodes[1-12] were
>>>>> routed beforehand in order.
>>>>>
>>>>> The results from some tests are pretty impressive when I do this.
>>>>> With LMC=0, average bandwidth in mpiGraph goes from 391.374 MB/s
>>>>> to 573.678 MB/s when I use guid_routing_order.
>>>> Can you compare this to the fat-tree routing?  Conceptually,
>>>> fat-tree is doing the same - it routes LIDs on nodes in a
>>>> topological order, so it would be interesting to see the
>>>> comparison.
>>> Actually I already did :-).  w/ LMC=0.
>>>
>>> updn default - 391.374 MB/s
>>> updn w/ guid_routing_order - 573.678 MB/s
>>> ftree - 579.603 MB/s
>>>
>>> I later discovered that one of the internal ports of the cluster
>>> I'm testing on was broken (sLB of a 288 port), and think that is
>>> the cause of some of the slowdown w/ updn w/ guid_routing_order.
>>> So ftree (as designed) seemed to be able to work around it
>>> properly, while updn (as currently implemented) couldn't.
>>>
>>> When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were
>>> able to do better on some tests than ftree.  One example (I think
>>> these numbers are in microseconds; lower is better):
>>>
>>> Alltoall 16K packets
>>> ftree - 415490.6919
>>> updn normal (LMC=0) - 495460.5526
>>> updn w/ ordered routing (LMC=0) - 416562.7417
>>> updn w/ ordered routing (LMC=1) - 453153.7289
>>>  - this ^^^ result is quite odd.  Not sure why.
>>> updn w/ ordered routing (LMC=2) - 3660132.1530
>>>
>>> We are regularly debating what will be better overall at the end
>>> of the day.
>>>
>>>> Also, fat-tree produces the guid order file automatically, but
>>>> nobody has used it yet as input to produce an MPI rank file.
>>> I didn't know about this option.  How do you do this (I just
>>> skimmed the manpage, didn't see anything)?  I know about the
>>> --cn_guid_file.  But since that file doesn't have to be ordered,
>>> that's why I created a different option (rather than have the
>>> cn_guid_file for both ftree and updn).
>>>
>>> Al
>>>
>>>> -- Yevgeny
>>>>
>>>>> A variety of other performance increases were found when doing
>>>>> other tests, other MPIs, and other LMCs, if anyone is
>>>>> interested.
>>>>>
>>>>> BTW, I developed this patch series before your preserve-base-lid
>>>>> patch series.  It will 100% conflict with the preserve-base-lid
>>>>> patch series.  I will fix this patch series once the
>>>>> preserve-base-lids patch series is committed to git.  I'm just
>>>>> looking for comments right now.
>>>>>
>>>>> Al
>>>>>
>> --
>> Albert Chu
>> chu11 at llnl.gov
>> 925-422-5311
>> Computer Scientist
>> High Performance Systems Division
>> Lawrence Livermore National Laboratory
>>
> 



