[ofa-general] Questions Concerning a 3D Torus

Wed Oct 15 15:36:08 PDT 2008

Hello,

I have a number of questions related to the construction and operation
of a 3D torus with InfiniBand.

We would like to create an IB network arranged as a 3D mesh with
wrap-around links.  That is, a 3D torus.  Each vertex of this torus
would be an InfiniScale IV (I4) with single 4x connections to up to
12 ConnectX-based host HCAs and three 4x connections (3-4x) to its
neighbor in each of six directions (plus and minus along X, Y, and Z).
The smallest network that illustrates the most
interesting aspects of this setup is a 3x3x3 torus.  I've created a
basic illustration of this here.  Pick your favorite file format:

        http://bohnsack.com/3DTorus.svg
        http://bohnsack.com/3DTorus.pdf

In this diagram, each square is a single I4, and each line represents 3
independent 4x connections.  To avoid making the diagram too
complicated, host connections are only shown for a single switch.  You
should imagine 12 hosts hanging off of each square.  Again, to avoid an
unreadable diagram, not all of the Y dimension wrap-around links are
shown, but you should consider them present for the purposes of the
network I'm describing.
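
For concreteness, the topology described above can be enumerated in a
few lines of Python.  This is only a sketch: the function and variable
names are made up for illustration, and it counts switches, hosts, and
inter-switch 4x links rather than modeling real IB ports.

```python
from itertools import product

HOSTS_PER_SWITCH = 12      # from the description above
LINKS_PER_NEIGHBOR = 3     # three independent 4x links per neighbor


def torus_neighbors(coord, dims):
    """Return the neighbors of a switch in +/- each dimension, with wrap-around."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    return neighbors


dims = (3, 3, 3)
switches = list(product(*(range(d) for d in dims)))

# Each unordered pair of adjacent switches is one "line" in the diagram.
neighbor_pairs = {
    frozenset((s, n)) for s in switches for n in torus_neighbors(s, dims)
}

total_hosts = len(switches) * HOSTS_PER_SWITCH
total_4x_links = len(neighbor_pairs) * LINKS_PER_NEIGHBOR

print(len(switches), "switches,", total_hosts, "hosts,",
      total_4x_links, "switch-to-switch 4x links")
```

Changing dims to (8, 8, 8) gives the counts for the larger network
mentioned further down.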

Questions:

1) What do you call the topology I'm describing, strictly speaking?
It's kind of like each switch chip vertex has a sub-graph connected to
all the host HCAs.  Perhaps this thing is a "decorated 3D Torus"?

2) I think this network is a little different from the 3D tori that
have been deployed previously in machines like Red Storm, where each
network switch serves one or at most a few compute clients.  Does the
fact that there are on the order of 10 times more hosts hanging off of
each switch chip vertex in the network I'm describing matter from a
routing perspective?  It seems that the routing problem is mostly the
same.  I.e., algorithms to determine a set of deadlock-free routes on
the same basic topology, ignoring the "decorations", are similar.  Is
this right?

3) Is there good current support for computing deadlock-free routes on
the network I'm describing in OFED 1.3.1, 1.4, or another release?  With
which routing algorithm?  I tried to find an answer to this by looking
through various OFED documentation, but I'm still somewhat confused
about how to proceed.  Here's the data that I was able to gather.  Can
someone please help to clarify?

- An OpenSM wiki page says that OpenSM supports "Torus routing":
https://wiki.openfabrics.org/tiki-index.php?page=OpenSM&highlight=opensm
- However, the latest release notes I can find don't make any explicit
mention of tori:
http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/opensm_release_notes-3.2.txt;hb=HEAD
- The release notes do mention Dimension Order routing (DOR), and this
might work for a 3D torus, but it seems (per the notes) that this
algorithm, as implemented, is only considered deadlock-free for meshes
(no wrap-around) and hypercubes, not tori.  I understand that when you
"wrap around" the dimension, the virtual channel used needs to change to
avoid deadlock in DOR, and DOR as implemented today doesn't do that.
- Commit b204932d5bd2a88af5ce0989d2dff65d753b3d54 from
git://git.openfabrics.org/~sashak/management.git in March of 2007
mentions some degree of success with LASH on 2D tori, but it's
considered "unoptimized".  Would LASH produce deadlock-free routes for a
3D torus?  What does "unoptimized" imply for something like an 8x8x8
torus with lots of hosts at each vertex?
- I didn't see any other mention of a torus or tori in the OpenSM commit
logs.
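
To make the DOR concern above concrete, here is a minimal sketch of
dimension-order routing on a torus that moves a packet to a second
virtual lane when it crosses a wrap-around link (the usual "dateline"
idea).  It is an illustration under my own assumptions, not how OpenSM
implements DOR: dor_route and the (from, to, vl) hop format are
hypothetical, and the mapping onto real SL/VL tables is left out.

```python
def dor_route(src, dst, dims):
    """Route X first, then Y, then Z, taking the shorter way around each ring.

    Returns a list of hops (from_switch, to_switch, virtual_lane).
    A packet starts each dimension on lane 0 and moves to lane 1 from the
    wrap-around link onward, so no single lane forms a closed ring.
    """
    hops = []
    cur = list(src)
    for axis, size in enumerate(dims):
        delta = (dst[axis] - cur[axis]) % size
        if delta == 0:
            continue
        # Shorter direction around the ring; ties go in the + direction.
        step = 1 if delta <= size // 2 else -1
        vl = 0
        while cur[axis] != dst[axis]:
            crosses_wrap = (step == 1 and cur[axis] == size - 1) or \
                           (step == -1 and cur[axis] == 0)
            if crosses_wrap:
                vl = 1
            nxt = list(cur)
            nxt[axis] = (cur[axis] + step) % size
            hops.append((tuple(cur), tuple(nxt), vl))
            cur = nxt
    return hops


for hop in dor_route((2, 0, 0), (0, 1, 0), (3, 3, 3)):
    print(hop)
```

Without the lane change, the links of each ring can all wait on one
another in a cycle, which is the deadlock the release notes appear to
be warning about.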

4) Is there an existing utility inside of OFED that can be used to
verify that routes generated by the SM are deadlock free?  I.e., can I
dump routes from OpenSM and then run a utility on them that can identify
potentials for deadlock?
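
A minimal sketch of such a check, assuming the routes have already
been converted into ordered lists of channels (for example
(link, virtual lane) pairs).  It builds a channel dependency graph and
looks for a cycle; an acyclic graph is the standard sufficient
condition for deadlock freedom with deterministic routing.  The input
format and function names are hypothetical, and parsing an actual
OpenSM dump is not shown.

```python
from collections import defaultdict, deque


def build_cdg(routes):
    """Map each channel to the set of channels a packet may request while holding it."""
    deps = defaultdict(set)
    for route in routes:
        for ch in route:
            deps.setdefault(ch, set())
        for held, wanted in zip(route, route[1:]):
            deps[held].add(wanted)
    return deps


def has_cycle(deps):
    """Kahn's algorithm: a cycle exists iff some channel can never be removed."""
    indeg = {ch: 0 for ch in deps}
    for outs in deps.values():
        for ch in outs:
            indeg[ch] += 1
    queue = deque(ch for ch, d in indeg.items() if d == 0)
    removed = 0
    while queue:
        ch = queue.popleft()
        removed += 1
        for nxt in deps[ch]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    return removed != len(deps)


# Three switches A, B, C in a one-way ring, every route two hops long.
single_lane = [
    [("A->B", 0), ("B->C", 0)],
    [("B->C", 0), ("C->A", 0)],
    [("C->A", 0), ("A->B", 0)],
]
# Same routes, but lane 1 is used from the wrap-around link C->A onward.
two_lanes = [
    [("A->B", 0), ("B->C", 0)],
    [("B->C", 0), ("C->A", 1)],
    [("C->A", 1), ("A->B", 1)],
]
print(has_cycle(build_cdg(single_lane)), has_cycle(build_cdg(two_lanes)))
```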

5) I'm aware of ibsim, and others I'm collaborating with have done the
first level of testing with it on 3D tori.  My question is how much more
simulation can I do on a "mock" topology with freely-available tools?
Am I limited to MAD traffic, or is there a way to simulate a real
workload (perhaps MPI traffic)?  How about commercial tools?

6) As I previously mentioned, there are three 4x links running between each
switch chip, along each dimension.  Our plan is to run these as 3
independent links, but it should be possible to logically aggregate them
into a single 12x link.  At first glance, a 12x link might be subject to
less congestion but it could also incur store-and-forward penalties
producing unwanted latency as 4x flows are converted to 12x and then
back to 4x again.  My questions around this topic: Is any additional
insight into this issue available (theoretical or empirical)?  Should we
even worry about testing 12x connections?

7) An artifact that results from the way we intend to connect the I4
switches together is that there's a possibility of having 9-4x
connections on every other link in one dimension.  E.g., looking along
the Z dimension, switch-to-switch "widths" could alternate between
9-4x and 3-4x links.  Links in other dimensions would all be limited to
3-4x as before.  The end result of this configuration would be that
certain groups of two I4s and their connected hosts along one dimension
would have a little bit better connectivity than other I4 groupings.
This kind of configuration in our setup would be "free" by simply doing
some logical configuration.  This change may or may not be beneficial
depending on our applications and workloads.  I'm not so concerned about
that issue at present.  My main question today is whether this kind of
"heterogeneity" would cause routing issues/complications.  Would it?  If
not, it might be a no-brainer to enable it.  What do you think?
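
The port arithmetic behind the two layouts can be checked with a short
sketch.  It assumes the I4 exposes 36 4x ports (the port count Mellanox
quotes for InfiniScale IV; it is not stated above), and that the wider
Z link simply absorbs the ports left unused in the uniform layout.  The
names are illustrative only.

```python
I4_PORTS = 36          # assumption: port count of one InfiniScale IV
HOSTS_PER_SWITCH = 12  # from the description above


def ports_used(hosts, link_widths):
    """Total 4x ports consumed by host links plus all switch-to-switch links."""
    return hosts + sum(link_widths.values())


# Uniform torus: 3-4x to the neighbor in each of the six directions.
uniform = {"+x": 3, "-x": 3, "+y": 3, "-y": 3, "+z": 3, "-z": 3}

# Heterogeneous Z: 9-4x toward one Z neighbor, 3-4x toward the other.
hetero_z = dict(uniform)
hetero_z["+z"] = 9

for name, widths in (("uniform", uniform), ("heterogeneous Z", hetero_z)):
    used = ports_used(HOSTS_PER_SWITCH, widths)
    print(name, "uses", used, "ports, leaving", I4_PORTS - used, "spare")
```

Under these assumptions the uniform layout leaves six ports idle per
switch, and the heterogeneous layout uses all of them.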

A diagram of the heterogeneous 3D torus I'm talking about is available
here:

        http://bohnsack.com/3DTorus-heterogeneous-Z.svg
        http://bohnsack.com/3DTorus-heterogeneous-Z.pdf

Thanks in advance for your help,

-Matthew