[ofa-general] Questions Concerning a 3D Torus

Fri Oct 17 04:37:49 PDT 2008

Hi Matthew,

I have tried to give some answers/comments on your points below. I hope
they are somewhat useful.

Regards,
Sven-Arne

On Wed, 2008-10-15 at 16:36 -0600, Matthew Bohnsack wrote:
> Hello,
> 
> I have a number of questions related to the construction and operation
> of a 3D torus with Infiniband.
> 
> We would like to create an IB network arranged as a 3D mesh with
> wrap-around links.  That is, a 3D torus.  Each vertex of this torus
> would be an Infiniscale IV (I4), having single 4x connections to up to
> 12 host ConnectX-based HCAs and 3-4x connections to its neighbors in
> each of six directions.  The smallest network that illustrates the most
> interesting aspects of this setup is a 3x3x3 torus.  I've created a
> basic illustration of this here.  Pick your favorite file format:
> 
>         http://bohnsack.com/3DTorus.svg
>         http://bohnsack.com/3DTorus.pdf
> 
> In this diagram, each square is a single I4, and each line represents 3
> independent 4x connections.  To avoid making the diagram too
> complicated, host connections are only shown for a single switch.  You
> should imagine 12 hosts hanging off of each square.  Again, to avoid an
> unreadable diagram, not all of the Y dimension wrap-around links are
> shown, but you should consider them present for the purposes of the
> network I'm describing.
> 
> Questions:
> 
> 1) What do you call the topology I'm describing, strictly speaking?
> It's kind of like each switch chip vertex has a sub-graph connected to
> all the host HCAs.  Perhaps this thing is a "decorated 3D Torus"?

I would call it a 3d torus (or a 3-ary 3-cube) ignoring the fact that
there is more than one endnode connected to each switch. Adding more
nodes impacts performance, but does not change the routing properties of
the topology.
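
To make the terminology concrete, here is a minimal Python sketch that
builds the switch-level graph of a k-ary n-cube and counts what the
3x3x3 case needs. The function names are mine (purely illustrative), and
the only inputs taken from your description are 12 hosts per switch and
three 4x links per neighbor.

```python
from itertools import product

HOSTS_PER_SWITCH = 12    # from the description in the question
LINKS_PER_NEIGHBOR = 3   # three independent 4x links to each neighbor


def torus_neighbors(node, dims):
    """Neighbors of `node` in a torus (k-ary n-cube) with side lengths `dims`."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size  # modulo gives the wrap-around
            neighbors.add(tuple(coord))
    neighbors.discard(node)  # only matters for a degenerate side length of 1
    return neighbors


def build_torus(dims):
    """Map every switch coordinate to the set of its neighbor coordinates."""
    nodes = product(*(range(size) for size in dims))
    return {node: torus_neighbors(node, dims) for node in nodes}


torus = build_torus((3, 3, 3))
switch_count = len(torus)
neighbor_pairs = sum(len(n) for n in torus.values()) // 2
ports_per_switch = HOSTS_PER_SWITCH + LINKS_PER_NEIGHBOR * len(torus[(0, 0, 0)])

print(switch_count, "switches,", switch_count * HOSTS_PER_SWITCH, "hosts")
print(neighbor_pairs, "neighbor pairs,", neighbor_pairs * LINKS_PER_NEIGHBOR, "4x cables")
print(ports_per_switch, "ports used per switch")
```

For the 3x3x3 case this gives 27 switches, 324 hosts, 81 neighbor pairs
(243 switch-to-switch cables) and 30 ports used per switch. The hosts
never appear in the graph itself, which is the sense in which they do
not change the routing properties.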

> 
> 2) I think this network is a little bit different than the 3D tori that
> have been previously deployed in machines like Red Storm where there is
> a network switch for one or at maximum a very few compute clients.  Does
> the fact that there are order 10 times more hosts hanging off of each
> switch chip vertex in the network I'm describing matter from a routing
> perspective?  It seems that the routing problem is mostly the same.
> I.e., algorithms to determine a set of deadlock-free routes on the same
> basic topology, ignoring the "decorations", are similar.  Is this right?
> 

Yes. As long as you are able to set up deadlock-free routes between the
switches, it is trivial to add routes to the endnodes connected to the
switches. The number of endnodes does not matter from a routing
perspective (apart from balancing or over/under-subscription).

> 3) Is there good current support for computing deadlock-free routes on
> the network I'm describing in OFED 1.3.1, 1.4, or other?  With which
> routing algorithm?  I tried to find an answer to this, by looking through
> various OFED documentation, but I still have a bit of confusion on how
> to proceed.  Here's the data that I was able to gather.  Can someone
> please help to clarify?

As far as I know LASH would be your best option.

> 
> - An OpenSM wiki page says that OpenSM supports "Torus routing":
> https://wiki.openfabrics.org/tiki-index.php?page=OpenSM&highlight=opensm
> - However, the latest release notes I can find don't make any explicit
> mention of tori:
> http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/opensm_release_notes-3.2.txt;hb=HEAD
> - The release notes do mention Dimension Order routing (DOR), and this
> might work for a 3D torus, but it seems (per the notes) that this
> algorithm, as implemented, is only considered deadlock-free for meshes
> (no wrap-around) and hypercubes - no tori.  I understand that when you
> "wrap around" the dimension, the virtual channel used needs to change to
> avoid deadlock in DOR, and DOR as implemented today doesn't do that.

I believe that is correct. You would need e-cube routing and 2 virtual
lanes for deadlock free shortest path routing on a torus. e-cube routing
is hard to implement in IB because you would have to do SL2VL trickery
to be able to change virtual lanes (afaik). Moreover, e-cube routing is
not robust against faults so a single fault would make large parts of
the network unreachable.
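
The wrap-around problem can be illustrated with a small Python sketch.
It routes every switch pair with minimal dimension order routing, builds
the channel dependency graph (one vertex per directed link and VL, one
edge whenever a route holds one channel while requesting the next), and
looks for a cycle. I am assuming the usual "dateline" variant here: a
packet starts each dimension on VL 0 and moves to VL 1 once it crosses
the wrap-around link of that dimension. The per-hop VL is modelled
abstractly; this says nothing about how the SL2VL tables would have to
be programmed on real hardware.

```python
from itertools import product


def dor_route(src, dst, dims, use_dateline):
    """Minimal dimension order route as a list of (from, to, vl) channels."""
    hops, cur = [], list(src)
    for axis, size in enumerate(dims):
        delta = (dst[axis] - cur[axis]) % size
        if delta == 0:
            continue
        # Take the shorter way around the ring; ties go in the + direction.
        step, count = (1, delta) if delta <= size - delta else (-1, size - delta)
        vl = 0  # every dimension starts on VL 0
        for _ in range(count):
            nxt = list(cur)
            nxt[axis] = (cur[axis] + step) % size
            wraps = abs(nxt[axis] - cur[axis]) != 1  # valid for side lengths >= 3
            if use_dateline and wraps:
                vl = 1  # switch lanes when crossing the dateline
            hops.append((tuple(cur), tuple(nxt), vl))
            cur = nxt
    return hops


def dependency_graph(dims, use_dateline):
    """Channel dependency graph over all source/destination switch pairs."""
    deps = {}
    nodes = list(product(*(range(size) for size in dims)))
    for src in nodes:
        for dst in nodes:
            hops = dor_route(src, dst, dims, use_dateline)
            for held, wanted in zip(hops, hops[1:]):
                deps.setdefault(held, set()).add(wanted)
    return deps


def has_cycle(deps):
    """Kahn's algorithm: a cycle exists if not every vertex can be removed."""
    nodes = set(deps) | {c for succ in deps.values() for c in succ}
    indegree = dict.fromkeys(nodes, 0)
    for succ in deps.values():
        for c in succ:
            indegree[c] += 1
    ready = [c for c in nodes if indegree[c] == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for nxt in deps.get(c, ()):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed != len(nodes)


single_vl_cyclic = has_cycle(dependency_graph((5, 5), use_dateline=False))
dateline_cyclic = has_cycle(dependency_graph((5, 5), use_dateline=True))
print("1 VL, cyclic dependencies:", single_vl_cyclic)
print("2 VLs with dateline, cyclic dependencies:", dateline_cyclic)
```

I used a 5x5 torus on purpose. In a 3x3x3 torus no minimal route takes
two hops in the same dimension, so that size is too small to show the
cycle in this model; it appears as soon as the rings are long enough
for a route to cover two links of the same ring.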

> - Commit b204932d5bd2a88af5ce0989d2dff65d753b3d54 from
> git://git.openfabrics.org/~sashak/management.git in March of 2007
> mentions some degree of success with LASH on 2D tori, but it's
> considered "unoptimized".  Would this work deadlock-free for a 3D torus?
> What's the implication of "unoptimized" on something like an 8x8x8 torus
> with lots of hosts at each vertex?
> - I didn't see any other mention of a torus or tori in the OpenSM commit
> logs.

LASH is deadlock free on a 3d torus or any other topology, but the
number of virtual lanes required depends on the topology. If
the cabling and port numbering are consistent (0=W, 1=E, 2=N, 3=S), LASH
requires 1 VL for 2d meshes of any size (equivalent to dimension order
routing) and 3 VLs for 2d tori of any size. For 3d tori it requires 5
VLs. So far, I only have experience with LASH from simulations, so there might
be other issues on real hardware. Maybe other users on the list have
some experience with LASH on real hardware?
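
The layering idea can be sketched in a few lines of Python. This is not
the OpenSM code, only a greedy illustration under my own assumptions:
each switch pair gets one BFS shortest path, and the path is placed in
the first layer (VL) whose channel dependency graph stays acyclic, with
a new layer opened when none fits. All names are hypothetical, and the
number of layers this sketch ends up with depends on path selection and
ordering, so it need not match the figures quoted above.

```python
from collections import deque
from itertools import product


def torus_adjacency(dims):
    """Sorted adjacency lists of a torus with side lengths `dims`."""
    adj = {}
    for node in product(*(range(size) for size in dims)):
        found = set()
        for axis, size in enumerate(dims):
            for step in (-1, 1):
                coord = list(node)
                coord[axis] = (coord[axis] + step) % size
                found.add(tuple(coord))
        found.discard(node)
        adj[node] = sorted(found)
    return adj


def shortest_path(adj, src, dst):
    """One BFS shortest path from src to dst as a list of switches."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def reaches(graph, start, target):
    """True if `target` can be reached from `start` in the dependency graph."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def try_add(layer, channels):
    """Add a path's dependencies to `layer`; undo and refuse if a cycle would form."""
    added = []
    for held, wanted in zip(channels, channels[1:]):
        if wanted in layer.get(held, ()):
            continue
        if reaches(layer, wanted, held):  # the new edge would close a cycle
            for a, b in added:
                layer[a].discard(b)
            return False
        layer.setdefault(held, set()).add(wanted)
        added.append((held, wanted))
    return True


def assign_layers(adj):
    """Greedily place every switch pair's path in the first layer that stays acyclic."""
    layers, layer_of = [], {}
    for src in sorted(adj):
        for dst in sorted(adj):
            if src == dst:
                continue
            path = shortest_path(adj, src, dst)
            channels = list(zip(path, path[1:]))  # directed links along the path
            for index, layer in enumerate(layers):
                if try_add(layer, channels):
                    break
            else:
                layers.append({})
                try_add(layers[-1], channels)  # a lone path is always acyclic
                index = len(layers) - 1
            layer_of[(src, dst)] = index
    return layers, layer_of


layers, layer_of = assign_layers(torus_adjacency((3, 3, 3)))
print("switch pairs routed:", len(layer_of))
print("layers (VLs) used by this sketch:", len(layers))
```

Because a path is only admitted to a layer that remains acyclic, every
layer is deadlock free on its own, and packets never change layer in
flight. That is the property that makes the approach independent of the
topology.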

> 8) Is there an existing utility inside of OFED that can be used to
> verify that routes generated by the SM are deadlock free?  I.e., can I
> dump routes from OpenSM and then run a utility on them that can identify
> potentials for deadlock?

I think you can use "ibdiagnet" for some routing algorithms, but not for
LASH since it uses layers to avoid deadlock and this confuses
"ibdiagnet".
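
If you end up writing your own check, the core of it is small. The
Python sketch below assumes the routes have already been converted into
a hypothetical table `lft[switch][destination] = next switch` (this is
not the OpenSM dump format, which would need a parser of its own). It
walks every route, records which channel is held while the next one is
requested, and reports one cycle of channels if it finds any.

```python
def build_dependencies(lft):
    """Channel dependency graph from per-switch next-hop tables."""
    deps = {}
    for src in lft:
        for dst in lft:
            if src == dst:
                continue
            prev, cur, hops = None, src, 0
            while cur != dst:
                nxt = lft[cur][dst]
                channel = (cur, nxt)  # directed link cur -> nxt
                deps.setdefault(channel, set())
                if prev is not None:
                    deps[prev].add(channel)
                prev, cur, hops = channel, nxt, hops + 1
                if hops > len(lft):
                    raise ValueError(f"forwarding loop on route {src} -> {dst}")
    return deps


def find_cycle(deps):
    """Return one cycle of channels as a list, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(deps, WHITE)
    stack = []

    def visit(channel):
        color[channel] = GREY
        stack.append(channel)
        for nxt in deps[channel]:
            if color[nxt] == GREY:  # back edge: the cycle is the tail of the stack
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[channel] = BLACK
        return None

    for channel in deps:
        if color[channel] == WHITE:
            found = visit(channel)
            if found:
                return found
    return None


# Four switches in a ring, every route forwarded clockwise: cyclic.
ring = {s: {d: (s + 1) % 4 for d in range(4) if d != s} for s in range(4)}
# Four switches in a line, routed towards the destination: acyclic.
line = {s: {d: s + 1 if d > s else s - 1 for d in range(4) if d != s} for s in range(4)}

ring_cycle = find_cycle(build_dependencies(ring))
line_cycle = find_cycle(build_dependencies(line))
print("ring:", ring_cycle)
print("line:", line_cycle)
```

For layered routing such as LASH, a check like this would have to be run
once per layer, using only the routes assigned to that layer. As far as
I understand, a tool that merges all routes into one graph would report
cycles that cannot actually occur.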

> 
> 9) I'm aware of ibsim, and others I'm collaborating with have done the
> first level of testing with it on 3D tori.  My question is how much more
> simulation can I do on a "mock" topology with freely-available tools?
> Am I limited to MAD traffic, or is there a way to simulate a real
> workload (perhaps MPI traffic)?  How about commercial tools?

There is IB packet-level simulation support in OMNeT++; see
http://www.omnetpp.org/. I have never tried it myself, though.

> 
> 10) As I previously mentioned, there are 3 4x links running between each
> switch chip, along each dimension.  Our plan is to run these as 3
> independent links, but it should be possible to logically aggregate them
> into a single 12x link.  At first glance, a 12x link might be subject to
> less congestion but it could also incur store-and-forward penalties
> producing unwanted latency as 4x flows are converted to 12x and then
> back to 4x again.  My questions around this topic: Is any additional
> insight into this issue available (theoretical or empirical)?  Should we
> even worry about testing 12x connections?

I do not know of any such work. There might be a benefit to running with
3 4x links from a fault-tolerance perspective, but I don't know for sure.

> 
> 11) An artifact that results from the way we intend to connect the I4
> switches together is that there's a possibility of having 9-4x
> connections between every other link in one dimension.  E.g., looking in
> the Z dimension, switch-to-switch "widths" could alternate between
> 9-4x and 3-4x links.  Links in other dimensions would all be limited to
> 3-4x as before.  The end result of this configuration would be that
> certain groups of two I4s and their connected hosts along one dimension
> would have a little bit better connectivity than other I4 groupings.
> This kind of configuration in our setup would be "free" by simply doing
> some logical configuration.  This change may or may not be beneficial
> depending on our applications and workloads.  I'm not so concerned about
> that issue at present.  My main question today is whether this kind of
> "heterogeneity" would cause routing issues/complications.  Would it?  If
> not, it might be a no-brainer to enable it.  What do you think?

I believe this should be possible without complications, but it is a
special case that the routing algorithm would have to be made aware of.

> 
> A diagram of the heterogeneous 3D torus I'm talking about is available
> here:
> 
>         http://bohnsack.com/3DTorus-heterogeneous-Z.svg
>         http://bohnsack.com/3DTorus-heterogeneous-Z.pdf
>         
> 
> Thanks in advance for your help,
> 
> -Matthew
> 