[ofa-general] [OpenSM][0/18] - Routing Chaining

Al Chu chu11 at llnl.gov
Mon Sep 15 12:20:48 PDT 2008


Hey Sasha,

As we've discussed before, we wanted to put routing chaining into
opensm.  Here is a patch series to support it.

For others on the list, routing chaining is the ability to configure
the order in which routing algorithms are tried in opensm, e.g.

-R ftree,updn,minhop

tries ftree routing first.  If ftree fails, it tries updn.  If updn
fails, it tries minhop.
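
To make the chain semantics concrete, here is a minimal sketch in C.
It is illustrative only: the names (run_chain, struct engine, the stub
engines) are made up for this example and are not the identifiers used
in the patches, and the stub results are hard-coded assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical engine entry point: 0 on success, nonzero on failure. */
typedef int (*route_fn)(void);

struct engine {
	const char *name;
	route_fn route;
};

/* Stub engines with hard-coded results, for illustration only. */
static int ftree_route(void) { return -1; } /* pretend: not a fat-tree */
static int updn_route(void) { return 0; }
static int minhop_route(void) { return 0; }

static const struct engine engines[] = {
	{ "ftree", ftree_route },
	{ "updn", updn_route },
	{ "minhop", minhop_route },
};

/* Look up an engine by a name of the given length (not NUL-terminated). */
static const struct engine *find_engine(const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(engines) / sizeof(engines[0]); i++)
		if (strlen(engines[i].name) == len &&
		    !strncmp(engines[i].name, name, len))
			return &engines[i];
	return NULL;
}

/*
 * Walk a comma-separated engine list in order and return the name of
 * the first engine that succeeds, or NULL if none does.  Unknown names
 * are simply skipped in this sketch.
 */
const char *run_chain(const char *list)
{
	const char *p = list;

	assert(list != NULL);
	while (*p) {
		size_t len = strcspn(p, ",");
		const struct engine *e = find_engine(p, len);

		if (e && e->route() == 0)
			return e->name;
		p += len;
		if (*p == ',')
			p++;
	}
	return NULL;
}
```

With these stubs, run_chain("ftree,updn,minhop") stops at updn because
the ftree stub reports failure.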

In order to get this done, some rearchitecture of the routing code had
to be done because there is no longer an assumption that only one
routing engine can be specified.  Here's a summary of the overall
rearchitecture.

osm_ucast defaults to minhop - The current code automatically
defaults to minhop if anything in the selected routing engine fails.
Naturally this had to change for routing chaining.  I moved minhop
out of the ucast_mgr code and made it its own routing engine.

osm_ucast assumption on routing failures - Because the current code
defaults to minhop on any failure in the selected routing engine, some
routing engines (most notably "file" routing) intentionally "fail"
when they want to fall back to some portion of minhop behavior.  All
routing behavior had to be moved into the routing engines so that each
engine fully fails or succeeds on its own.

updn routing - currently uses the minhop build_fwd_tables, but
minhop's code assumes that if build_lid_matrices is non-NULL, it is in
"up/dn routing mode" instead of "minhop mode".  That is perfectly fine
when at most one routing engine can be specified, but it needed to be
abstracted out of minhop so that up/dn is independent in its routing
"attempt" in the chain.
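
One way to picture that decoupling, as a rough sketch rather than the
code in the patches: a shared helper takes the mode as an explicit
argument, and each engine has its own thin entry point, so nothing has
to infer the mode from whether a callback pointer is set.  All names
below (enum fwd_mode, shared_build_fwd_tables, the two wrappers) are
hypothetical, and the helper only records the mode instead of building
real tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode selector, passed explicitly by each engine. */
enum fwd_mode { FWD_MINHOP, FWD_UPDN };

/*
 * Shared helper.  A real implementation would build forwarding tables
 * here; this stub only records which mode it was asked to run in.
 */
static int shared_build_fwd_tables(enum fwd_mode mode, int *seen_mode)
{
	assert(seen_mode != NULL);
	*seen_mode = (int)mode;
	return 0;
}

/* minhop's own entry point: always minhop mode. */
int minhop_build_fwd_tables(int *seen_mode)
{
	return shared_build_fwd_tables(FWD_MINHOP, seen_mode);
}

/* up/dn's own entry point: always up/dn mode, independent of minhop. */
int updn_build_fwd_tables(int *seen_mode)
{
	return shared_build_fwd_tables(FWD_UPDN, seen_mode);
}
```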

dor routing "dependency" on ucast_mgr - the is_dor flag was
checked/determined inside the ucast_mgr.  Dor routing had to be "split
out" of the ucast manager so its routing engine is independent of
another routing engine's "attempt" in the chain.

minhop routing assumed to never fail - Currently minhop routing cannot
"fail".  So putting minhop into the middle of a routing chain would
make no sense, because nothing after it would ever be tried.  I assume
this is legacy from when the minhop algorithm did not have options
like "guid_routing_order_file" that could be parsed incorrectly.  So I
made changes to allow minhop to have options passed to it that allow
it to "fail" or "move on no matter what".
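
The two minhop behaviors can be sketched roughly as follows.  This is
an assumption-laden illustration, not the patch code: struct
minhop_opts and its fields are hypothetical, and option parsing is
reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical options for a minhop attempt. */
struct minhop_opts {
	bool options_parsed_ok; /* e.g. did an options file parse cleanly? */
	bool strict;            /* true: may "fail"; false: move on no matter what */
};

/* Returns 0 on success, nonzero if this attempt should count as failed. */
int minhop_attempt(const struct minhop_opts *o)
{
	assert(o != NULL);
	if (!o->options_parsed_ok && o->strict)
		return -1; /* let the next engine in the chain try */
	/* Non-strict, or options were fine: route anyway (stubbed out). */
	return 0;
}
```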

Subsequently, if all routing chaining inputs from the user fail, a
bare bones "move on no matter what" minhop is executed.  If no routing
algorithm is specified, we still use minhop by default.
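
A minimal sketch of that last-resort behavior, again with made-up names
(route_or_fallback, bare_minhop) and stub engines rather than anything
from the patches: the chain is tried in order, and if it is empty or
every entry fails, a minhop that cannot fail runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical engine entry point: 0 on success, nonzero on failure. */
typedef int (*route_fn)(void);

/* Stub engines for illustration. */
int always_fail(void) { return -1; }
int always_ok(void) { return 0; }

/* Stand-in for a bare-bones minhop that moves on no matter what. */
static int bare_minhop(void) { return 0; }

/*
 * Try each engine in order.  Returns the index of the engine that
 * succeeded, or n if the bare-bones minhop fallback had to be used
 * (which includes the case where no engine was specified, n == 0).
 */
size_t route_or_fallback(const route_fn *chain, size_t n)
{
	size_t i;

	assert(chain != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (chain[i]() == 0)
			return i;
	(void)bare_minhop();
	return n;
}
```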

So, lots of rearchitecture was done and lots of cleanup was done as
well.  Some bug fixes along the way too.

Naturally, there may be some style differences and some code
inefficiencies I just don't see right now.  I may have missed
something in the routing rearchitecture in part 2.  But at the core,
it seems to work :-)  I've currently only tested against ibsim, not a
real cluster.  Hope to do that later on.

Please let me know what you think.

Al

-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



