[ofa-general] [OpenSM] updn routing performance fix???

Thu Feb 28 20:17:19 PST 2008

Hey Sasha,

While doing some other development, I noticed that some switch ports were
not used in routing even though they were up/healthy.  I wrote a script
(will try to submit to infiniband-diags when I clean it up) that analyzes
dump_lfts to see what ports are used in routing.  Here's an output chunk:

Unbalanced Switch Port Usage: MT47396 Infiniscale-III Mellanox Technologies, 0x000b8cffff004662, 40
Port 013: 12
Port 014: 12
Port 015: 12
Port 016: 12
Port 017: 12
Port 018: 12
Port 019: 12
Port 020: 12
Port 021: 12
Port 022: 0
Port 023: 11
Port 024: 11

In the above example, Port 022 is not used for routing at all on this
switch.  Naturally, we think this is bad.
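
A minimal sketch of the kind of per-port counting described above, not the actual script. It assumes the forwarding table of one switch has already been parsed into a dict of destination LID to output port; the parsing of dump_lfts output is left out, and the names port_usage and unused_ports are hypothetical.

```python
from collections import Counter

def port_usage(lft, ports):
    """Count how many destination LIDs are forwarded out of each port.

    lft   -- dict mapping destination LID -> output port (hypothetical,
             already-parsed form of one switch's forwarding table)
    ports -- the ports to report on, e.g. the switch's uplinks
    """
    counts = Counter(port for port in lft.values() if port in ports)
    # Include ports with zero entries; those are the interesting ones.
    return {port: counts[port] for port in sorted(ports)}

def unused_ports(usage):
    """Return the ports that no LID is routed through."""
    return [port for port, n in usage.items() if n == 0]

# Rebuild a table with the same per-port counts as the output above.
# The LID values themselves are made up for illustration.
reported = {13: 12, 14: 12, 15: 12, 16: 12, 17: 12, 18: 12,
            19: 12, 20: 12, 21: 12, 22: 0, 23: 11, 24: 11}
lft = {}
next_lid = 1
for port, n in reported.items():
    for _ in range(n):
        lft[next_lid] = port
        next_lid += 1

usage = port_usage(lft, set(range(13, 25)))
for port, n in usage.items():
    print(f"Port {port:03d}: {n}")
print("unused:", unused_ports(usage))
```

Reporting zero-count ports explicitly matters here: a port that never appears in the table would otherwise be silently absent from the counts.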

After some investigation, I found that when the initial heavy sweep
finishes, some ports on some switches are still down (I assume a hardware
race during bringup), so opensm does not route through those ports.  When
opensm does a heavy resweep later on (I assume because traps are received
when those down ports come up), it keeps the old forwarding tables because
ignore_existing_lfts is FALSE and because the least hop counts are
unchanged (other ports on the switch go to the same parent).  Thus, we end
up with healthy ports that never forward to a parent switch.
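
The effect can be illustrated with a simplified model. This is not OpenSM's code: choose_port and route are hypothetical names, every destination is assumed to be one hop away through any uplink, and the tie-break on the least-loaded port is an assumption made only so the example has something to balance.

```python
def choose_port(existing_port, hops_by_port, load, ignore_existing_lfts):
    """Pick an output port for one destination LID.

    hops_by_port -- {port: hop count to the destination} for ports that are up
    load         -- {port: number of LIDs already routed through that port}
    """
    min_hops = min(hops_by_port.values())
    # Keep the old entry if it is still up and still a least-hops choice.
    if (not ignore_existing_lfts
            and existing_port in hops_by_port
            and hops_by_port[existing_port] == min_hops):
        return existing_port
    # Otherwise take the least-loaded port among the least-hops candidates.
    candidates = [p for p, h in hops_by_port.items() if h == min_hops]
    return min(candidates, key=lambda p: (load.get(p, 0), p))

def route(lids, old_lft, hops_by_port, ignore_existing_lfts):
    """Build a new {lid: port} table from the old one."""
    load = {}
    new_lft = {}
    for lid in lids:
        port = choose_port(old_lft.get(lid), hops_by_port, load,
                           ignore_existing_lfts)
        new_lft[lid] = port
        load[port] = load.get(port, 0) + 1
    return new_lft

lids = range(1, 7)

# Initial heavy sweep: port 22 is still down, so only 21 and 23 are candidates.
first = route(lids, {}, {21: 1, 23: 1}, ignore_existing_lfts=False)

# Resweep after port 22 comes up, keeping existing entries.
kept = route(lids, first, {21: 1, 22: 1, 23: 1}, ignore_existing_lfts=False)

# Same resweep, but ignoring the existing table.
rebuilt = route(lids, first, {21: 1, 22: 1, 23: 1}, ignore_existing_lfts=True)

print("kept:   ", sorted(set(kept.values())))
print("rebuilt:", sorted(set(rebuilt.values())))
```

In this model the resweep that keeps existing entries never touches port 22, because every old entry is still a valid least-hops choice.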

There are multiple ways to deal with this.  I made the attached patch,
which solved the problem on one of our test clusters.  It's pretty simple:
store all of the "bad ports" that were found during a switch
configuration.  During the next heavy resweep, if some of those "bad
ports" are now up, set ignore_existing_lfts to TRUE for just that
switch, which leads to a completely new forwarding table for that switch.
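
A rough sketch of that idea in Python, for illustration only. The attached patch is C code inside opensm and is not reproduced here; SwitchState, needs_full_rebuild, and heavy_sweep are hypothetical names, and port state is reduced to a set of ports that are currently up.

```python
class SwitchState:
    """Per-switch bookkeeping kept between heavy sweeps (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.bad_ports = set()  # ports found down at the last configuration

def needs_full_rebuild(sw, ports_up_now):
    """True if any port recorded as bad last time is up now."""
    return bool(sw.bad_ports & set(ports_up_now))

def heavy_sweep(sw, all_ports, ports_up_now, global_ignore_existing_lfts=False):
    """Return the ignore_existing_lfts value to use for this switch only."""
    ignore = global_ignore_existing_lfts or needs_full_rebuild(sw, ports_up_now)
    # Remember which ports are down now, for comparison on the next sweep.
    sw.bad_ports = set(all_ports) - set(ports_up_now)
    return ignore

sw = SwitchState("leaf-switch")
uplinks = {21, 22, 23, 24}

# Initial sweep: port 22 is down, nothing was recorded before.
first = heavy_sweep(sw, uplinks, {21, 23, 24})
# Resweep: port 22 has come up, so this switch gets a fresh table.
second = heavy_sweep(sw, uplinks, {21, 22, 23, 24})
# Later resweep with no change: existing tables are kept again.
third = heavy_sweep(sw, uplinks, {21, 22, 23, 24})

print(first, second, third)
```

Scoping the decision to one switch keeps the forwarding tables of unaffected switches untouched, so only switches whose ports recovered are rebuilt.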

In my performance testing, a few mpibench tests are actually a few percent
worse with this patch.  I am only using 120 of 144 nodes on this cluster.
It's not a big cluster; it has two levels of switches (24-port switches
going up to a 288-port switch).  Yup, the cluster is not "filled out" yet
:-).  So there is some randomness in which specific nodes run the job and
in whether the LID routing layout is better or worse for that specific set
of nodes.

Intuitively, we think this will be better as a whole even though my
current testing can't show it.  Can you think of anything that would make
this patch worse for performance as a whole?  Could you see some side
effect leading to a lot more traffic on the network?

Al

-- 
Albert Chu
chu11 at llnl.gov
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Do-not-ignore-existing-lfts-when-new-ports-exist.patch
Type: text/x-patch
Size: 6677 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20080228/28ce90f9/attachment.bin>