[openib-general] [ANNOUNCE] Updated OpenIB diagnostics

Eitan Zahavi eitan at mellanox.co.il
Sun Jan 1 02:12:40 PST 2006


Hi Hal,

Do you expect the system administrator to manually fill in the
discover.map with ALL nodes in the fabric, their guids and "names"?
For a large cluster that is a lot of entries.

In the past I proposed using a "system IB connectivity model"* (IBNL)
to provide a similar but superior capability. With an IBNL file describing
each system type (provided by the system vendor, or extracted once for
each system type), the administrator avoids having to fill in the data
(guid and name) for every node in the cluster.

The administrator can select one of two** options:
1. Write a "system-level-topology"*** file that describes the expected
topology by instantiating systems only (not devices). This topology file
is then compared against the discovered topology, and its contents (the
names as well as link width and speed) are used by all diagnostic tools
for reporting errors.
2. Write an "annotation" file (a la discover.map syntax) that includes as
few as one device per system, so that the extracted node-level topology
can be matched against that spec and mapped dynamically.

* IBNL describes the IB connectivity inside a system in a hierarchical
  manner. It enables specifying link width and speed inside the box and
  on the system interface. These properties are automatically propagated
  to the created topology, which enables their validation against the
  extracted topology. The created topology holds both the
  system-to-system connectivity layer and the flattened IB node and link
  layer (the latter is similar to discover.topo). Because IBNL describes
  the systems, a common naming scheme for the devices in each such system
  is provided by the system vendor rather than freely annotated by the
  system administrator, so any error reported (like a bad internal link
  or device) can be easily understood by the vendor too. Furthermore,
  when several devices misbehave, the code can correlate them to a
  specific board in the system and report the problem once for that
  entire board (this is demonstrated today by code under the ibdm tree -
  see below).

** Having a "spec topology" has great advantages over an extracted one.
  Several utilities let you:
  + Analyze your topology, even before a single cable is laid, for
    credit loops, number of hops, asymmetrical routing patterns, etc.
  + Find routing errors that may very well happen on a large cluster due
    to the human process of connecting thousands of cables.
  + Find links that did not come up at the right speed or width due to
    bad cables or bad connections.
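
As a rough illustration of the "spec versus discovered" comparison (a
sketch only, not the ibdm code), the Python below diffs two link lists
written in the discover.topo line format quoted further down
(<LocalPort>|<LocalNodeGUID>|<RemotePort>|<RemoteNodeGUID>). The function
names and the sample data are assumptions made for this example, and the
sketch covers connectivity only, since that format carries no link width
or speed.

```python
def parse_topo(text):
    """Parse discover.topo-style lines into a set of undirected links.

    Each link is a frozenset of two (guid, port) endpoints, so the same
    cable listed from either end compares equal.
    """
    links = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        local_port, local_guid, remote_port, remote_guid = line.split("|")
        a = (local_guid.lower(), int(local_port))
        b = (remote_guid.lower(), int(remote_port))
        links.add(frozenset((a, b)))
    return links


def diff_topo(expected, discovered):
    """Return (missing, unexpected) links relative to the expected set."""
    return expected - discovered, discovered - expected


# Illustrative data only: the second link in each list is made up.
EXPECTED = """\
10|5442ba00003080|1|8f10400410015
2|5442ba00003080|1|8f10403960558
"""

DISCOVERED = """\
1|8f10400410015|10|5442ba00003080
3|5442ba00003080|1|8f10403960558
"""

missing, unexpected = diff_topo(parse_topo(EXPECTED), parse_topo(DISCOVERED))
for link in missing:
    print("missing:", sorted(link))
for link in unexpected:
    print("unexpected:", sorted(link))
```

Here the first link matches even though it is listed from the opposite
end, while the cable plugged into port 3 instead of port 2 shows up as
one missing and one unexpected link.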

*** By "system-level-topology" I mean a file made up of the list of
systems, not the list of IB nodes (embedded within each system). For a
large cluster using 288-port switch systems, the number of elements in
the file is reduced by a factor of 32...

The code that supports option 1 is available under:
https://openib.org/svn/gen2/utils/src/linux-user/ibdm

To support option 2, this code could easily be enhanced with a new
"annotation" algorithm.

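One way such an annotation step might work is sketched below in Python.
This is a hypothetical illustration, not an existing ibdm algorithm: the
system model, the device names (leaf1, leaf2, spine1), the port numbers
and the GUIDs are all invented for the example. Starting from a single
annotated device, it walks the internal links of the system model and
names every discovered node that sits where the model expects it.

```python
from collections import deque

# Hypothetical system model: internal links of one system type,
# as (device_name, port) -> (device_name, port).
TEMPLATE_LINKS = {
    ("leaf1", 13): ("spine1", 1),
    ("leaf2", 13): ("spine1", 2),
}

# Hypothetical discovered fabric: (guid, port) -> (guid, port).
FABRIC_LINKS = {
    ("0xa1", 13): ("0x51", 1),
    ("0xb2", 13): ("0x51", 2),
}


def symmetric(links):
    """Return a copy of the link map that can be looked up from either end."""
    both = dict(links)
    for a, b in links.items():
        both[b] = a
    return both


def propagate_names(template_links, fabric_links, anchor_guid, anchor_name):
    """Name discovered nodes by walking the model from one annotated device."""
    template = symmetric(template_links)
    fabric = symmetric(fabric_links)
    names = {anchor_guid: anchor_name}
    queue = deque([anchor_guid])
    while queue:
        guid = queue.popleft()
        dev = names[guid]
        for (t_dev, t_port), (r_dev, r_port) in template.items():
            if t_dev != dev:
                continue
            remote = fabric.get((guid, t_port))
            if remote is None:
                continue  # the model expects a link that was not discovered
            r_guid, f_port = remote
            if f_port != r_port:
                continue  # remote port differs from the model: do not trust it
            if r_guid not in names:
                names[r_guid] = r_dev
                queue.append(r_guid)
    return names


# A single annotation (GUID 0xa1 is "leaf1") is enough to name the rest.
names = propagate_names(TEMPLATE_LINKS, FABRIC_LINKS, "0xa1", "leaf1")
for guid, name in sorted(names.items()):
    print(guid, name)
```

A real implementation would also have to handle several systems in one
fabric and report the links where the model and the fabric disagree; the
sketch simply skips them.
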

Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Hal Rosenstock
> Sent: Saturday, December 31, 2005 8:07 PM
> To: openib-general at openib.org
> Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics
> 
> Hi,
> 
> The OpenIB diagnostics
> (https://openib.org/svn/gen2/trunk/src/userspace/management/diags)
> have been updated as follows:
> 
> 1. discover.pl diagnostic tool added
> discover.pl uses a topology file created by ibnetdiscover, a discover.map
> file which the network administrator creates and which indicates the
> nodes to be expected, and a discover.topo file which is the expected
> connectivity. It produces a new connectivity file (discover.topo.new)
> and outputs the changes to stdout. The network administrator can choose
> to replace the "old" topo file with the new one or merge certain
> changes in.
> 
> The syntax of the discover.map file is:
> <nodeGUID>|port|"Text for node"|<NodeDescription from ibnetdiscover format>
> e.g.
> 8f10400410015|8|"ISR 6000"|# SW-6IB4 Voltaire port 0 lid 5
> 8f10403960558|2|"HCA 1"|# MT23108 InfiniHost Mellanox Technologies
> 
> The syntax of the old and new topo files (discover.topo and
> discover.topo.new) is:
> <LocalPort>|<LocalNodeGUID>|<RemotePort>|<RemoteNodeGUID>
> e.g.
> 10|5442ba00003080|1|8f10400410015
> 
> These topo files are produced by the discover.pl tool.
> 
> 2. ibportstate diagnostic tool added to query, disable, and enable
> switch ports
> 
> 3. Added error only mode to diagnostic scripts so less data to weed
> through on a large fabric (also verbose mode to see everything)
> 
> 4. Tree structure collapsed so all tools in same directory as opposed
> to individual ones and build simplified
> 
> Let me know about any comments or issues. Thanks.
> 
> -- Hal
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general