[openib-general] [ANNOUNCE] Updated OpenIB diagnostics

Hal Rosenstock halr at voltaire.com
Tue Jan 3 16:27:33 PST 2006


Hi Eitan,

On Sun, 2006-01-01 at 05:12, Eitan Zahavi wrote:
> Hi Hal,
> 
> Do you expect the system administrator to manually fill in the
> discover.map including ALL nodes in the fabric, their guids and "name"? 

Yes, although tools can assist with extracting the GUIDs, the names need
annotation.

> For a large cluster that number is quite large.
> 
> I have previously proposed using a "system IB connectivity model"* (IBNL)
> to provide a similar but superior capability.

How are the names configured using this approach?

>  Using IBNL describing
> each system type (should be provided by the system vendor - or extracted
> once for each system type) the administrator can avoid the need to fill
> in the data (guid and name) for every node in the cluster. 

The names in the discover tool are not system types. They are more like
system locations (descriptive names), although that is not a requirement.

> The administrator can select one of two** options:
> 1. Write a "system-level-topology"*** file to describe the expected
> topology instantiating systems only (not devices). This topology file is
> then compared versus the discovered topology and used (the names from
> the file as well as link width and speed) by all diagnostic tools for
> reporting errors.

How are board swaps within a system reported?

> 2. Write an "annotation" file (a la discover.map syntax) that includes as
> few as one device per system, so that the extracted node-level topology
> can be matched against that spec and mapped dynamically.

This seems to me like the analog. The difference is that one device gets
a name. That will occur within OpenIB once the system boundary work is
implemented (logical to physical mapping).

> * IBNL describes the IB connectivity inside a system in a
> hierarchical manner.

Yes, it needs to be hierarchical.

>   It enables specifying link width and speed inside the box and on the
>   system interface. These properties are automatically propagated to
>   the created topology, which enables their validation on the extracted
>   topology. The created topology holds both the system-to-system
>   connectivity layer and the flattened IB node and link layer (the
>   latter is similar to discover.topo). Because IBNL describes the
>   systems, a common naming scheme for the devices in each such system
>   is provided by the system vendor and not freely annotated by the
>   system administrator.

Why shouldn't the system admin be allowed to use a name that is friendly
to them? It should point to a device type which is supplied by the vendor.

-- Hal

>   Such that any error reported (like bad internal link or device) can be
>   easily understood by the vendor too. Furthermore, when several devices
>   misbehave - the code can correlate them to a specific board in the
>   system and report the problem once for that entire board (this is
>   demonstrated today by code under the ibdm tree - see below).
> 
> ** Having a "spec topology" has great advantages over an extracted one.
>   Several utilities let you:
>   + Analyze your topology for credit loops, number of hops,
>      asymmetrical routing patterns, etc., even before one cable is
>      laid out.
>   + Find routing errors that may very well happen on a large cluster
>      due to the human process of connecting thousands of cables.
>   + Find links that did not start up at the right speed or width due to
>      bad cables or their connections.
> 
> *** By "system-level-topology" I mean a file that is made of the list of
> systems and not the list of IB nodes (embedded within those systems). For
> a large cluster using 288-port switch systems, the number of elements in
> the file is reduced 32 times...
> 
> The code that supports option 1 is available under:
> https://openib.org/svn/gen2/utils/src/linux-user/ibdm
> 
> To support option 2 this code could be easily enhanced with a new
> "annotation" algorithm.
> 
> 
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
> 
> > -----Original Message-----
> > From: openib-general-bounces at openib.org
> > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
> > Sent: Saturday, December 31, 2005 8:07 PM
> > To: openib-general at openib.org
> > Subject: [openib-general] [ANNOUNCE] Updated OpenIB diagnostics
> > 
> > Hi,
> > 
> > The OpenIB diagnostics
> > (https://openib.org/svn/gen2/trunk/src/userspace/management/diags)
> > have been updated as follows:
> > 
> > 1. discover.pl diagnostic tool added
> > discover.pl uses a topology file created by ibnetdiscover, a
> > discover.map file which the network administrator creates to indicate
> > the nodes to be expected, and a discover.topo file which is the
> > expected connectivity. It produces a new connectivity file
> > (discover.topo.new) and outputs the changes to stdout. The network
> > administrator can choose to replace the "old" topo file with the new
> > one or to merge only certain changes in.
> > 
> > The syntax of the discover.map file is:
> > <nodeGUID>|port|"Text for node"|<NodeDescription from ibnetdiscover format>
> > e.g.
> > 8f10400410015|8|"ISR 6000"|# SW-6IB4 Voltaire port 0 lid 5
> > 8f10403960558|2|"HCA 1"|# MT23108 InfiniHost Mellanox Technologies
> > 
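The discover.map syntax quoted above could be parsed along the following
lines. This is a minimal Python sketch, not code from discover.pl (which is
Perl). MapEntry and parse_map_line are hypothetical names, and the sketch
assumes the GUID is hexadecimal without a 0x prefix, the port field is
decimal, and the quoted text contains no "|" character.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    """One discover.map line (hypothetical structure for illustration)."""
    guid: int          # node GUID, assumed to be hex without a 0x prefix
    port: int          # port field, assumed to be decimal
    name: str          # administrator-supplied text for the node
    description: str   # NodeDescription as printed by ibnetdiscover


def parse_map_line(line: str) -> MapEntry:
    # Split on the first three "|" only, so the description may contain "|".
    guid, port, name, desc = line.rstrip("\n").split("|", 3)
    return MapEntry(
        guid=int(guid, 16),
        port=int(port),
        name=name.strip().strip('"'),
        description=desc.lstrip("#").strip(),
    )


entry = parse_map_line(
    '8f10400410015|8|"ISR 6000"|# SW-6IB4 Voltaire port 0 lid 5'
)
print(entry.name, hex(entry.guid))
```

Storing the GUID as an integer is a design choice of the sketch: it makes
comparisons independent of letter case or leading zeros in the file.
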
> > The syntax of the old and new topo files (discover.topo and
> > discover.topo.new) is:
> > <LocalPort>|<LocalNodeGUID>|<RemotePort>|<RemoteNodeGUID>
> > e.g.
> > 10|5442ba00003080|1|8f10400410015
> > 
> > These topo files are produced by the discover.pl tool.
> > 
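Comparing an old and a new topo file in the format quoted above can be
sketched as a set difference over links. This is an illustration in Python
under stated assumptions, not how discover.pl itself is implemented:
parse_topo and diff_topo are hypothetical names, ports are assumed to be
decimal, and the second link in the example is made up for the demonstration.

```python
def parse_topo(lines):
    """Return a set of (local_port, local_guid, remote_port, remote_guid)."""
    links = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        lport, lguid, rport, rguid = line.split("|")
        # GUIDs are kept as lowercase strings; ports are assumed decimal.
        links.add((int(lport), lguid.lower(), int(rport), rguid.lower()))
    return links


def diff_topo(old, new):
    """Return (added, removed) links as sorted lists."""
    return sorted(new - old), sorted(old - new)


old = parse_topo(["10|5442ba00003080|1|8f10400410015"])
new = parse_topo([
    "10|5442ba00003080|1|8f10400410015",
    "2|8f10400410015|1|8f10403960558",  # made-up link for the example
])
added, removed = diff_topo(old, new)
for link in added:
    print("added:", link)
for link in removed:
    print("removed:", link)
```
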
> > 2. ibportstate diagnostic tool added to query, disable, and enable
> > switch ports
> > 
> > 3. Added error-only mode to diagnostic scripts so there is less data to
> > weed through on a large fabric (also verbose mode to see everything)
> > 
> > 4. Tree structure collapsed so all tools are in the same directory as
> > opposed to individual ones, and build simplified
> > 
> > Let me know about any comments or issues. Thanks.
> > 
> > -- Hal
> > 
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> > 



