[openib-general] Announce: Advanced Diagnostic Tools

Eitan Zahavi eitan at mellanox.co.il
Wed Jan 11 14:20:51 PST 2006


Hi,

With great help from Danny Zarko and Ariel Libman, I was able to upload to
https://openib.org/svn/gen2/utils/src/linux-user the first of several integrated IB
diagnostic tools: ibdiagnet (diagnose network).
The tool depends on ibis and ibdm (both available in the same directory).
Its main differences from the diag tools available under the trunk are:
1. Performs a complete diagnostic procedure, including:
    * discovery,
    * PM counters check,
    * duplicate LID/GUID check,
    * ALL to ALL connectivity check (based on LFT data extracted from the fabric)
    * Multicast connectivity and report
    * Credit loop analysis
    * and various other fabric statistics
2. If a topology file is provided, all reports are given using system names (rather than
    LIDs, GUIDs or directed paths).

###############################################################################################
Here are some stdout examples
------------------------------
1. BAD LIDS
-E- Device(s) with LID = 0x0000 found in the fabric:
     path="1 1 3 5" H-12/U1 PN=2
     path="1 1 3 4" H-11/U1 PN=1
     path="1 4" H-3/U1 PN=1

2. DUPLICATED PORT GUIDS
-E- Devices with identical PortGUID = 0x0002c90000000006 found in the fabric:
     path="1 1" GNU1/main/U2
     path="1 1 5 6" H-9/U1 PN=1
     path="1 1 5 5" H-10/U1 PN=2
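
The duplicate-GUID report above amounts to grouping the discovered ports by GUID
and flagging any GUID that is reached through more than one directed path. Here is
a minimal Python sketch of that idea. The find_duplicates helper and the record
layout are illustrative assumptions, not ibdiagnet's actual code or data format.

```python
from collections import defaultdict


def find_duplicates(ports):
    """Group (guid, path, name) records by GUID and keep GUIDs seen more than once."""
    by_guid = defaultdict(list)
    for guid, path, name in ports:
        by_guid[guid].append((path, name))
    return {g: entries for g, entries in by_guid.items() if len(entries) > 1}


# Hypothetical discovery records modelled on the sample output above.
# The third record (GUID ending in 07) is made up to show a non-duplicate.
ports = [
    (0x0002C90000000006, "1 1", "GNU1/main/U2"),
    (0x0002C90000000006, "1 1 5 6", "H-9/U1 PN=1"),
    (0x0002C90000000007, "1 4", "H-3/U1 PN=1"),
]

for guid, entries in find_duplicates(ports).items():
    print(f"-E- Devices with identical PortGUID = 0x{guid:016x} found in the fabric:")
    for path, name in entries:
        print(f'     path="{path}" {name}')
```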

3. BAD LINKS
-I- Errors have occurred on the following links (for errors details, look in log
     file /tmp/ibmgtsim.31602/ibdiagnet.log):
     Cable:         GNU1/M/P7(GNU1/main/U4/P4) =---= H-7/P2(H-7/U1/P2)
     Cable:         GNU1/M/P5(GNU1/main/U4/P6) =---= H-5/P2(H-5/U1/P2)

4. TOPOLOGY MATCH
-I- Note that "bad" links and the part of the fabric to which they led (in the
     BFS discovery of the fabric, starting at the local node) are not discovered
     and therefore will be reported as "missing".

   Missing System:H-7(Cougar)
      Should be connected by cable from port: P2(H-7/U1/P2)
      to:GNU1/M/P7(GNU1/main/U4/P4)

   Missing System:H-5(Cougar)
      Should be connected by cable from port: P2(H-5/U1/P2)
      to:GNU1/M/P5(GNU1/main/U4/P6)

5. MULTICAST ROUTING
-I- Scanning all multicast groups for loops and connectivity...
-I- Multicast Group:0xC000 has:2 switches and:2 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC000
-E- Extra switch:GNU1/main/U4 in group:0xC000
-I- Multicast Group:0xC001 has:4 switches and:4 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC001
-I- Multicast Group:0xC002 has:5 switches and:5 HCAs
-E- 3 multicast group checks failed

-I---------------------------------------------------
-I- mgid-mlid-HCAs matching table
-I---------------------------------------------------
mgid                                  | mlid   | HCAs
--------------------------------------------------------------------------------
0xff12401bffff0000:0x00000000ffffffff | 0xc000 | H-11/U1,H-12/U1
0xff12401bffff0000:0x0000000000000001 | 0xc001 | H-15/U1,H-3/U1,H-2/U1,H-7/U1
0xff12401bffff0000:0x0000000000000002 | 0xc002 | H-10/U1,H-16/U1,H-4/U1,H-6/U1

6. UNICAST ROUTING:
-I- Verifying all CA to CA paths ...
-E- Unassigned LFT for lid:10 Dead end at:GNU1/main/U1
-E- Fail to find a path from:H-1/U1/1 to:H-12/U1/2
-E- Unassigned LFT for lid:18 Dead end at:GNU1/main/U3
-E- Fail to find a path from:H-1/U1/1 to:H-5/U1/2
[snip]
-E- Found 19 missing paths out of:240 paths
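
The unicast check can be pictured as following a destination LID through the
switches' linear forwarding tables (LFTs) until the destination is reached or an
unassigned entry is hit. The Python sketch below illustrates this under assumed
data structures: lft maps switch -> {lid: out_port} and links maps
(node, port) -> neighbor. Both layouts, the node names and the trace_path helper
are hypothetical and are not taken from ibdiagnet.

```python
def trace_path(src_switch, dst_lid, dst_node, lft, links, max_hops=64):
    """Follow LFT entries from src_switch toward dst_lid.

    Returns (True, list_of_hops) on success or (False, reason) on failure.
    """
    node = src_switch
    hops = [node]
    for _ in range(max_hops):
        port = lft.get(node, {}).get(dst_lid)
        if port is None:
            return False, f"Unassigned LFT for lid:{dst_lid} Dead end at:{node}"
        nxt = links.get((node, port))
        if nxt is None:
            return False, f"No link on port {port} of {node}"
        node = nxt
        hops.append(node)
        if node == dst_node:
            return True, hops
    return False, f"Hop limit exceeded for lid:{dst_lid} (possible routing loop)"


# Made-up three-switch fabric: SW3 has no LFT entry for LID 18.
lft = {"SW1": {10: 2, 18: 3}, "SW2": {10: 1}}
links = {("SW1", 2): "SW2", ("SW2", 1): "H-A", ("SW1", 3): "SW3"}

print(trace_path("SW1", 10, "H-A", lft, links))
print(trace_path("SW1", 18, "H-B", lft, links))
```

Running the same trace for every ordered pair of CAs and counting the failures
would give a summary comparable to the "missing paths" line above.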

7. CREDIT LOOPS
-I- Tracing all CA to CA paths for Credit Loops potential ...
-E- Potential Credit Loop on Path from:H-1/U1/1 to:H-13/U1/1
   Going:Down from:GNU1/main/U1 to:GNU1/main/U3
   Going:Up from:GNU1/main/U3 to:GNU1/main/U1
   Going:Down from:GNU1/main/U1 to:GNU1/leaf1/U1
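
The report above labels each hop as going Up or Down, which suggests a rank-based
check: in a tree-like fabric, a path that turns Up again after it has gone Down
can contribute to a cyclic buffer dependency. The Python sketch below shows that
idea only. The rank values are invented for illustration (smaller means closer
to the root), and the real analysis in ibdiagnet may work differently.

```python
def hop_directions(path, rank):
    """Label each hop 'Up' (toward a lower rank) or 'Down' (anything else)."""
    return ["Up" if rank[b] < rank[a] else "Down" for a, b in zip(path, path[1:])]


def has_down_up_turn(path, rank):
    """True if some hop goes Up immediately after a hop that went Down."""
    dirs = hop_directions(path, rank)
    return any(d1 == "Down" and d2 == "Up" for d1, d2 in zip(dirs, dirs[1:]))


# Hypothetical ranks chosen so the labels match the sample report above.
rank = {"GNU1/main/U1": 0, "GNU1/main/U3": 1, "GNU1/leaf1/U1": 2}
path = ["GNU1/main/U1", "GNU1/main/U3", "GNU1/main/U1", "GNU1/leaf1/U1"]

for (a, b), d in zip(zip(path, path[1:]), hop_directions(path, rank)):
    print(f"   Going:{d} from:{a} to:{b}")
print("potential credit loop:", has_down_up_turn(path, rank))
```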


NOTE: All the above cases were simulated on top of ibmgtsim.
       The errors were injected by simulation flows.
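
Each report line in the examples above starts with a severity tag (-E- for
errors, -I- for information), so the output is easy to filter in a script. Here
is a small Python sketch that assumes only the two tags seen in this message;
the tool may emit others. The split_by_severity helper is hypothetical.

```python
import re

# A tag such as -E- or -I- at the start of the line, followed by the message.
TAG = re.compile(r"^-([A-Z])-\s*(.*)$")


def split_by_severity(text):
    """Return {tag_letter: [messages]}, skipping pure separator lines."""
    result = {}
    for line in text.splitlines():
        m = TAG.match(line)
        if m and m.group(2).strip("-"):
            result.setdefault(m.group(1), []).append(m.group(2))
    return result


sample = """\
-I- Multicast Group:0xC000 has:2 switches and:2 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC000
-E- Extra switch:GNU1/main/U4 in group:0xC000
-I---------------------------------------------------
"""

found = split_by_severity(sample)
print(len(found.get("E", [])), "error line(s)")
```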
######################################################################################
A full man page:
====================
NAME
   ibdiagnet

SYNOPSIS
   ibdiagnet [-c <count>] [-v] [-r] [-t <topo-file>] [-s <sys-name>]
      [-i <dev-index>] [-p <port-num>] [-o <out-dir>]

DESCRIPTION
   ibdiagnet scans the fabric using directed route packets and extracts all the
   available information regarding its connectivity and devices.
   It then produces the following files in the output directory defined by the
   -o option (see below):
     ibdiagnet.lst    - List of all the nodes, ports and links in the fabric
     ibdiagnet.fdbs   - A dump of the unicast forwarding tables of the fabric
                         switches
     ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric
                         switches
   In addition to generating the files above, the discovery phase also checks for
   duplicate node GUIDs in the IB fabric. If such an error is detected, it is
   displayed on the standard output.
   After the discovery phase is completed, directed route packets are sent
   multiple times (according to the -c option) to detect possible problematic
   paths on which packets may be lost. Such paths are explored, and a report of
   the suspected bad links is displayed on the standard output.
   After scanning the fabric, if the -r option is provided, a full report of the
   fabric qualities is displayed.
   This report includes:
     Number of nodes and systems
     Hop-count information:
          maximal hop-count, an example path, and a hop-count histogram
     All CA-to-CA paths traced
   Note: If the IB fabric includes only one CA, CA-to-CA paths are not
   reported.
   Furthermore, if a topology file is provided, ibdiagnet uses the names defined
   in it for the output reports.

OPTIONS
   -c <count>    : The minimum number of packets to be sent across each link
                   (default = 10)
   -v            : Instructs the tool to run in verbose mode
   -r            : Provides a report of the fabric qualities
   -t <topo-file>: Specifies the topology file name
   -s <sys-name> : Specifies the local system name. Meaningful only if a topology
                   file is specified
   -i <dev-index>: Specifies the index of the device of the port used to connect
                   to the IB fabric (in case of multiple devices on the local
                   system)
   -p <port-num> : Specifies the local device's port number used to connect to
                   the IB fabric
   -o <out-dir>  : Specifies the directory where the output files will be placed
                   (default = /tmp/ez)

   -h|--help     : Prints this help information
   -V|--version  : Prints the version of the tool
      --vars     : Prints the tool's environment variables and their values

ERROR CODES
   1 - Failed to fully discover the fabric
   2 - Failed to parse command line options
   3 - Some packet drop observed
   4 - Mismatch with provided topology
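
A wrapper script can branch on the codes listed above. The Python sketch below
assumes that these codes are returned as the process exit status and that 0
means success; neither is stated explicitly in the man page. It also assumes
that ibdiagnet is on the PATH. The describe and run_ibdiagnet helpers are
hypothetical.

```python
import subprocess

# Messages copied from the ERROR CODES section above.
ERROR_CODES = {
    1: "Failed to fully discover the fabric",
    2: "Failed to parse command line options",
    3: "Some packet drop observed",
    4: "Mismatch with provided topology",
}


def describe(code):
    """Translate an exit status into the man page's wording."""
    if code == 0:
        return "OK"  # assumption: 0 means no error was reported
    return ERROR_CODES.get(code, f"Unknown exit status {code}")


def run_ibdiagnet(extra_args=()):
    """Run ibdiagnet with the given options; return (status, meaning, stdout)."""
    proc = subprocess.run(
        ["ibdiagnet", *extra_args], capture_output=True, text=True
    )
    return proc.returncode, describe(proc.returncode), proc.stdout
```

For example, run_ibdiagnet(["-r", "-c", "10"]) would request the fabric report
with the default packet count and return the interpreted status together with
the captured standard output.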


