[openib-general] Announce: Advanced Diagnostic Tools
Eitan Zahavi
eitan at mellanox.co.il
Wed Jan 11 14:20:51 PST 2006
Hi,
With great help from Danny Zarko and Ariel Libman, I have uploaded to
https://openib.org/svn/gen2/utils/src/linux-user the first of several integrated IB
diagnostic tools: ibdiagnet (diagnose network).
The tool depends on ibis and ibdm (both available in the same directory).
Its main differences from the diagnostic tools available under the trunk are:
1. Performs a complete diagnostic procedure, including:
* discovery
* PM counters check
* duplicate LID/GUID check
* ALL-to-ALL connectivity check (based on LFT data extracted from the fabric)
* multicast connectivity check and report
* credit loop analysis
* various other fabric statistics
2. If a topology file is provided, all reports use system names (rather than
LIDs, GUIDs or directed paths).
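The ALL-to-ALL connectivity check in item 1 can be pictured as following each switch's linear forwarding table (LFT) from a source CA until the destination LID is reached. The Python sketch below is only an illustration under assumed inputs: the dictionaries `links`, `lfts` and `lid_of`, and the functions `trace_path` and `check_all_to_all`, are hypothetical and are not the ibdm or ibis API. A missing LFT entry is reported as a dead end, in the spirit of the "Unassigned LFT ... Dead end at:" messages in the stdout examples below.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical data model (not the ibdm API):
#   links[(node, port)] -> (peer_node, peer_port), stored in both directions
#   lfts[switch][dlid]  -> output port; a missing entry means "unassigned"
#   lid_of[ca]          -> LID of the CA port used in this sketch
Port = Tuple[str, int]


def trace_path(src_ca: str, src_port: int, dlid: int,
               links: Dict[Port, Port],
               lfts: Dict[str, Dict[int, int]],
               lid_of: Dict[str, int],
               max_hops: int = 64) -> Tuple[Optional[List[str]], str]:
    """Follow switch LFT entries from src_ca toward dlid.

    Returns (nodes visited, "ok") on success, or (None, error message).
    """
    if (src_ca, src_port) not in links:
        return None, f"{src_ca} port {src_port} is not connected"
    visited = [src_ca]
    node, _ = links[(src_ca, src_port)]
    for _ in range(max_hops):
        visited.append(node)
        if node not in lfts:  # not a switch, so the path ends here
            if lid_of.get(node) == dlid:
                return visited, "ok"
            return None, f"Reached wrong CA {node} for lid:{dlid}"
        out_port = lfts[node].get(dlid)
        if out_port is None:
            return None, f"Unassigned LFT for lid:{dlid} Dead end at:{node}"
        if (node, out_port) not in links:
            return None, f"LFT of {node} points to unconnected port {out_port}"
        node, _ = links[(node, out_port)]
    return None, f"Exceeded {max_hops} hops toward lid:{dlid} (possible loop)"


def check_all_to_all(ca_ports: Dict[str, int],
                     links: Dict[Port, Port],
                     lfts: Dict[str, Dict[int, int]],
                     lid_of: Dict[str, int]):
    """Trace every ordered pair of distinct CAs; return (failures, total)."""
    failures, total = [], 0
    for src, port in ca_ports.items():
        for dst, dlid in lid_of.items():
            if dst == src:
                continue
            total += 1
            path, msg = trace_path(src, port, dlid, links, lfts, lid_of)
            if path is None:
                failures.append((src, dst, msg))
    return failures, total
```

Counting ordered pairs of distinct CAs is itself an assumption. It would be consistent with the "240 paths" in the UNICAST ROUTING example (16 * 15 = 240 for 16 CAs), but the post does not say how ibdiagnet counts paths.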
###############################################################################################
Here are some examples of the tool's standard output
------------------------------
1. BAD LIDS
-E- Device(s) with LID = 0x0000 found in the fabric:
path="1 1 3 5" H-12/U1 PN=2
path="1 1 3 4" H-11/U1 PN=1
path="1 4" H-3/U1 PN=1
2. DUPLICATED PORT GUIDS
-E- Devices with identical PortGUID = 0x0002c90000000006 found in the fabric:
path="1 1" GNU1/main/U2
path="1 1 5 6" H-9/U1 PN=1
path="1 1 5 5" H-10/U1 PN=2
3. BAD LINKS
-I- Errors have occurred on the following links (for errors details, look in log
file /tmp/ibmgtsim.31602/ibdiagnet.log):
Cable: GNU1/M/P7(GNU1/main/U4/P4) =---= H-7/P2(H-7/U1/P2)
Cable: GNU1/M/P5(GNU1/main/U4/P6) =---= H-5/P2(H-5/U1/P2)
4. TOPOLOGY MATCH
-I- Note that "bad" links and the part of the fabric to which they led (in the
BFS discovery of the fabric, starting at the local node) are not discovered
and therefore will be reported as "missing".
Missing System:H-7(Cougar)
Should be connected by cable from port: P2(H-7/U1/P2)
to:GNU1/M/P7(GNU1/main/U4/P4)
Missing System:H-5(Cougar)
Should be connected by cable from port: P2(H-5/U1/P2)
to:GNU1/M/P5(GNU1/main/U4/P6)
5. MULTICAST ROUTING
-I- Scanning all multicast groups for loops and connectivity...
-I- Multicast Group:0xC000 has:2 switches and:2 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC000
-E- Extra switch:GNU1/main/U4 in group:0xC000
-I- Multicast Group:0xC001 has:4 switches and:4 HCAs
-E- Extra switch:GNU1/leaf1/U1 in group:0xC001
-I- Multicast Group:0xC002 has:5 switches and:5 HCAs
-E- 3 multicast group checks failed
-I---------------------------------------------------
-I- mgid-mlid-HCAs matching table
-I---------------------------------------------------
mgid | mlid | HCAs
--------------------------------------------------------------------------------
0xff12401bffff0000:0x00000000ffffffff | 0xc000 | H-11/U1,H-12/U1
0xff12401bffff0000:0x0000000000000001 | 0xc001 | H-15/U1,H-3/U1,H-2/U1,H-7/U1
0xff12401bffff0000:0x0000000000000002 | 0xc002 | H-10/U1,H-16/U1,H-4/U1,H-6/U1
6. UNICAST ROUTING
-I- Verifying all CA to CA paths ...
-E- Unassigned LFT for lid:10 Dead end at:GNU1/main/U1
-E- Fail to find a path from:H-1/U1/1 to:H-12/U1/2
-E- Unassigned LFT for lid:18 Dead end at:GNU1/main/U3
-E- Fail to find a path from:H-1/U1/1 to:H-5/U1/2
[snip]
-E- Found 19 missing paths out of:240 paths
7. CREDIT LOOPS
-I- Tracing all CA to CA paths for Credit Loops potential ...
-E- Potential Credit Loop on Path from:H-1/U1/1 to:H-13/U1/1
Going:Down from:GNU1/main/U1 to:GNU1/main/U3
Going:Up from:GNU1/main/U3 to:GNU1/main/U1
Going:Down from:GNU1/main/U1 to:GNU1/leaf1/U1
NOTE: All the above cases were simulated on top of ibmgtsim.
The errors were injected by simulation flows.
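The CREDIT LOOPS output describes each suspicious path as a sequence of Up and Down hops between switches. One common way to obtain such labels is the up*/down* rule: rank every switch by its distance from a chosen root, call a hop toward a lower rank "Up", and treat any path that goes Up after it has already gone Down as a potential credit loop (such a path is not guaranteed to deadlock, but it loses the deadlock-freedom guarantee of up*/down* routing). The sketch below assumes that rule and BFS-based ranking; the post does not say how ibdiagnet ranks switches or picks roots, and `rank_switches`, `classify_hops` and `violates_up_down` are hypothetical names.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def rank_switches(adj: Dict[str, Set[str]], roots: Iterable[str]) -> Dict[str, int]:
    """BFS distance of every switch from the chosen root switches (roots have rank 0)."""
    rank = {r: 0 for r in roots}
    queue = deque(rank)
    while queue:
        sw = queue.popleft()
        for nbr in adj.get(sw, ()):
            if nbr not in rank:
                rank[nbr] = rank[sw] + 1
                queue.append(nbr)
    return rank


def classify_hops(path: List[str], rank: Dict[str, int]) -> List[Tuple[str, str, str]]:
    """Label each switch-to-switch hop as Up (toward a root), Down, or Side."""
    hops = []
    for a, b in zip(path, path[1:]):
        if rank[b] < rank[a]:
            hops.append(("Up", a, b))
        elif rank[b] > rank[a]:
            hops.append(("Down", a, b))
        else:
            # Same-rank links: real up*/down* implementations orient these by a
            # tie-breaker such as the switch ID; this sketch leaves them unoriented.
            hops.append(("Side", a, b))
    return hops


def violates_up_down(path: List[str], rank: Dict[str, int]) -> bool:
    """True if the path turns Up after it has already gone Down."""
    went_down = False
    for direction, _, _ in classify_hops(path, rank):
        if direction == "Down":
            went_down = True
        elif direction == "Up" and went_down:
            return True
    return False
```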
######################################################################################
A full man page:
====================
NAME
ibdiagnet
SYNOPSIS
ibdiagnet [-c <count>] [-v] [-r] [-t <topo-file>] [-s <sys-name>]
[-i <dev-index>] [-p <port-num>] [-o <out-dir>]
DESCRIPTION
ibdiagnet scans the fabric using directed route packets and extracts all the
available information regarding its connectivity and devices.
It then produces the following files in the output directory defined by the
-o option (see below):
ibdiagnet.lst - List of all the nodes, ports and links in the fabric
ibdiagnet.fdbs - A dump of the unicast forwarding tables of the fabric
switches
ibdiagnet.mcfdbs - A dump of the multicast forwarding tables of the fabric
switches
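The directed-route scan can be sketched as a breadth-first search that starts at the local port and records, for every newly reached node, the list of exit ports leading to it (the "path" strings in the stdout examples appear to be such lists, although their exact format is not documented here). The sketch assumes a simulated fabric given as a hypothetical `fabric` dictionary, identifies nodes by name rather than by GUIDs read from SMP responses, and only forwards through switches, because a CA does not relay packets. `discover` is a hypothetical helper, not part of ibis or ibdm.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical simulated fabric: fabric[node][port] -> (peer_node, peer_port)
Fabric = Dict[str, Dict[int, Tuple[str, int]]]


def discover(fabric: Fabric, switches: Set[str], local: str, local_port: int):
    """BFS from the local port; returns (directed-route path per node, link list)."""
    paths: Dict[str, List[int]] = {local: []}
    links = set()
    queue = deque([local])
    while queue:
        node = queue.popleft()
        if node == local:
            ports = [local_port]                    # leave through one local port only
        elif node in switches:
            ports = sorted(fabric.get(node, {}))    # switches forward out of every port
        else:
            continue                                # CAs are end points
        for port in ports:
            if port not in fabric.get(node, {}):
                continue                            # port not cabled in the model
            peer, peer_port = fabric[node][port]
            # Store each link once by sorting its two end points.
            links.add(tuple(sorted([(node, port), (peer, peer_port)])))
            if peer not in paths:
                paths[peer] = paths[node] + [port]
                queue.append(peer)
    return paths, sorted(links)
```

The resulting `paths` and `links` play roughly the role of the node and link lists written to ibdiagnet.lst; the real file format is not reproduced here.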
In addition to generating the files above, the discovery phase also checks for
duplicate node GUIDs in the IB fabric. If such an error is detected, it is
displayed on the standard output.
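Once discovery has produced one record per physical port, the duplicate check reduces to grouping records by GUID. The same grouping applied to LIDs, plus a test for LID 0, covers the BAD LIDS and DUPLICATED PORT GUIDS examples above. This is a sketch only: `PortRecord`, `find_duplicates` and `find_zero_lids` are hypothetical, and the post does not describe how ibdiagnet tells apart two devices that report the same GUID during discovery.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class PortRecord(NamedTuple):
    path: str   # directed route used to reach the port, e.g. "1 1 3 5"
    name: str   # system name from the topology file, if one was given
    guid: int   # port GUID read from the device
    lid: int    # LID assigned to the port (0 means no valid LID)


def find_duplicates(records: List[PortRecord], field: str) -> Dict[int, List[PortRecord]]:
    """Group records by 'guid' or 'lid' and keep only values shared by 2+ ports."""
    groups: Dict[int, List[PortRecord]] = defaultdict(list)
    for rec in records:
        value = getattr(rec, field)
        if value != 0:  # 0 is skipped here; zero LIDs are reported by find_zero_lids
            groups[value].append(rec)
    return {value: recs for value, recs in groups.items() if len(recs) > 1}


def find_zero_lids(records: List[PortRecord]) -> List[PortRecord]:
    """Ports that still have LID 0, as in the BAD LIDS example."""
    return [rec for rec in records if rec.lid == 0]
```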
After the discovery phase is completed, directed route packets are sent
multiple times (according to the -c option) to detect possible problematic
paths on which packets may be lost. Such paths are explored, and a report of
the suspected bad links is displayed on the standard output.
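The repeated-probe step amounts to sending at least `count` packets over each link and flagging links where some of them went unanswered. In the sketch below `send_probe` is a hypothetical callable standing in for a directed-route query that returns True when a response arrives; the real tool probes paths and then narrows losses down to links, which this simplification skips. The default of 10 mirrors the documented default of -c.

```python
from typing import Callable, Dict, Hashable, Iterable


def find_lossy_links(links: Iterable[Hashable],
                     send_probe: Callable[[Hashable], bool],
                     count: int = 10) -> Dict[Hashable, int]:
    """Send `count` probes per link; return {link: number of lost probes}."""
    if count < 1:
        raise ValueError("count must be at least 1")
    lossy = {}
    for link in links:
        lost = sum(1 for _ in range(count) if not send_probe(link))
        if lost:
            lossy[link] = lost
    return lossy
```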
After scanning the fabric, if the -r option is provided, a full report of the
fabric qualities is displayed.
This report includes:
Number of nodes and systems
Hop-count information:
maximal hop-count, an example path, and a hop-count histogram
All CA-to-CA paths traced
Note: If the IB fabric includes only one CA, CA-to-CA paths are not
reported.
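The hop-count part of the -r report can be derived from the traced CA-to-CA paths. The sketch assumes a hypothetical `paths` mapping from (source CA, destination CA) to the list of nodes traversed, and counts hops as links crossed; the post does not define how ibdiagnet counts hops. `hop_report` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def hop_report(paths: Dict[Tuple[str, str], List[str]]) -> Optional[dict]:
    """Summarize CA-to-CA paths; each path lists the nodes traversed, end points included."""
    cas = {ca for pair in paths for ca in pair}
    if len(cas) < 2:
        return None  # a single-CA fabric has no CA-to-CA paths to report
    # Hops are counted here as links crossed (an assumption).
    histogram = Counter(len(nodes) - 1 for nodes in paths.values())
    (src, dst), longest = max(paths.items(), key=lambda item: len(item[1]))
    return {
        "max_hops": len(longest) - 1,
        "example": (src, dst, longest),
        "histogram": dict(sorted(histogram.items())),
    }
```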
Furthermore, if a topology file is provided, ibdiagnet uses the names defined
in it for the output reports.
OPTIONS
-c <count> : The minimal number of packets to be sent across each link
(default = 10)
-v : Instructs the tool to run in verbose mode
-r : Provides a report of the fabric qualities
-t <topo-file>: Specifies the topology file name
-s <sys-name> : Specifies the local system name. Meaningful only if a topology
file is specified
-i <dev-index>: Specifies the index of the device of the port used to connect
to the IB fabric (in case of multiple devices on the local
system)
-p <port-num> : Specifies the local device's port number used to connect to
the IB fabric
-o <out-dir> : Specifies the directory where the output files will be placed
(default = /tmp/ez)
-h|--help : Prints this help information
-V|--version : Prints the version of the tool
--vars : Prints the tool's environment variables and their values
ERROR CODES
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Some packet drops observed
4 - Mismatch with provided topology
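For scripting, the documented options and error codes can be wrapped as follows. Only flags listed in the man page above are used; `build_args`, `describe_exit` and `run_ibdiagnet` are hypothetical helpers, and any return code other than 1-4 is simply reported as not listed, since the man page documents only those four.

```python
import subprocess
from typing import List, Optional

# Error codes as listed in the ibdiagnet man page.
EXIT_CODES = {
    1: "Failed to fully discover the fabric",
    2: "Failed to parse command line options",
    3: "Some packet drops observed",
    4: "Mismatch with provided topology",
}


def build_args(count: Optional[int] = None, verbose: bool = False,
               report: bool = False, topo_file: Optional[str] = None,
               sys_name: Optional[str] = None, dev_index: Optional[int] = None,
               port_num: Optional[int] = None,
               out_dir: Optional[str] = None) -> List[str]:
    """Build an ibdiagnet command line from the documented options."""
    args = ["ibdiagnet"]
    if count is not None:
        args += ["-c", str(count)]
    if verbose:
        args.append("-v")
    if report:
        args.append("-r")
    if topo_file is not None:
        args += ["-t", topo_file]
    if sys_name is not None:  # only meaningful together with -t
        args += ["-s", sys_name]
    if dev_index is not None:
        args += ["-i", str(dev_index)]
    if port_num is not None:
        args += ["-p", str(port_num)]
    if out_dir is not None:
        args += ["-o", out_dir]
    return args


def describe_exit(code: int) -> str:
    """Translate a return code into the man page wording, if it is listed."""
    return EXIT_CODES.get(code, f"return code {code} is not listed in the man page")


def run_ibdiagnet(**options) -> int:
    """Run ibdiagnet (must be on PATH) and return its exit status."""
    completed = subprocess.run(build_args(**options), check=False)
    return completed.returncode
```

For example, `describe_exit(run_ibdiagnet(count=100, report=True))` runs `ibdiagnet -c 100 -r` and turns its exit status into a readable message.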