[ofa-general] "ibdiagnet -r" and zero systemguids

Craig Prescott prescott at hpc.ufl.edu
Wed Jul 16 17:34:27 PDT 2008


Hi,

When we run 'ibdiagnet -r' on our OFED 1.2 cluster, it
bombs after complaining about a system guid of zero on the
only PCI-X HCA in our fabric (see appended output).
ibdiagnet seems to be trying to strip the leading zeroes
from the system guid, and apparently does not expect to
have nothing left afterwards.
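
To make that guess concrete, here is a minimal sketch of the failure
mode I have in mind, written in Python rather than ibdiagnet's Tcl.
The function names are made up for illustration and are not taken from
the ibdiagnet source. The only assumption is that the stripping is a
plain "drop every leading 0" operation, and I have not confirmed that
the system guid is the value actually being stripped.

```python
def remove_leading_zeros(s: str) -> str:
    """Naive strip (hypothetical): an all-zero string collapses to ''."""
    return s.lstrip("0")


def remove_leading_zeros_safe(s: str) -> str:
    """Guarded variant (hypothetical): keep a single '0' for all-zero input."""
    stripped = s.lstrip("0")
    return stripped if stripped else "0"


def try_add_one(hex_digits: str):
    """Parse hex digits and add 1; return None if the string is unusable.

    This mirrors the arithmetic step that fails on an empty string.
    """
    try:
        return int(hex_digits, 16) + 1
    except ValueError:
        return None


zero_guid = "0000000000000000"
node_guid = "0005ad0000050948"

# The naive strip leaves nothing to do arithmetic on for the zero guid,
# while a normal guid survives with its significant digits intact.
naive_zero = remove_leading_zeros(zero_guid)
naive_node = remove_leading_zeros(node_guid)
safe_zero = remove_leading_zeros_safe(zero_guid)
```

With the naive version, the all-zero string becomes empty and any
arithmetic on it fails; the guarded version leaves "0", which parses
fine.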

Running 'ibdiagnet -r' from an OFED 1.3.1 machine does
not bomb, but I am still concerned and unclear about it.

My questions are: is it ok to have an HCA running
around on your fabric with a system guid of zero?
What if there were more than one?  Is there any way to
assign this HCA a sensible system guid, and would it
be useful?

The HCA in question is a Cougar Cub running the 3.5.0
firmware from Mellanox.  FWIW, the node and port guids
for this HCA look sensible:

[root@submit ~]# tvflash -g
HCA #0
Node  GUID = 0005ad0000050948
Port1 GUID = 0005ad0000050949
Port2 GUID = 0005ad000005094a

If it isn't obvious already, I confess I'm not clear
about how system guids are used.  From what I can gather
from googling around, a system guid of zero for an HCA
means that the HCA vendor simply did not assign one.  I
am under the impression that this is uncommon, but not
unheard of.  Is that correct?

I did some searches through both volumes of the 1.2.1 IB
spec and came up empty, but I could have easily missed any
substantial discussion about system guids.  Any pointers or
enlightenment in this area would be appreciated.

Thanks,
Craig Prescott
UF HPC Center

[root@submit ~]# ibdiagnet -r
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
Loading IBDM from: /usr/lib64/ibdm1.2
-W- Topology file is not specified.
     Reports regarding cluster links will use direct routes.
-I- Using port 1 as the local port.
-I- Discovering the subnet ... 394 nodes (46 Switches & 348 CA-s) discovered.

-I- Parsing Subnet file:/tmp/ibdiagnet.lst
-I- Defined 382/394 systems/nodes

-I---------------------------------------------------
-I- Bad Guids Info
-I---------------------------------------------------
-W- Found Device with SystemGUID=0x0000000000000000:
     a HCA    The Local Device "submit.ufhpc/P1" PortGUID=0x0005ad0000050949 at direct path=""
...
-I---------------------------------------------------
-I- mgid-mlid-HCAs matching table
-I---------------------------------------------------
mgid                                  | mlid   | HCAs
--------------------------------------------------------------------------------


ERROR can't use empty string as operand of "+"
     while executing
"if {([removeLeadingZeros $n] > [removeLeadingZeros $end] + 1)} {
          if {$start == $end} {
             append res "$end,"
          } else {
      ..."
     (procedure "groupNumRanges" line 15)
     invoked from within
"groupNumRanges $NEW_GROUPS($pNs)"
     (procedure "groupingEngine" line 24)
     invoked from within
"groupingEngine $groups"
     (procedure "compressNames" line 12)
     invoked from within
"compressNames $mlidHcas"
     (procedure "reportFabQualities" line 82)
     invoked from within
"reportFabQualities" can't use empty string as operand of "+"