[ofa-general] Re: "ibdiagnet -r" and zero systemguids

Oren Kladnitsky orenk at dev.mellanox.co.il
Mon Jul 21 05:52:18 PDT 2008


Craig Prescott wrote:
>
> I forgot to add that other than this
> SystemGUID=0x0000000000000000 issue, the HCA appears
> to work perfectly.
>
> Thanks,
> Craig
>
> Craig Prescott wrote:
>>
>> Hi;
>>
>> When we run 'ibdiagnet -r' on our OFED 1.2 cluster,
>> it bombs with a complaint about a system guid that is
>> zero on our only PCI-X HCA in the fabric (see appended).
>> ibdiagnet seems to be trying to saw off the leading zeroes
>> from the system guid, and to have nothing left afterwards
>> seems unexpected.
>>
>> Running 'ibdiagnet -r' from an OFED 1.3.1 machine does
>> not bomb, but I am still concerned/unclear.
>>
>> My questions are: is it ok to have an HCA running
>> around on your fabric with a system guid of zero?
>> What if there was more than one?  Is there any way to
>> assign this HCA a sensible system guid, and would it
>> be useful?
>>
>> The HCA in question is a Cougar cub running the 3.5.0
>> firmware from Mellanox.  FWIW, the node and port guids
>> for this HCA look sensible:
>>
>> [root@submit ~]# tvflash -g
>> HCA #0
>> Node  GUID = 0005ad0000050948
>> Port1 GUID = 0005ad0000050949
>> Port2 GUID = 0005ad000005094a
>>
>> If it isn't obvious already, I confess I'm not clear
>> about how system guids are used.  From what I can gather
>> from google-ing around, a system guid of zero for an HCA
>> means that the HCA vendor simply did not assign one.  I
>> am under the impression that this is uncommon, but not
>> unheard of.  Is that correct?
>>
>> I did some searches through both volumes of the 1.2.1 IB
>> spec and came up empty, but I could have easily missed any
>> substantial discussion about system guids.  Any pointers or
>> enlightenment in this area would be appreciated.
>>
>> Thanks,
>> Craig Prescott
>> UF HPC Center
>>
>> [root@submit ~]# ibdiagnet -r
>> Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
>> Loading IBDM from: /usr/lib64/ibdm1.2
>> -W- Topology file is not specified.
>>     Reports regarding cluster links will use direct routes.
>> -I- Using port 1 as the local port.
>> -I- Discovering the subnet ... 394 nodes (46 Switches & 348 CA-s) discovered.
>>
>> -I- Parsing Subnet file:/tmp/ibdiagnet.lst
>> -I- Defined 382/394 systems/nodes
>>
>> -I---------------------------------------------------
>> -I- Bad Guids Info
>> -I---------------------------------------------------
>> -W- Found Device with SystemGUID=0x0000000000000000:
>>     a HCA    The Local Device "submit.ufhpc/P1" PortGUID=0x0005ad0000050949 at direct path=""
>> ...
>> -I---------------------------------------------------
>> -I- mgid-mlid-HCAs matching table
>> -I---------------------------------------------------
>> mgid                                  | mlid   | HCAs
>> -------------------------------------------------------------------------------- 
>>
>>
>>
>> ERROR can't use empty string as operand of "+"
>>     while executing
>> "if {([removeLeadingZeros $n] > [removeLeadingZeros $end] + 1)} {
>>          if {$start == $end} {
>>             append res "$end,"
>>          } else {
>>      ..."
>>     (procedure "groupNumRanges" line 15)
>>     invoked from within
>> "groupNumRanges $NEW_GROUPS($pNs)"
>>     (procedure "groupingEngine" line 24)
>>     invoked from within
>> "groupingEngine $groups"
>>     (procedure "compressNames" line 12)
>>     invoked from within
>> "compressNames $mlidHcas"
>>     (procedure "reportFabQualities" line 82)
>>     invoked from within
>> "reportFabQualities" can't use empty string as operand of "+"
>>
>>

Hi.

- ibdiagnet 1.2 crashes when it encounters a zero system image guid ==> As
  you can see, this is fixed in OFED 1.3.

- A system image guid is used to identify nodes that belong to the same
  system.
  For HCAs, it is purely informational. For switches, it assists the SM
  in some advanced routing features.
  Bottom line - no "real" harm to the IB functionality if one or more
  HCAs have a system image guid of 0.

- You can set the system image guid using the mstflint tool (Mellanox
  firmware burning tool). However, if you used tvflash to burn the HCA
  firmware, it is advised to continue using tvflash (which I'm not
  familiar with).
  You can use mstflint to query the device firmware with no risk - run
  "mstflint -d mthca0 q".

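To illustrate the crash in the first point: the Tcl trace suggests that a helper strips leading zeros from a numeric string and that the result is then used in arithmetic. The sketch below is a Python analogue of that failure mode, not the ibdiagnet code; the function names are hypothetical, and the "safe" variant is only one plausible way to guard against it, not necessarily what OFED 1.3 does.

```python
def strip_leading_zeros_naive(digits: str) -> str:
    """Strip every leading '0'. An all-zero input collapses to ''."""
    return digits.lstrip("0")


def strip_leading_zeros_safe(digits: str) -> str:
    """Strip leading zeros but always keep at least one digit."""
    stripped = digits.lstrip("0")
    return stripped if stripped else "0"


all_zero = "0000000000000000"

# The naive helper leaves an empty string, which cannot be used as a
# number (Tcl reports: can't use empty string as operand of "+").
naive = strip_leading_zeros_naive(all_zero)
try:
    int(naive) + 1
except ValueError:
    print("naive result is empty and is not a valid number")

# The guarded helper keeps a single "0", so arithmetic still works.
print(int(strip_leading_zeros_safe(all_zero)) + 1)
```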

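As a rough picture of the second point, a tool can group discovered nodes by system image guid to decide which nodes belong to the same system. The Python sketch below is illustrative only: the function name and input layout are hypothetical, the second node's system image guid is made up, and treating a zero guid as "unassigned" rather than as a real system is an assumption, not a description of how ibdiagnet or the SM behaves.

```python
from collections import defaultdict


def group_by_system_image_guid(nodes):
    """Group (node_name, system_image_guid) pairs by guid.

    Nodes whose guid is 0 are returned separately, on the assumption
    that zero means the vendor did not assign one.
    """
    systems = defaultdict(list)
    unassigned = []
    for name, sys_guid in nodes:
        if sys_guid == 0:
            unassigned.append(name)
        else:
            systems[sys_guid].append(name)
    return dict(systems), unassigned


# Hypothetical input: the first entry mirrors the HCA from the report,
# the others use an invented guid purely for illustration.
nodes = [
    ("submit.ufhpc", 0x0000000000000000),
    ("switch-a/U1", 0x0005AD0000012345),
    ("switch-a/U2", 0x0005AD0000012345),
]
systems, unassigned = group_by_system_image_guid(nodes)
print(systems)
print(unassigned)
```
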
Regards,
Oren.




