[ofa-general] ibdiagnet FM master / standby error report
Matthias Blankenhaus
matthias at sgi.com
Mon Aug 25 14:09:55 PDT 2008
Howdy !
I have noticed that ibdiagnet reports an error when using a
master / standby FM configuration. I am using OFED-1.3.1.
Here it goes:
# ibdiagnet
....
-I---------------------------------------------------
-I- Bad Fabric SM Info
-I---------------------------------------------------
-E- Found more then one master SM in the discover fabric
r1lead/P1 priority:15
r2lead/P1 priority:0
....
-I- Stages Status Report:
    STAGE                          Errors  Warnings
    Bad GUIDs/LIDs Check           0       0
    Link State Active Check        0       0
    SM Info Check                  1       0
    Performance Counters Report    0       6
    Partitions Check               0       0
    IPoIB Subnets Check            0       0
This is incorrect, as we have only one master, namely r1lead. r2lead is a
standby only.
The culprit for this problem seems to be this file:
/usr/lib64/ibdiagnet1.2/ibdebug.tcl
Here is the if stmt that creates the problem:
...
proc CheckSM {} {
    global SM G
    set master 3
    if {![info exists SM($master)]} {
        inform "-I-ibdiagnet:bad.sm.header"
        inform "-E-ibdiagnet:no.SM"
    } else {
        if {[llength $SM($master)] != 1} {
==>                                ^^^^
            inform "-I-ibdiagnet:bad.sm.header"
            inform "-E-ibdiagnet:many.SM.master"
            foreach element $SM($master) {
                set tmpDirectPath [lindex $element 0]
                set nodeName [DrPath2Name $tmpDirectPath -port [GetEntryPort $tmpDirectPath]]
                if { $tmpDirectPath == "" } {
....
It appears that this code does not factor in the priority of an individual
FM. It simply counts the FM instances and, if the resulting number does not
equal 1, the tool reports an error.
From studying the OFED code (osm_state_mgr.h::osm_sm_is_greater_than()) it
is clear that, even if two FM instances for the same fabric have an
identical priority, there is always only one winner, because the tie is
resolved via the GUIDs.
Here is the relevant OFED code:
static inline boolean_t
osm_sm_is_greater_than(IN const uint8_t l_priority,
                       IN const ib_net64_t l_guid,
                       IN const uint8_t r_priority, IN const ib_net64_t r_guid)
{
        if (l_priority > r_priority) {
                return (TRUE);
        } else {
                if (l_priority == r_priority) {
                        if (cl_ntoh64(l_guid) < cl_ntoh64(r_guid)) {
                                return (TRUE);
                        }
                }
        }
        return (FALSE);
}
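To illustrate why this ordering always yields exactly one winner, here is a
minimal standalone sketch. It is not OFED code: struct sm_entry and
elect_master() are hypothetical names made up for this example, the GUIDs
are assumed to be distinct and already in host byte order (so cl_ntoh64()
is left out), and only the comparison rule is taken from the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one SM instance (not an OFED type).
 * The GUID is assumed to be in host byte order already. */
struct sm_entry {
    const char *name;
    uint8_t priority;
    uint64_t guid;
};

/* Same rule as osm_sm_is_greater_than(): the higher priority wins,
 * and on equal priority the lower GUID wins. */
static int sm_is_greater_than(const struct sm_entry *l,
                              const struct sm_entry *r)
{
    if (l->priority != r->priority)
        return l->priority > r->priority;
    return l->guid < r->guid;
}

/* Return the index of the winner among n >= 1 candidates.
 * With distinct GUIDs the ordering is strict, so the winner is unique. */
static size_t elect_master(const struct sm_entry *sms, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (sm_is_greater_than(&sms[i], &sms[best]))
            best = i;
    }
    return best;
}
```

With the priorities from the report above (r1lead at 15, r2lead at 0) and
any two distinct GUIDs, elect_master() picks r1lead. If both priorities
were equal, the entry with the lower GUID would be picked.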
Thus, in my opinion the check against number of FM instances in
ibdebug.tcl is superfluous. And indeed, removing the check resolves the
issue. The new version of the above func looks like this:
proc CheckSM {} {
    global SM G
    set master 3
    if {![info exists SM($master)]} {
        inform "-I-ibdiagnet:bad.sm.header"
        inform "-E-ibdiagnet:no.SM"
    }
    return 0
}
This simply checks whether there is an FM instance at all. If there is
none, then that constitutes an error. However, multiple FM instances
should not create an error.
Thanx,
Matthias