[ofa-general] ibdiagnet FM master / standby error report

Matthias Blankenhaus matthias at sgi.com
Mon Aug 25 14:09:55 PDT 2008


Howdy!

I have noticed that ibdiagnet reports an error when using a
master / standby FM configuration. I am using OFED-1.3.1.

Here it goes:

# ibdiagnet 
....
-I---------------------------------------------------
-I- Bad Fabric SM Info
-I---------------------------------------------------
-E- Found more then one master SM in the discover fabric
    r1lead/P1  priority:15
    r2lead/P1  priority:0
....
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0
    SM Info Check                            1      0
    Performance Counters Report              0      6
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      0


This is incorrect: we have only one master, namely r1lead.  r2lead is
standby only.


The culprit for this problem seems to be this file:

/usr/lib64/ibdiagnet1.2/ibdebug.tcl

Here is the if statement that creates the problem (proc CheckSM, which
starts at line 2988 of that file):
...
proc CheckSM {} {
    global SM G
    set master 3
    if {![info exists SM($master)]} {
        inform "-I-ibdiagnet:bad.sm.header"
        inform "-E-ibdiagnet:no.SM"
    } else {
        if {[llength $SM($master)] != 1} {
==>                                ^^^^
            inform "-I-ibdiagnet:bad.sm.header"
            inform "-E-ibdiagnet:many.SM.master"
            foreach element $SM($master) {
                set tmpDirectPath [lindex $element 0]
                set nodeName [DrPath2Name $tmpDirectPath -port [GetEntryPort $tmpDirectPath]]
                if { $tmpDirectPath == "" } {
....

It appears that this code does not take the priority of an individual
FM into account. It simply counts the FM instances, and if the result
is not equal to 1, the tool reports an error.

From studying the OFED code (osm_state_mgr.h::osm_sm_is_greater_than()) it
is clear that, even if two FM instances on the same fabric have an
identical priority, there is always exactly one winner, because the tie
is resolved via the GUIDs (the lower GUID wins).

Here is the relevant OFED code:

static inline boolean_t
osm_sm_is_greater_than(IN const uint8_t l_priority,
               IN const ib_net64_t l_guid,
               IN const uint8_t r_priority, IN const ib_net64_t r_guid)
{
    if (l_priority > r_priority) {
        return (TRUE);
    } else {
        if (l_priority == r_priority) {
            if (cl_ntoh64(l_guid) < cl_ntoh64(r_guid)) {
                return (TRUE);
            }
        }
    }
    return (FALSE);
}
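
To make the tie-break concrete, here is a minimal standalone C sketch of
the same comparison. It is not the OpenSM code: the name
sm_is_greater_than is made up for illustration, and GUIDs are assumed to
be in host byte order already, so no cl_ntoh64() conversion is done.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for osm_sm_is_greater_than().
 * Assumption: both GUIDs are already in host byte order. */
static bool sm_is_greater_than(uint8_t l_priority, uint64_t l_guid,
                               uint8_t r_priority, uint64_t r_guid)
{
    /* A higher priority always wins. */
    if (l_priority != r_priority)
        return l_priority > r_priority;

    /* Equal priority: the lower GUID wins the tie. */
    return l_guid < r_guid;
}
```

With the priorities from the report above (15 for r1lead, 0 for r2lead),
sm_is_greater_than(15, 0x2, 0, 0x1) is true and the reverse call is
false, whatever the GUIDs are (0x1 and 0x2 are made-up values). With
equal priorities, only the side with the lower GUID compares as greater.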


Thus, in my opinion, the check against the number of FM instances in
ibdebug.tcl is superfluous.  And indeed, removing the check resolves the
issue.  The new version of the above function looks like this:


proc CheckSM {} {
    global SM G
    set master 3
    if {![info exists SM($master)]} {
        inform "-I-ibdiagnet:bad.sm.header"
        inform "-E-ibdiagnet:no.SM"
    }
    return 0
}


This simply checks whether there is an FM instance at all.  If there is
none, that constitutes an error.  However, multiple FM instances
should not be reported as an error.
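
As a rough illustration of why several FM instances need not be an
error, the C sketch below picks the single winner from a list of SMs
using the same ordering as osm_sm_is_greater_than() (higher priority
first, then lower GUID). The struct sm_entry, the function elect_master
and any GUID values used with it are hypothetical; this is not ibdiagnet
or OpenSM code, and GUIDs are assumed to be in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one discovered SM. */
struct sm_entry {
    uint8_t  priority;
    uint64_t guid;      /* assumed to be in host byte order */
};

/* Returns the index of the SM that wins the comparison,
 * or -1 if the list is empty (the only case treated as an error). */
static int elect_master(const struct sm_entry *sms, size_t count)
{
    size_t best = 0;
    size_t i;

    if (count == 0)
        return -1;

    for (i = 1; i < count; i++) {
        if (sms[i].priority > sms[best].priority ||
            (sms[i].priority == sms[best].priority &&
             sms[i].guid < sms[best].guid))
            best = i;
    }
    return (int)best;
}
```

For two entries with priorities 0 and 15, the one with priority 15 is
returned; for entries that all share one priority, the one with the
lowest GUID is returned. Only an empty list yields -1.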


Thanx,
Matthias


