[ofa-general] ibdiagnet FM master / standby error report

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Tue Aug 26 13:26:30 PDT 2008


Hi Matthias,

Matthias Blankenhaus wrote:
> Howdy !
> 
> I have noticed that ibdiagnet reports an error when using a
> master / standby FM configuration. I am using OFED-1.3.1.
> 
> Here it goes:
> 
> # ibdiagnet 
> ....
> -I---------------------------------------------------
> -I- Bad Fabric SM Info
> -I---------------------------------------------------
> -E- Found more then one master SM in the discover fabric
>     r1lead/P1  priority:15
>     r2lead/P1  priority:0
> ....
> -I- Stages Status Report:
>     STAGE                                    Errors Warnings
>     Bad GUIDs/LIDs Check                     0      0
>     Link State Active Check                  0      0
>     SM Info Check                            1      0
>     Performance Counters Report              0      6
>     Partitions Check                         0      0
>     IPoIB Subnets Check                      0      0
> 
> 
> This is incorrect, as we have only one master, namely r1lead.  r2lead is
> standby only.
> 
> 
> The culprit for this problem seems to be this file:
> 
> /usr/lib64/ibdiagnet1.2/ibdebug.tcl
> 
> Here is the if statement that causes the problem:
> ...
> proc CheckSM {} {
>     global SM G
>     set master 3
>     if {![info exists SM($master)]} {
>         inform "-I-ibdiagnet:bad.sm.header"
>         inform "-E-ibdiagnet:no.SM"
>     } else {
>         if {[llength $SM($master)] != 1} {
> ==>                                ^^^^
> 
>             inform "-I-ibdiagnet:bad.sm.header"
>             inform "-E-ibdiagnet:many.SM.master"
>             foreach element $SM($master) {
>                 set tmpDirectPath [lindex $element 0]
>                 set nodeName [DrPath2Name $tmpDirectPath -port [GetEntryPort $tmpDirectPath]]
>                 if { $tmpDirectPath == "" } {
> ....
> 
> It appears that this code does not factor in the priority of an individual 
> FM.  It simply counts the FM instances, and if the resulting number does
> not equal 1, the tool reports an error.
> 
> From studying the OFED code (osm_state_mgr.h::osm_sm_is_greater_than()) it 
> is clear that, even if two FM instances for the same fabric have an 
> identical priority, there is always exactly one winner, because the tie
> is resolved via the GUIDs.
> 
> Here is the relevant OFED code:
> 
> static inline boolean_t
> osm_sm_is_greater_than(IN const uint8_t l_priority,
>                IN const ib_net64_t l_guid,
>                IN const uint8_t r_priority, IN const ib_net64_t r_guid)
> {
>     if (l_priority > r_priority) {
>         return (TRUE);
>     } else {
>         if (l_priority == r_priority) {
>             if (cl_ntoh64(l_guid) < cl_ntoh64(r_guid)) {
>                 return (TRUE);
>             }
>         }
>     }
>     return (FALSE);
> }
> 
> 
> Thus, in my opinion, the check against the number of FM instances in 
> ibdebug.tcl is superfluous.  And indeed, removing the check resolves the 
> issue.  The new version of the above function looks like this:
> 
> 
> proc CheckSM {} {
>     global SM G
>     set master 3
>     if {![info exists SM($master)]} {
>         inform "-I-ibdiagnet:bad.sm.header"
>         inform "-E-ibdiagnet:no.SM"
>     }
>     return 0
> }
> 
> 
> This simply checks whether there is an FM instance at all.  If there is 
> none, that constitutes an error.  However, multiple FM instances
> should not create an error.

This check in ibdiagnet is supposed to report an error
when there is more than one *master* SM in the subnet.
That can happen in some cases, so the check is valid.
However, I think that only a master SM can get into that
SM list, so either you really do have a problem with
two master SMs in the subnet, or there is a bug in ibdiagnet
and it somehow included a non-master SM in that list.
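
To illustrate the distinction, here is a minimal C sketch (this is not
ibdiagnet or OpenSM code). It assumes a hypothetical struct sm_info
record with GUIDs already in host byte order, and the SMInfo state
encoding in which 3 means master, matching "set master 3" in CheckSM.
The comparison mirrors the osm_sm_is_greater_than() logic quoted above;
the names sm_is_greater_than, elect_master and count_masters are made
up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SM states as carried in SMInfo; 3 matches "set master 3" in CheckSM. */
enum sm_state {
    SM_NOT_ACTIVE  = 0,
    SM_DISCOVERING = 1,
    SM_STANDBY     = 2,
    SM_MASTER      = 3
};

/* Hypothetical record for one discovered SM (guid in host byte order). */
struct sm_info {
    const char   *name;
    enum sm_state state;
    uint8_t       priority;
    uint64_t      guid;
};

/* Same ordering as the quoted osm_sm_is_greater_than():
 * higher priority wins, and the lower GUID breaks a priority tie. */
static int sm_is_greater_than(const struct sm_info *l,
                              const struct sm_info *r)
{
    if (l->priority != r->priority)
        return l->priority > r->priority;
    return l->guid < r->guid;
}

/* Index of the SM that wins the election; always exactly one winner. */
static size_t elect_master(const struct sm_info *sms, size_t n)
{
    size_t best = 0;

    assert(sms != NULL && n > 0);
    for (size_t i = 1; i < n; i++)
        if (sm_is_greater_than(&sms[i], &sms[best]))
            best = i;
    return best;
}

/* Number of SMs that actually report the master state.
 * This is what the check counts; priority plays no part in it. */
static size_t count_masters(const struct sm_info *sms, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (sms[i].state == SM_MASTER)
            count++;
    return count;
}
```

With r1lead as master (priority 15) and r2lead as standby (priority 0),
count_masters() gives 1 whatever the priorities are. A report of two
masters therefore means that either both SMs really report the master
state, or a standby entry ended up in the list.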

-- Yevgeny

> 
> Thanx,
> Matthias