[ofa-general] ibdiagnet FM master / standby error report
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Tue Aug 26 13:26:30 PDT 2008
Hi Matthias,
Matthias Blankenhaus wrote:
> Howdy !
>
> I have noticed that ibdiagnet reports an error when using a
> master / standby FM configuration. I am using OFED-1.3.1.
>
> Here it goes:
>
> # ibdiagnet
> ....
> -I---------------------------------------------------
> -I- Bad Fabric SM Info
> -I---------------------------------------------------
> -E- Found more then one master SM in the discover fabric
> r1lead/P1 priority:15
> r2lead/P1 priority:0
> ....
> -I- Stages Status Report:
> STAGE Errors Warnings
> Bad GUIDs/LIDs Check 0 0
> Link State Active Check 0 0
> SM Info Check 1 0
> Performance Counters Report 0 6
> Partitions Check 0 0
> IPoIB Subnets Check 0 0
>
>
> This is incorrect, as we have only one master, namely r1lead; r2lead is
> only a standby.
>
>
> The culprit for this problem seems to be this file:
>
> /usr/lib64/ibdiagnet1.2/ibdebug.tcl
>
> Here is the if stmt that creates the problem:
> ...
> proc CheckSM {} {
>     global SM G
>     set master 3
>     if {![info exists SM($master)]} {
>         inform "-I-ibdiagnet:bad.sm.header"
>         inform "-E-ibdiagnet:no.SM"
>     } else {
>         if {[llength $SM($master)] != 1} {    ;# <== the check in question
>             inform "-I-ibdiagnet:bad.sm.header"
>             inform "-E-ibdiagnet:many.SM.master"
>             foreach element $SM($master) {
>                 set tmpDirectPath [lindex $element 0]
>                 set nodeName [DrPath2Name $tmpDirectPath -port [GetEntryPort $tmpDirectPath]]
>                 if { $tmpDirectPath == "" } {
> ....
>
> It appears that this code does not factor in the priority of an individual
> FM. It simply counts the FM instances and, if the resulting number does not
> equal 1, the tool reports an error.
>
> From studying the OFED code (osm_state_mgr.h::osm_sm_is_greater_than()) it
> is clear that, even if two FM instances for the same fabric have an
> identical priority, there is always exactly one winner, because the tie is
> resolved by comparing GUIDs.
>
> Here is the relevant OFED code:
>
> static inline boolean_t
> osm_sm_is_greater_than(IN const uint8_t l_priority,
>                        IN const ib_net64_t l_guid,
>                        IN const uint8_t r_priority, IN const ib_net64_t r_guid)
> {
>     if (l_priority > r_priority) {
>         return (TRUE);
>     } else {
>         if (l_priority == r_priority) {
>             if (cl_ntoh64(l_guid) < cl_ntoh64(r_guid)) {
>                 return (TRUE);
>             }
>         }
>     }
>     return (FALSE);
> }
>
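A standalone sketch of this tie-break may help show why an election always ends with a single winner. It is only an illustration: it uses plain uint64_t GUIDs that are assumed to be in host byte order already (so there is no cl_ntoh64() call), and struct sm_candidate and elect_master() are hypothetical names, not OpenSM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one SM instance; not an OpenSM type. */
struct sm_candidate {
    uint8_t  priority;
    uint64_t guid;      /* assumed to be in host byte order */
};

/* Same ordering as the quoted osm_sm_is_greater_than(), minus the
 * byte-order conversion: higher priority wins, lower GUID breaks ties. */
static bool sm_is_greater_than(uint8_t l_priority, uint64_t l_guid,
                               uint8_t r_priority, uint64_t r_guid)
{
    if (l_priority != r_priority)
        return l_priority > r_priority;
    return l_guid < r_guid;
}

/* Return the index of the candidate that wins against all others.
 * With distinct GUIDs the ordering is strict, so the winner is unique. */
static size_t elect_master(const struct sm_candidate *sms, size_t n)
{
    assert(sms != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (sm_is_greater_than(sms[i].priority, sms[i].guid,
                               sms[best].priority, sms[best].guid))
            best = i;
    }
    return best;
}
```

With priorities 15 and 0, as in the ibdiagnet output above, the priority-15 instance wins whatever the GUIDs are; the GUID comparison only matters when the priorities are equal.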
>
> Thus, in my opinion the check against the number of FM instances in
> ibdebug.tcl is superfluous. And indeed, removing the check resolves the
> issue. The new version of the above function looks like this:
>
>
> proc CheckSM {} {
>     global SM G
>     set master 3
>     if {![info exists SM($master)]} {
>         inform "-I-ibdiagnet:bad.sm.header"
>         inform "-E-ibdiagnet:no.SM"
>     }
>     return 0
> }
>
>
> This simply checks whether there is an FM instance at all. If there is
> none, then that constitutes an error. However, multiple FM instances
> should not create an error.
This check in ibdiagnet is supposed to report an error
when there is more than one *master* SM in the subnet.
That can happen in some cases, so the check is valid.
However, I think that only master SMs can get into that
SM list, so either you really do have two master SMs in
the subnet, or there is a bug in ibdiagnet and it somehow
included a non-master SM in that list.
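
To make the intended behaviour concrete, here is a rough sketch of the check, written in C rather than the Tcl that ibdiagnet actually uses. The state values follow the SMInfo SM state encoding (2 = standby, 3 = master, which matches "set master 3" in CheckSM). The struct, enum and function names are made up for illustration and are not ibdiagnet or OpenSM code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SMInfo SM state values; 3 corresponds to "set master 3" in CheckSM. */
enum sm_state {
    SM_NOT_ACTIVE  = 0,
    SM_DISCOVERING = 1,
    SM_STANDBY     = 2,
    SM_MASTER      = 3
};

/* Hypothetical record for one discovered SM. */
struct sm_entry {
    const char   *name;
    uint8_t       priority;
    enum sm_state state;
};

enum sm_check { SM_CHECK_OK, SM_CHECK_NO_MASTER, SM_CHECK_MANY_MASTERS };

/* Count only the entries that report the master state. */
static size_t count_masters(const struct sm_entry *sms, size_t n)
{
    assert(sms != NULL || n == 0);
    size_t masters = 0;
    for (size_t i = 0; i < n; i++) {
        if (sms[i].state == SM_MASTER)
            masters++;
    }
    return masters;
}

/* Standby SMs are ignored; only zero or several masters is an error. */
static enum sm_check check_sm(const struct sm_entry *sms, size_t n)
{
    size_t masters = count_masters(sms, n);
    if (masters == 0)
        return SM_CHECK_NO_MASTER;
    if (masters > 1)
        return SM_CHECK_MANY_MASTERS;
    return SM_CHECK_OK;
}
```

With one master (priority 15) and one standby (priority 0), a check along these lines reports no error. It reports several masters only if both entries really are in the master state, which is the case the existing check is meant to catch.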
-- Yevgeny
>
> Thanx,
> Matthias