[ofa-general] ibdiagnet FM master / standby error report

Oren Kladnitsky orenk at dev.mellanox.co.il
Thu Aug 28 08:36:01 PDT 2008


Yevgeny Kliteynik wrote:
> Hi Matthias,
>
> Matthias Blankenhaus wrote:
>> Howdy !
>>
>> I have noticed that ibdiagnet reports an error when using a
>> master / standby FM configuration. I am using OFED-1.3.1.
>>
>> Here it goes:
>>
>> # ibdiagnet ....
>> -I---------------------------------------------------
>> -I- Bad Fabric SM Info
>> -I---------------------------------------------------
>> -E- Found more then one master SM in the discover fabric
>>     r1lead/P1  priority:15
>>     r2lead/P1  priority:0
>> ....
>> -I- Stages Status Report:
>>     STAGE                                    Errors Warnings
>>     Bad GUIDs/LIDs Check                     0      0
>>     Link State Active Check                  0      0
>>     SM Info Check                            1      0
>>     Performance Counters Report              0      6
>>     Partitions Check                         0      0
>>     IPoIB Subnets Check                      0      0
>>
>>
>> This is incorrect as we have only one master namely r1lead.  r2lead 
>> is a standby only.
>>
>>
>> The culprit for this problem seems to be this file:
>>
>> /usr/lib64/ibdiagnet1.2/ibdebug.tcl
>>
>> Here is the if statement that creates the problem:
>> ...
>> proc CheckSM {} {
>>     global SM G
>>     set master 3
>>     if {![info exists SM($master)]} {
>>         inform "-I-ibdiagnet:bad.sm.header"
>>         inform "-E-ibdiagnet:no.SM"
>>     } else {
>>         if {[llength $SM($master)] != 1} {
>> ==>                                ^^^^
>>             inform "-I-ibdiagnet:bad.sm.header"
>>             inform "-E-ibdiagnet:many.SM.master"
>>             foreach element $SM($master) {
>>                 set tmpDirectPath [lindex $element 0]
>>                 set nodeName [DrPath2Name $tmpDirectPath -port [GetEntryPort $tmpDirectPath]]
>>                 if { $tmpDirectPath == "" } {
>> ....
>>
>> It appears that this code does not factor in the priority of an 
>> individual FM. It simply counts the FM instances, and if the resulting 
>> number does not equal 1, the tool reports an error. 
>> From studying the OFED code 
>> (osm_state_mgr.h::osm_sm_is_greater_than()) it is clear that, even if 
>> two FM instances for the same fabric have an identical priority, 
>> there is always only one winner, because the tie is resolved via GUIDs.
>>
>> Here is the relevant OFED code:
>>
>> static inline boolean_t
>> osm_sm_is_greater_than(IN const uint8_t l_priority,
>>                IN const ib_net64_t l_guid,
>>                IN const uint8_t r_priority, IN const ib_net64_t r_guid)
>> {
>>     if (l_priority > r_priority) {
>>         return (TRUE);
>>     } else {
>>         if (l_priority == r_priority) {
>>             if (cl_ntoh64(l_guid) < cl_ntoh64(r_guid)) {
>>                 return (TRUE);
>>             }
>>         }
>>     }
>>     return (FALSE);
>> }
>>
>>
>> Thus, in my opinion the check against number of FM instances in 
>> ibdebug.tcl is superfluous.  And indeed, removing the check resolves 
>> the issue.  The new version of the above func looks like this:
>>
>>
>> proc CheckSM {} {
>>     global SM G
>>     set master 3
>>     if {![info exists SM($master)]} {
>>         inform "-I-ibdiagnet:bad.sm.header"
>>         inform "-E-ibdiagnet:no.SM"
>>     }
>>     return 0
>> }
>>
>>
>> This simply checks whether there is an FM instance at all.  If there 
>> is none, then that constitutes an error.  However, multiple FM instances
>> should not create an error.
>
> This check in ibdiagnet is supposed to report an error
> in case of more than one *master* SM in the subnet.
> That can happen in some cases, so the check is valid.
> However, I think that only a master SM can get into that
> SM list, so either you really have a problem with
> two master SMs in the subnet, or there is a bug in ibdiagnet
> and somehow it included a non-master SM in that list.
>
> -- Yevgeny
>
>>
>> Thanx,
>> Matthias
>
Hi,

ibdiagnet is a diagnostic tool. Its goal is to verify things that 
"should work" on a good IB fabric and to report them when they do not.
Therefore, it does some checks that may seem superfluous, such as:
- Multiple SM masters
- Duplicated device GUIDs
- Credit loops in switch connectivity
etc.

A key feature in checking these issues is to rely as little as possible 
on the correctness of other tools.

Regarding the SM check:

Only the number of master SMs (SMs in state 3) is checked; more than one 
is reported as an error. Standby SMs are not counted.
See file /tmp/ibdiagnet.sm for a list of the SMs (and their priority / 
state) in the fabric.
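
To illustrate the idea, here is a minimal C sketch of such a check. It is 
not the ibdiagnet code (which is Tcl); the struct and function names are 
made up for this example. The only thing taken from the IB definitions is 
the numbering of the SM states, where 3 means master. Note that priority 
plays no part in the count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SM states as numbered in the SMInfo attribute. */
enum sm_state {
    SM_STATE_NOT_ACTIVE  = 0,
    SM_STATE_DISCOVERING = 1,
    SM_STATE_STANDBY     = 2,
    SM_STATE_MASTER      = 3
};

/* Hypothetical record for one discovered SM (not an ibdiagnet type). */
struct sm_entry {
    const char *name;      /* e.g. "r1lead/P1" */
    uint8_t priority;      /* recorded, but not used by the check */
    enum sm_state state;
};

/* Count the SMs that report master state; priority is ignored. */
size_t count_masters(const struct sm_entry *sms, size_t n)
{
    size_t masters = 0;

    assert(sms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (sms[i].state == SM_STATE_MASTER)
            masters++;
    }
    return masters;
}

/* Returns 0 for exactly one master, -1 for none, 1 for more than one. */
int check_sm(const struct sm_entry *sms, size_t n)
{
    size_t masters = count_masters(sms, n);

    if (masters == 0)
        return -1;
    return masters == 1 ? 0 : 1;
}
```

With one SM in state 3 and one in state 2 (standby), check_sm() returns 0 
whatever the priorities are. It flags an error only when two or more SMs 
report state 3 themselves.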

So I guess there is an issue with the SMs in your setup (2 masters in 1 
subnet). Please send the relevant info to the opensm maintainer.

Thanks,
Oren.











