[ofa-general] ibdiagnet FM master / standby error report
Oren Kladnitsky
orenk at dev.mellanox.co.il
Thu Aug 28 08:36:01 PDT 2008
Yevgeny Kliteynik wrote:
> Hi Matthias,
>
> Matthias Blankenhaus wrote:
>> Howdy !
>>
>> I have noticed that ibdiagnet reports an error when using a
>> master / standby FM configuration. I am using OFED-1.3.1.
>>
>> Here it goes:
>>
>> # ibdiagnet ....
>> -I---------------------------------------------------
>> -I- Bad Fabric SM Info
>> -I---------------------------------------------------
>> -E- Found more then one master SM in the discover fabric
>> r1lead/P1 priority:15
>> r2lead/P1 priority:0
>> ....
>> -I- Stages Status Report:
>>     STAGE                          Errors  Warnings
>>     Bad GUIDs/LIDs Check           0       0
>>     Link State Active Check        0       0
>>     SM Info Check                  1       0
>>     Performance Counters Report    0       6
>>     Partitions Check               0       0
>>     IPoIB Subnets Check            0       0
>>
>>
>> This is incorrect, as we have only one master, namely r1lead. r2lead
>> is only a standby.
>>
>>
>> The culprit for this problem seems to be this file:
>>
>> /usr/lib64/ibdiagnet1.2/ibdebug.tcl
>>
>> Here is the if statement that creates the problem (it starts at
>> line 2988 of that file):
>> ...
>> proc CheckSM {} {
>>     global SM G
>>     set master 3
>>     if {![info exists SM($master)]} {
>>         inform "-I-ibdiagnet:bad.sm.header"
>>         inform "-E-ibdiagnet:no.SM"
>>     } else {
>>         if {[llength $SM($master)] != 1} {
>> ==>                                ^^^^
>>             inform "-I-ibdiagnet:bad.sm.header"
>>             inform "-E-ibdiagnet:many.SM.master"
>>             foreach element $SM($master) {
>>                 set tmpDirectPath [lindex $element 0]
>>                 set nodeName [DrPath2Name $tmpDirectPath -port [GetEntryPort $tmpDirectPath]]
>>                 if { $tmpDirectPath == "" } {
>> ....
>>
>> It appears that this code does not factor in the priority of an
>> individual FM. It simply counts the FM instances, and if the resulting
>> number does not equal 1, the tool reports an error.
>> From studying the OFED code
>> (osm_state_mgr.h::osm_sm_is_greater_than()) it is clear that, even if
>> two FM instances for the same fabric have an identical priority,
>> there is always only one winner, because the tie is resolved via GUIDs.
>>
>> Here is the relevant OFED code:
>>
>> static inline boolean_t
>> osm_sm_is_greater_than(IN const uint8_t l_priority,
>>                        IN const ib_net64_t l_guid,
>>                        IN const uint8_t r_priority,
>>                        IN const ib_net64_t r_guid)
>> {
>>     if (l_priority > r_priority) {
>>         return (TRUE);
>>     } else {
>>         if (l_priority == r_priority) {
>>             if (cl_ntoh64(l_guid) < cl_ntoh64(r_guid)) {
>>                 return (TRUE);
>>             }
>>         }
>>     }
>>     return (FALSE);
>> }
>>
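For illustration, the comparison quoted above can be tried in isolation with the minimal C sketch below. It is a stand-in under stated assumptions, not OpenSM code: sm_wins(), elect_master() and struct sm_candidate are hypothetical names, and GUIDs are assumed to be in host byte order already, so the cl_ntoh64() conversion is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one SM taking part in the election. */
struct sm_candidate {
    const char *name;
    uint8_t priority;
    uint64_t guid;      /* assumed host byte order */
};

/* Same ordering as the quoted function: higher priority wins,
 * and on equal priority the lower GUID wins. */
static bool sm_wins(uint8_t l_priority, uint64_t l_guid,
                    uint8_t r_priority, uint64_t r_guid)
{
    if (l_priority != r_priority)
        return l_priority > r_priority;
    return l_guid < r_guid;
}

/* Returns the index of the single candidate that beats all others. */
static size_t elect_master(const struct sm_candidate *sms, size_t n)
{
    assert(sms != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (sm_wins(sms[i].priority, sms[i].guid,
                    sms[best].priority, sms[best].guid))
            best = i;
    }
    return best;
}
```

Whatever priorities and GUIDs are passed in, elect_master() returns exactly one index. That only says which SM should win the election, not which SMs actually report themselves as master on the wire.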
>>
>> Thus, in my opinion the check against number of FM instances in
>> ibdebug.tcl is superfluous. And indeed, removing the check resolves
>> the issue. The new version of the above func looks like this:
>>
>>
>> proc CheckSM {} {
>>     global SM G
>>     set master 3
>>     if {![info exists SM($master)]} {
>>         inform "-I-ibdiagnet:bad.sm.header"
>>         inform "-E-ibdiagnet:no.SM"
>>     }
>>     return 0
>> }
>>
>>
>> This simply checks whether there is an FM instance at all. If there
>> is none, then that constitutes an error. However, multiple FM instances
>> should not create an error.
>
> This check in ibdiagnet is supposed to report an error
> when there is more than one *master* SM in the subnet.
> That can happen in some cases, so the check is valid.
> However, I think that only a master SM can get into that
> SM list, so either you really have a problem with
> two master SMs in the subnet, or there is a bug in ibdiagnet
> and it somehow included a non-master SM in that list.
>
> -- Yevgeny
>
>>
>> Thanx,
>> Matthias
>
Hi,
ibdiagnet is a diagnostic tool. Its goal is to verify things that
"should work" on a good IB fabric and to report when they do not.
Therefore, it does some checks that may seem superfluous, such as:
- Multiple SM masters
- Duplicated device GUIDs
- Credit loops in switch connectivity
etc.
A key principle in these checks is to rely as little as possible
on the correctness of other tools.
Regarding the SM check:
Only the SMs in master state (state 3) are counted, and an error is
reported when that count is not exactly one.
See file /tmp/ibdiagnet.sm for a list of the SMs (and their priority /
state) in the fabric.
So I guess there is an issue with the SMs in your setup (2 masters in 1
subnet). Please send the relevant info to the opensm maintainer.
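
To make the counting rule concrete, here is a minimal C sketch of the same logic. It is not ibdiagnet code (the real check is the Tcl CheckSM proc quoted above); struct sm_record, count_masters() and check_sm() are made-up names for illustration only. The state values follow the SMInfo SMState encoding, where 3 means master.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SMInfo SMState values; the check only looks at state 3. */
enum sm_state {
    SM_STATE_NOT_ACTIVE  = 0,
    SM_STATE_DISCOVERING = 1,
    SM_STATE_STANDBY     = 2,
    SM_STATE_MASTER      = 3
};

/* Hypothetical record for one discovered SM (not an ibdiagnet type). */
struct sm_record {
    const char *name;
    uint8_t priority;
    enum sm_state state;
};

/* Counts SMs that report the master state; priority plays no role. */
static size_t count_masters(const struct sm_record *sms, size_t n)
{
    assert(sms != NULL || n == 0);
    size_t masters = 0;
    for (size_t i = 0; i < n; i++) {
        if (sms[i].state == SM_STATE_MASTER)
            masters++;
    }
    return masters;
}

/* Returns 0 for exactly one master, -1 for none, 1 for more than one. */
static int check_sm(const struct sm_record *sms, size_t n)
{
    size_t masters = count_masters(sms, n);
    if (masters == 0)
        return -1;
    return masters > 1 ? 1 : 0;
}
```

With one SM in state 3 and one in state 2 (standby), check_sm() returns 0. It returns 1 only when two or more SMs report state 3, which is the situation the error message describes.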
Thanks,
Oren.