[ofa-general] OpenSM "Dead end on path to LID"

Yevgeny Kliteynik kliteyn at dev.mellanox.co.il
Wed Jul 23 04:29:13 PDT 2008


Nathan Dauchy wrote:

>>> I believe MAXSMPS was originally set to 0 (unlimited), based on
>>> duplicating the config file from an older SM setup.  I reduced it to 32
>>> when we saw some IB errors on standalone System B.  I'm afraid I don't
>>> have documentation on what those problems were, but I don't recall
>>> seeing the exact same symptoms.  I think it was MAD timeout error
>>> messages that prompted me to change the MAXSMPS value.
>> You probably have the timeout errors in the osm log now as well.
>> These errors are followed by the "unknown remote side" messages,
>> which means that osm didn't get a response to some MADs (timeout).
>>
>>> We will be able to test any fixes during a scheduled system downtime on
>>> 7/24.  At that point, do you recommend trying MAXSMPS=4?
>> In ofed 1.3.1, the default conf file value for maxsmps should be 4.
>> If you're using 0, then I'm pretty sure that this is what is causing
>> the problems. Try 4 - it should be fine.
> 
> We went ahead and tried both MAXSMPS=4 and MAXSMPS=1.  The symptoms did
> not improve with all the nodes booted. :(
> 
> For the record, here is exactly how opensm is running now:
> 
> # ps uaxw | grep open
> root     23112  1.0  0.1 288432 16732 ?        Sl   18:10   0:07
> /opt/ofed/1.3.1/sbin/opensm -maxsmps 1 -t 600 -f /var/run/osm/osm.log -R
> updn -g 0 --honor_guid2lid

Please try running opensm with the default routing (without '-R updn').
I'm trying to determine whether this is a routing issue or a discovery issue.
Also, where is opensm running?
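
Something like the following, as a rough sketch: it assumes the same
binary path and options shown in your ps output, with only '-R updn'
dropped. The variable names are just for illustration, and the snippet
only prints the command so you can check it before starting it by hand.

```shell
# Same options as the currently running instance, minus '-R updn',
# so opensm uses its default routing instead of up/down.
OPENSM=/opt/ofed/1.3.1/sbin/opensm
OPENSM_ARGS="-maxsmps 1 -t 600 -f /var/run/osm/osm.log -g 0 --honor_guid2lid"

# Print the full command line for review; run it manually once the
# existing opensm instance has been stopped.
echo "$OPENSM $OPENSM_ARGS"
```

If the errors go away with the default routing, that points at the
routing stage; if they stay, discovery is the more likely suspect.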

-- Yevgeny

>>> If that doesn't work, what else should we be prepared to try, or what
>>> other debugging information would be helpful to gather?
>>>
>>> For the record, other steps we are considering:
>>> * Latest OpenSM code
>>> * Upgrading firmware on all IB switches
>>> * Changing topology to remove the 12X links (ugh!)
>> Hope that there won't be a need for other steps.
> 
> To try to take the 12X links out of the equation, we replaced the (3)
> "Subtree C" spine switches with ones burned with the default 24*4X .INI
> file, and removed 2 cables from each 3-cable bundle.  The "Subtree C"
> Clos/Edge switches and Root/Aggregation switches were still configured
> as 12X links, but according to "ibdiagnet", they negotiated down to 4X.
> 
> (I can furnish a nicer diagram of our IB tree off-list if anyone would
> like to take a look at it.)
> 
> 
> Any other things we can try?
> 
> 
> Thanks again,
> Nathan
> 



