[ofa-general] OpenSM "Dead end on path to LID"
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Wed Jul 23 04:29:13 PDT 2008
Nathan Dauchy wrote:
>>> I believe MAXSMPS was originally set to 0 (unlimited), based on
>>> duplicating the config file from an older SM setup. I reduced it to 32
>>> when we saw some IB errors on standalone System B. I'm afraid I don't
>>> have documentation on what those problems were, but I don't recall
>>> seeing the exact same symptoms. I think it was MAD timeout error
>>> messages that prompted me to change the MAXSMPS value.
>> You probably have the timeout errors in the osm log now as well.
>> These errors are followed by the "unknown remote side" messages,
>> which mean that osm didn't get a response to some MADs (timeout).
>>
>>> We will be able to test any fixes during a scheduled system downtime on
>>> 7/24. At that point, do you recommend trying MAXSMPS=4?
>> In ofed 1.3.1, the default conf file value for maxsmps should be 4.
>> If you're using 0, then I'm pretty sure that this is what is causing
>> the problems. Try 4 - it should be fine.
>
> We went ahead and tried both MAXSMPS=4 and MAXSMPS=1. The symptoms did
> not improve with all the nodes booted. :(
>
> For the record, here is exactly how opensm is running now:
>
> # ps uaxw | grep open
> root 23112 1.0 0.1 288432 16732 ? Sl 18:10 0:07
> /opt/ofed/1.3.1/sbin/opensm -maxsmps 1 -t 600 -f /var/run/osm/osm.log -R
> updn -g 0 --honor_guid2lid
Please try running opensm with the default routing (without '-R updn').
I'm just trying to understand whether this is a routing or a discovery issue.
Also, where does opensm run?
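
To make the suggested test concrete, here is a minimal sketch that takes the
command line quoted above and drops the '-R updn' pair, leaving every other
option as it was. The helper name drop_option is hypothetical and exists only
for this illustration; it is not part of opensm or OFED, and it assumes the
flag takes exactly one argument.

```python
import shlex


def drop_option(cmdline: str, flag: str) -> str:
    """Return cmdline with `flag` and the one argument after it removed.

    Hypothetical helper for illustration only; assumes `flag` is always
    followed by exactly one value (as '-R updn' is in the quoted command).
    """
    kept = []
    skip_next = False
    for arg in shlex.split(cmdline):
        if skip_next:
            # This is the value belonging to the flag we just dropped.
            skip_next = False
            continue
        if arg == flag:
            skip_next = True
            continue
        kept.append(arg)
    return shlex.join(kept)


# Command line exactly as reported in the 'ps' output quoted above.
current = (
    "/opt/ofed/1.3.1/sbin/opensm -maxsmps 1 -t 600 "
    "-f /var/run/osm/osm.log -R updn -g 0 --honor_guid2lid"
)

# Same invocation without an explicit routing engine selection.
print(drop_option(current, "-R"))
```

Running opensm with the resulting arguments keeps the SMP limit, timeout, log
file and GUID options unchanged, so any difference in behavior can be
attributed to the routing engine alone.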
-- Yevgeny
>>> If that doesn't work, what else should we be prepared to try, or what
>>> other debugging information would be helpful to gather?
>>>
>>> For the record, other steps we are considering:
>>> * Latest OpenSM code
>>> * Upgrading firmware on all IB switches
>>> * Changing topology to remove the 12X links (ugh!)
>> Hope that there won't be a need for other steps.
>
> To try to take the 12X links out of the equation, we replaced the (3)
> "Subtree C" spine switches with ones burned with the default 24*4X .INI
> file, and removed 2 cables from each 3-cable bundle. The "Subtree C"
> Clos/Edge switches and Root/Aggregation switches were still configured
> as 12X links, but according to "ibdiagnet", they negotiated down to 4X.
>
> (I can furnish a nicer diagram of our IB tree off-list if anyone would
> like to take a look at it.)
>
>
> Any other things we can try?
>
>
> Thanks again,
> Nathan
>