[ofa-general] OpenSM "Dead end on path to LID"
Nathan Dauchy
Nathan.Dauchy at noaa.gov
Tue Jul 22 11:31:39 PDT 2008
Yevgeny, and anyone else with ideas on this problem,
Yevgeny Kliteynik wrote:
> Nathan Dauchy wrote:
>> Yevgeny Kliteynik wrote:
>>> I understand you already increased transaction time.
>>> Please try limiting SMPs on the wire - in opensm.conf
>>> file, set max_wire_smps to 1 (you probably have 4).
>>> You can also run opensm with '-maxsmps 1' command line
>>> argument.
>>
>> Interesting!
>>
>> I believe MAXSMPS was originally set to 0 (unlimited), based on
>> duplicating the config file from an older SM setup. I reduced it to 32
>> when we saw some IB errors on standalone System B. I'm afraid I don't
>> have documentation on what those problems were, but I don't recall
>> seeing the exact same symptoms. I think it was MAD timeout error
>> messages that prompted me to change the MAXSMPS value.
>
> You probably have the timeout errors in the osm log now as well.
> These errors are followed by the "unknown remote side" messages,
> which means that osm didn't get a response to some MADs (timeout).
>
>> We will be able to test any fixes during a scheduled system downtime on
>> 7/24. At that point, do you recommend trying MAXSMPS=4?
>
> In ofed 1.3.1, the default conf file value for maxsmps should be 4.
> If you're using 0, then I'm pretty sure that this is what is causing
> the problems. Try 4 - it should be fine.
We went ahead and tried both MAXSMPS=4 and MAXSMPS=1. The symptoms did
not improve with all the nodes booted. :(
For the record, here is exactly how opensm is running now:
# ps uaxw | grep open
root 23112 1.0 0.1 288432 16732 ? Sl 18:10 0:07 /opt/ofed/1.3.1/sbin/opensm -maxsmps 1 -t 600 -f /var/run/osm/osm.log -R updn -g 0 --honor_guid2lid
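For anyone who wants to confirm the effective options from a script rather than by eye, here is a minimal Python sketch that splits the command line above into option/value pairs. The helper name `parse_opensm_args` is hypothetical and not part of OpenSM. The "next token is the value" rule is a simple heuristic that happens to fit this particular command line; it is not a full parser for opensm's options.

```python
import shlex

def parse_opensm_args(cmdline):
    """Map each option to its value, or to True for a bare flag.

    Heuristic (an assumption of this sketch): an option takes a value
    when the following token does not itself start with '-'.
    """
    tokens = shlex.split(cmdline)
    opts = {}
    i = 1  # tokens[0] is the executable path
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
            if has_value:
                opts[tok] = tokens[i + 1]
                i += 2
                continue
            opts[tok] = True
        i += 1
    return opts

# The command line exactly as reported by ps in this message.
cmd = ("/opt/ofed/1.3.1/sbin/opensm -maxsmps 1 -t 600 "
       "-f /var/run/osm/osm.log -R updn -g 0 --honor_guid2lid")
opts = parse_opensm_args(cmd)

for name, value in opts.items():
    print(name, value)
```

Run against the line above, this shows at a glance that `-maxsmps` is 1 and `-t` is 600, which is the configuration the symptoms were observed under.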
>> If that doesn't work, what else should we be prepared to try, or what
>> other debugging information would be helpful to gather?
>>
>> For the record, other steps we are considering:
>> * Latest OpenSM code
>> * Upgrading firmware on all IB switches
>> * Changing topology to remove the 12X links (ugh!)
>
> Hope that there won't be a need for other steps.
To try to take the 12X links out of the equation, we replaced the (3)
"Subtree C" spine switches with ones burned with the default 24*4X .INI
file, and removed 2 cables from each 3-cable bundle. The "Subtree C"
Clos/Edge switches and Root/Aggregation switches were still configured
as 12X links, but according to "ibdiagnet", they negotiated down to 4X.
(I can furnish a nicer diagram of our IB tree off-list if anyone would
like to take a look at it.)
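To make the effect of the cable change concrete, the arithmetic can be sketched in Python as below. It assumes each cable in a 3-cable bundle carries 4 lanes (one 4X link's worth), which is how I understand the 12X bundles to be built. The function name `link_lanes` is only illustrative.

```python
def link_lanes(cables_per_link, lanes_per_cable=4):
    """Number of lanes in a link built from a bundle of cables.

    lanes_per_cable=4 is an assumption: one 4X cable per bundle member.
    """
    return cables_per_link * lanes_per_cable

before = link_lanes(3)       # full 3-cable bundle, i.e. a 12X link
after = link_lanes(3 - 2)    # two cables removed from each bundle
fraction = after / before    # share of the original link width that remains

print(before, after, fraction)
```

So each affected link keeps one third of its original width, which matches "ibdiagnet" reporting that the ports negotiated down to 4X.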
Any other things we can try?
Thanks again,
Nathan