[ofa-general] OpenSM "Dead end on path to LID"

Nathan Dauchy Nathan.Dauchy at noaa.gov
Tue Jul 22 11:31:39 PDT 2008


Yevgeny, and anyone else with ideas on this problem,

Yevgeny Kliteynik wrote:
> Nathan Dauchy wrote:
>> Yevgeny Kliteynik wrote:
>>> I understand you already increased transaction time.
>>> Please try limiting SMPs on the wire - in opensm.conf
>>> file, set max_wire_smps to 1 (you probably have 4).
>>> You can also run opensm with '-maxsmps 1' command line
>>> argument.
>>
>> Interesting!
>>
>> I believe MAXSMPS was originally set to 0 (unlimited), based on
>> duplicating the config file from an older SM setup.  I reduced it to 32
>> when we saw some IB errors on standalone System B.  I'm afraid I don't
>> have documentation on what those problems were, but I don't recall
>> seeing the exact same symptoms.  I think it was MAD timeout error
>> messages that prompted me to change the MAXSMPS value.
> 
> You probably have the timeout errors in the osm log now as well.
> These errors are followed by the "unknown remote side" messages,
> which means that osm didn't get a response to some MADs (timeout).
> 
>> We will be able to test any fixes during a scheduled system downtime on
>> 7/24.  At that point, do you recommend trying MAXSMPS=4?
> 
> In ofed 1.3.1, the default conf file value for maxsmps should be 4.
> If you're using 0, then I'm pretty sure that this is what is causing
> the problems. Try 4 - it should be fine.

We went ahead and tried both MAXSMPS=4 and MAXSMPS=1.  The symptoms did
not improve with all the nodes booted. :(

For the record, here is exactly how opensm is running now:

# ps uaxw | grep open
root     23112  1.0  0.1 288432 16732 ?        Sl   18:10   0:07 /opt/ofed/1.3.1/sbin/opensm -maxsmps 1 -t 600 -f /var/run/osm/osm.log -R updn -g 0 --honor_guid2lid
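
To compare runs with different MAXSMPS values, it may help to count how often the relevant messages show up in the log. The following is only a rough sketch: it assumes the log lines contain the literal fragments mentioned in this thread ("unknown remote side" and "Dead end on path to LID"), and the helper name count_osm_msgs and the OSM_LOG variable are made up for illustration. The log path is the one passed to opensm with -f above.

```shell
#!/bin/sh
# Count the lines in an OpenSM log that contain a fixed message fragment.
# Usage: count_osm_msgs LOGFILE FRAGMENT   (hypothetical helper, not part of OpenSM)
count_osm_msgs() {
    # -c prints the number of matching lines, -F matches the fragment literally.
    # grep exits non-zero when nothing matches (it still prints 0), so the
    # "|| true" keeps the function usable in scripts running under "set -e".
    grep -c -F -- "$2" "$1" || true
}

# Log path as given to opensm with -f; override with OSM_LOG if it differs.
LOG=${OSM_LOG:-/var/run/osm/osm.log}

if [ -r "$LOG" ]; then
    # Message fragments are assumptions taken from the wording in this thread.
    for msg in "unknown remote side" "Dead end on path to LID"; do
        printf '%s: %s\n' "$msg" "$(count_osm_msgs "$LOG" "$msg")"
    done
fi
```

Running this after each restart of opensm (with a fresh or rotated log) would give a simple per-setting count to compare.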


>> If that doesn't work, what else should we be prepared to try, or what
>> other debugging information would be helpful to gather?
>>
>> For the record, other steps we are considering:
>> * Latest OpenSM code
>> * Upgrading firmware on all IB switches
>> * Changing topology to remove the 12X links (ugh!)
> 
> Hope that there won't be a need for other steps.

To try to take the 12X links out of the equation, we replaced the three
"Subtree C" spine switches with ones burned with the default 24*4X .INI
file, and removed 2 cables from each 3-cable bundle.  The "Subtree C"
Clos/Edge switches and Root/Aggregation switches were still configured
as 12X links, but according to "ibdiagnet", they negotiated down to 4X.
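
The negotiated width of an individual port can also be spot-checked outside of ibdiagnet. The sketch below assumes the "ibportstate" tool from infiniband-diags is installed and that its query output contains a "LinkWidthActive:" line padded with dots before the value; the helper name active_width is hypothetical, and the LID and port number in the usage comment are placeholders, not values from our fabric.

```shell
#!/bin/sh
# Extract the active link width from "ibportstate ... query" output on stdin.
# Assumed input line format:  LinkWidthActive:.................4X
active_width() {
    # Print only the LinkWidthActive line, with the label and dot padding removed.
    sed -n 's/^LinkWidthActive:\.*//p'
}

# Example usage (placeholder LID 12, port 1):
#   ibportstate 12 1 query | active_width
```

A port that is configured for 12X but reports 4X here would be consistent with what ibdiagnet showed.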

(I can furnish a nicer diagram of our IB tree off-list if anyone would
like to take a look at it.)


Any other things we can try?


Thanks again,
Nathan


