[ofa-general] OpenSM "Dead end on path to LID"
Yevgeny Kliteynik
kliteyn at dev.mellanox.co.il
Fri Jul 18 14:54:14 PDT 2008
Hi Nathan,
Nathan Dauchy wrote:
> Hi Yevgeny, thanks for your response,
>
> Yevgeny Kliteynik wrote:
>> Hi Nathan,
>>
>> Nathan Dauchy wrote:
>>> Looking through osm.log a bit more, I also found a handful of errors
>>> like these:
>>>
>>> Jul 17 01:31:29 345329 [46E0A940] 0x01 ->
>>> __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
>>> node 0x000002c900000048(MT47396 Infiniscale-III Mellanox Technologies)
>>> port 14. Adding to light sweep sampling list
>>> Jul 17 01:31:29 345340 [46E0A940] 0x01 -> Directed Path Dump of 4 hop
>>> path:
>>> Path = 0,1,20,7,15
>>> Jul 17 01:31:29 345381 [46E0A940] 0x01 ->
>>> __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
>>> node 0x000002c900000049(MT47396 Infiniscale-III Mellanox Technologies)
>>> port 15. Adding to light sweep sampling list
>>> Jul 17 01:31:29 345390 [46E0A940] 0x01 -> Directed Path Dump of 3 hop
>>> path:
>>> Path = 0,1,22,11
>>>
>>> Does that indicate a problem as well?
>> This explains why ibdiagnet couldn't query port counters.
>> OpenSM couldn't discover what's behind these ports, so it
>> didn't configure routing tables for the undiscovered nodes.
>> Ibdiagnet could discover them. It queries port counters by
>> their LIDs, but switches don't have these LIDs in the
>> routing tables.
>
> Thanks, that makes sense.
>
>>> Unknown remote side for node 0x000002c900000049(MT47396
>>> Infiniscale-III Mellanox Technologies) port 15
>> What is the remote side of this port? HCA? Switch?
>> If it's HCA, does its host run some heavy application?
>
> The remote side of that port is a "spine" switch. The remote side of
> the other example error message is a "clos"/"edge" switch.
>
> I guess I should provide some info on our IB network topology, since it
> may be a little unique and contributing to the problem...
>
> The Infiniband network consists of 3 layers of switches. All switches
> are 24-port Flextronics DDR switches (FX-X4300??). We can refer to the
> layers as "Edge" (clos), "Spine", and "Root" (aggregation). The network
> is divided into 3 "subtrees", joined by the (2) Root Aggregation
> switches. We can refer to the subtrees as A, B, and C.
>
> Subtree A:
> 22 Edge switches
> 17 SDR Hosts per Edge switch
> 6 Spine switches
> Each Edge switch has an uplink to each Spine
> Each Spine switch has an uplink to each Root
>
> Subtree B:
> 22 Edge switches
> 12 DDR Hosts per Edge switch
> 9 Spine switches
> Each Edge switch has an uplink to each Spine
> Each Spine switch has an uplink to each Root
>
> Subtree C:
> 4 Edge switches
> Edge switches are configured with 9 ports as 3 logical 12x links
> Up to 15 SDR/DDR Hosts per Edge switch
> 3 Spine switches
> Spines are configured with all 24 ports as 8 logical 12x links
> Each Edge switch has an uplink (3 cables) to each Spine
> Each Spine switch has an uplink (3 cables) to each Root
>
> Aggregation:
> 2 Root switches
> Configured with 9 physical ports as 3 logical 12x ports
> 6 links to Subtree A (each)
> 9 links to Subtree B
> 3 links (9 cables) to Subtree C
>
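As a quick sanity check on the topology described above, here is a rough sketch that adds up the ports used on each 24-port switch type. It uses only the figures quoted above; the variable names are made up for illustration, and subtree C uses the "up to 15 hosts" upper bound.

```python
# Port budget per 24-port switch, using only the figures quoted above.
PORTS_PER_SWITCH = 24

# Edge switch: hosts + uplink cables to the spines of its subtree.
edge_ports = {
    "A": 17 + 6,      # 17 hosts + 1 cable to each of 6 spines
    "B": 12 + 9,      # 12 hosts + 1 cable to each of 9 spines
    "C": 15 + 3 * 3,  # up to 15 hosts + 3 cables to each of 3 spines
}

# Spine switch: downlink cables to edges + uplink cables to the 2 roots.
spine_ports = {
    "A": 22 + 2,
    "B": 22 + 2,
    "C": 4 * 3 + 2 * 3,  # 3 cables per logical 12x link
}

# Root switch: 6 cables to A, 9 cables to B, 9 cables (3 links) to C.
root_ports = 6 + 9 + 9

# Upper bound on host count across the three subtrees.
max_hosts = 22 * 17 + 22 * 12 + 4 * 15

for name, table in (("edge", edge_ports), ("spine", spine_ports)):
    for subtree, used in table.items():
        assert used <= PORTS_PER_SWITCH, (name, subtree, used)
        print(f"{name} {subtree}: {used}/{PORTS_PER_SWITCH} ports")
print(f"root: {root_ports}/{PORTS_PER_SWITCH} ports")
print(f"hosts (upper bound): {max_hosts}")
```

By this count the root switches and the subtree A and B spines have every port in use, so the fabric has no spare ports at those layers.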
> The Flextronics switches are 24-port DDR switches (FX-X4300??) using
> Mellanox Part MTS2400 (Silicon MT47396). They are burned with
> firmware version "fw-47396-1.0.0", using the "M24D0601A.INI" file, with
> changes only to the "[LinkWidthSupp]" section. We downloaded the
> firmware from: http://www.mellanox.com/support/switch_firmware_table.php
>
> So, the example "Unknown remote side" messages from above are:
> System B Edge -> System B Spine
> System A Spine -> System A Edge
>
>
>> I understand you already increased transaction time.
>> Please try limiting SMPs on the wire - in opensm.conf
>> file, set max_wire_smps to 1 (you probably have 4).
>> You can also run opensm with '-maxsmps 1' command line
>> argument.
>
> Interesting!
>
> I believe MAXSMPS was originally set to 0 (unlimited), based on
> duplicating the config file from an older SM setup. I reduced it to 32
> when we saw some IB errors on standalone System B. I'm afraid I don't
> have documentation on what those problems were, but I don't recall
> seeing the exact same symptoms. I think it was MAD timeout error
> messages that prompted me to change the MAXSMPS value.
You probably have the timeout errors in the osm log now as well.
These errors are followed by the "unknown remote side" messages,
which means that osm didn't get a response to some MADs (timeout).
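To see which ports are affected, something like the following could pull the node GUID and port number out of each ERR 0108 entry. This is only a sketch: the regular expression assumes the log layout shown in the excerpt quoted earlier in this thread, and the function name is made up for illustration.

```python
import re

# Assumes the ERR 0108 layout seen in the osm.log excerpt quoted in this
# thread; other OpenSM versions may format the message differently.
ERR_0108 = re.compile(
    r"ERR 0108: Unknown remote side for node "
    r"(?P<guid>0x[0-9a-fA-F]{16})\((?P<desc>[^)]*)\) "
    r"port (?P<port>\d+)"
)

def find_unknown_remote_sides(log_text):
    """Return a list of (guid, port, description) for each ERR 0108 entry."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        (m.group("guid"), int(m.group("port")), m.group("desc"))
        for m in ERR_0108.finditer(flat)
    ]

sample = """
Jul 17 01:31:29 345329 [46E0A940] 0x01 ->
__osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side for
node 0x000002c900000048(MT47396 Infiniscale-III Mellanox Technologies)
port 14. Adding to light sweep sampling list
"""

for guid, port, desc in find_unknown_remote_sides(sample):
    print(guid, port, desc)
```

Sorting the result by GUID and port makes it easier to tell whether the same few links keep showing up or the failures are spread across the fabric.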
> We will be able to test any fixes during a scheduled system downtime on
> 7/24. At that point, do you recommend trying MAXSMPS=4?
In OFED 1.3.1, the default conf file value for maxsmps should be 4.
If you're using 0, then I'm pretty sure that this is what is causing
the problems. Try 4 - it should be fine.
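If it helps to verify what a given conf file actually sets before the downtime, a small check along these lines could do it. This is a sketch only: it assumes the file uses one whitespace-separated "key value" pair per line with "#" comments, and the function name is made up for illustration.

```python
def read_max_wire_smps(conf_text):
    """Return the last max_wire_smps value found in conf_text, or None."""
    value = None
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        # Assumed format: "max_wire_smps <integer>" on a line of its own.
        if len(parts) == 2 and parts[0] == "max_wire_smps":
            value = int(parts[1])
    return value

sample_conf = """
# hypothetical excerpt of an opensm.conf file
max_wire_smps 4
"""

print(read_max_wire_smps(sample_conf))
```

A result of None means the key was not found in the text, in which case whatever default the running OpenSM uses applies.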
> (I assume the
> tradeoff of number of SMPs is discovery speed vs. stability. Yes?)
In short - yes. On big clusters there could be problems with
VL15 traffic overflowing VL15 buffers in the switches.
Increasing transaction time doesn't help, as the packets
are dropped, not delayed.
> If that doesn't work, what else should we be prepared to try, or what
> other debugging information would be helpful to gather?
>
> For the record, other steps we are considering:
> * Latest OpenSM code
> * Upgrading firmware on all IB switches
> * Changing topology to remove the 12X links (ugh!)
Hope that there won't be a need for other steps.
-- Yevgeny
>
> Thanks much,
> Nathan
>