[Users] HP BLc QLogic 4X QDR IB Switch oddness

Hal Rosenstock hal.rosenstock at gmail.com
Tue May 28 08:02:00 PDT 2013


On Tue, May 28, 2013 at 8:19 AM, Andrei Mikhailovsky <andrei at arhont.com> wrote:

> Hello guys,
>
> I've just rebooted one of the servers, which last time took over 10 hours
> to get an ACTIVE port state. It has been 39 minutes since the reboot and
> so far, no link ((
>
> I do see a rather large SymbolErrorCounter value, which doesn't seem to
> change with the reset, as you can see from the perfquery command below:
>
> perfquery -r 2 18
> # Port counters: Lid 2 port 18 (CapMask: 0x500)
> PortSelect:......................18
> CounterSelect:...................0x0000
> SymbolErrorCounter:..............65535
>

That's a maxed-out symbol error counter. IB counters stick at their maximum
rather than rolling over like IETF counters, so to see whether it's still
incrementing, it needs to be reset and then read again. You can use the -r
option of perfquery to do this.
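
As a rough illustration of the "sticky at max" point, here is a minimal
Python sketch that reads saved perfquery text and flags any counter sitting
at its ceiling. The helper names (parse_counters, saturated) are made up for
this example, and the bit widths in COUNTER_BITS are my assumption about the
classic PortCounters attribute; only the 16-bit limit of SymbolErrorCounter
is backed by the 65535 value in the output above.

```python
import re

# Assumed counter widths in bits. Only SymbolErrorCounter's 16-bit ceiling
# is evidenced by the perfquery output in this thread; the rest are an
# assumption and should be checked against the spec before relying on them.
COUNTER_BITS = {
    "SymbolErrorCounter": 16,
    "LinkErrorRecoveryCounter": 8,
    "LinkDownedCounter": 8,
    "PortRcvErrors": 16,
}

# Matches lines such as "SymbolErrorCounter:..............65535",
# optionally preceded by an email quote marker. Hex values are skipped.
LINE_RE = re.compile(r"^\s*>?\s*(\w+):\.+(\d+)\s*$")


def parse_counters(text):
    """Return {counter_name: value} for every decimal counter line."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def saturated(counters):
    """Names of known counters that have reached their maximum value."""
    return sorted(
        name
        for name, value in counters.items()
        if name in COUNTER_BITS and value >= (1 << COUNTER_BITS[name]) - 1
    )


SAMPLE = """\
PortSelect:......................18
CounterSelect:...................0x0000
SymbolErrorCounter:..............65535
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................0
"""

if __name__ == "__main__":
    print(saturated(parse_counters(SAMPLE)))
```

A counter flagged this way says nothing about whether errors are still
occurring; only a reset followed by a later read can show that.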


> LinkErrorRecoveryCounter:........0
> LinkDownedCounter:...............0
> PortRcvErrors:...................0
> PortRcvRemotePhysicalErrors:.....0
> PortRcvSwitchRelayErrors:........0
> PortXmitDiscards:................0
> PortXmitConstraintErrors:........0
> PortRcvConstraintErrors:.........0
> CounterSelect2:..................0x00
> LocalLinkIntegrityErrors:........0
> ExcessiveBufferOverrunErrors:....0
> VL15Dropped:.....................0
> PortXmitData:....................0
> PortRcvData:.....................0
> PortXmitPkts:....................0
> PortRcvPkts:.....................0
>
> The card's State is shown as DOWN, but the Physical State changes from 2:
> Polling to 4: PortConfigurationTraining to 3: Disabled to 16: <unknown>.
>
> OpenSM logs show the following:
>
> May 28 12:31:06 097874 [26274700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:67 (Mcast group deleted) from LID:1
> GID:ff12:601b:ffff::202
> May 28 12:31:06 098319 [27276700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11:
> method = SubnAdmSet, scope_state = 0x1, component mask =
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID:
> ff12:601b:ffff::16 from port 0x001175000079669a (arh-cloud2 HCA-1)
> May 28 12:31:08 936842 [2DA83700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:67 (Mcast group deleted) from LID:1
> GID:ff12:601b:ffff::1:ff79:669a
> May 28 12:31:15 955330 [2A27C700] 0x01 -> log_trap_info: Received Generic
> Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:2
> TID:0x0000008000000026
> May 28 12:31:15 955471 [2A27C700] 0x02 -> log_notice: Reporting Generic
> Notice type:1 num:128 (Link state change) from LID:2
> GID:fe80::6:6a00:f000:24d
> May 28 12:31:15 960718 [24A71700] 0x02 -> log_notice: Reporting Generic
> Notice type:3 num:65 (GID out of service) from LID:1
> GID:fe80::11:7500:79:669a
> May 28 12:31:15 960860 [24A71700] 0x02 -> drop_mgr_remove_port: Removed
> port with GUID:0x001175000079669a LID range [3, 3] of node:arh-cloud2 HCA-1
> May 28 12:31:15 960904 [24A71700] 0x01 -> osm_prtn_make_partitions:
> Partition configuration /etc/opensm/partitions.conf is not accessible (No
> such file or directory)
>
> which seem to be the entries from when the server went down for the
> reboot. I do not see anything on the opensm side to indicate that the
> server (arh-cloud2) is trying to negotiate the link.
>

So this is the link state change trap from when that port went DOWN. Since
the port never reaches LinkUp/Init, the switch never sends an additional
trap 128 for it.
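
If it helps to check for this over a longer window, here is a minimal Python
sketch that pulls the received trap 128 entries out of an opensm log. The
function name link_state_traps is made up for this example, and the pattern
assumes each log entry sits on a single line in the real file (the entries
quoted above are wrapped by the mail client). It only looks at the
log_trap_info lines, so the matching log_notice "Reporting" lines are not
counted twice.

```python
import re

# Assumed entry layout, based on the log lines quoted in this thread:
# "<Mon> <day> <hh:mm:ss> <usec> [<thread>] 0x<level> -> log_trap_info: ..."
TRAP_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\d+\s+\[\w+\]\s+0x\w+\s+->\s+"
    r"log_trap_info:.*\bnum:128\b.*\bfrom LID:(?P<lid>\d+)"
)


def link_state_traps(log_text):
    """Return (timestamp, reporting LID) for each received trap 128."""
    hits = []
    for line in log_text.splitlines():
        m = TRAP_RE.search(line)
        if m:
            hits.append((m.group("stamp"), int(m.group("lid"))))
    return hits


# Two entries from the log above, rejoined onto single lines.
SAMPLE = (
    "May 28 12:31:08 936842 [2DA83700] 0x02 -> log_notice: Reporting Generic "
    "Notice type:3 num:67 (Mcast group deleted) from LID:1 "
    "GID:ff12:601b:ffff::1:ff79:669a\n"
    "May 28 12:31:15 955330 [2A27C700] 0x01 -> log_trap_info: Received Generic "
    "Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:2 "
    "TID:0x0000008000000026\n"
)

if __name__ == "__main__":
    for stamp, lid in link_state_traps(SAMPLE):
        print(stamp, "trap 128 from LID", lid)
```

A single hit around the reboot time and nothing afterwards would be
consistent with the port going DOWN and never getting back to LinkUp/Init.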

-- Hal

>
> Andrei
> ------------------------------
> *From: *"Hal Rosenstock" <hal.rosenstock at gmail.com>
> *To: *"Andrei Mikhailovsky" <andrei at arhont.com>
> *Cc: *"John Valdes" <valdes at anl.gov>, users at lists.openfabrics.org
> *Sent: *Sunday, 26 May, 2013 2:47:55 PM
>
> *Subject: *Re: [Users] HP BLc QLogic 4X QDR IB Switch oddness
>
>
>
> On Sun, May 26, 2013 at 9:09 AM, Andrei Mikhailovsky <andrei at arhont.com> wrote:
>
>>
>>
>> ------------------------------
>> *From: *"John Valdes" <valdes at anl.gov>
>> *To: *"Andrei Mikhailovsky" <andrei at arhont.com>
>> *Cc: *users at lists.openfabrics.org
>> *Sent: *Saturday, 25 May, 2013 2:19:40 AM
>>
>> *Subject: *Re: [Users] HP BLc QLogic 4X QDR IB Switch oddness
>>
>> Andrei Mikhailovsky wrote:
>> > John Valdes wrote:
>> > > What's the physical topology of the IB network between the blade
>> > > servers and the switch?
>> >
>> > AM: I am not really sure. The servers do have the IB mezzanine card,
>> > and from what I've read it is a PCIe card. I am unsure how the blade
>> > servers are connected to the switch. I guess it's an internal
>> > HP/QLogic interconnect.
>>
>> Found some docs on Intel's website at:
>> http://www.intel.com/p/en_US/support/highlights/network/ts-fbs12100
>> From that, it looks like the topology is very simple; the switch
>> installs in a slot in the blade chassis, and it looks like it has 16
>> internal (through the backplane of the chassis) IB connections, one
>> to each blade server in the chassis, plus 16 external QSFP ports.
>>
>>
>> AM: yeah, that pretty much sums up the switch
>>
>>
>>
>>
>> I was thinking maybe there was something odd in the topology that
>> was causing the subnet manager to fail to negotiate link properly w/
>> the blade servers.  It doesn't sound like that's the case.
>>
>>
>> AM: I do not see any logs on the SM side when the port state changes. The
>> only log entries I see are from when the port becomes Active; there are
>> no errors before that.
>>
>>
>
> By port state, do you mean port state or port physical state? Note that
> the two are related, but port physical state can change without port
> state changing. In the opensm log, you should see trap 128 when link
> (port) state changes. If the port/link is constantly (re)negotiating and
> doesn't get to LinkUp (port physical state)/Init (port state), you won't
> see this in the log. If port state is truly changing, you should see this
> trap in the opensm log.
>
> -- Hal