[Users] HP BLc QLogic 4X QDR IB Switch oddness

Andrei Mikhailovsky andrei at arhont.com
Tue May 28 05:19:36 PDT 2013


Hello guys, 

I've just rebooted one of the servers, which last time took over 10 hours to reach an ACTIVE port state. It has been 39 minutes since the reboot and so far there is no link (( 

I do see a rather large SymbolErrorCounter value, which doesn't seem to change with a reset, as you can see from the perfquery output below: 

perfquery -r 2 18 
# Port counters: Lid 2 port 18 (CapMask: 0x500) 
PortSelect:......................18 
CounterSelect:...................0x0000 
SymbolErrorCounter:..............65535 
LinkErrorRecoveryCounter:........0 
LinkDownedCounter:...............0 
PortRcvErrors:...................0 
PortRcvRemotePhysicalErrors:.....0 
PortRcvSwitchRelayErrors:........0 
PortXmitDiscards:................0 
PortXmitConstraintErrors:........0 
PortRcvConstraintErrors:.........0 
CounterSelect2:..................0x00 
LocalLinkIntegrityErrors:........0 
ExcessiveBufferOverrunErrors:....0 
VL15Dropped:.....................0 
PortXmitData:....................0 
PortRcvData:.....................0 
PortXmitPkts:....................0 
PortRcvPkts:.....................0 
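
For what it's worth, SymbolErrorCounter in the PortCounters attribute is a 16-bit saturating counter, so a reading of 65535 (0xFFFF) usually means it has pegged at its maximum rather than being an exact count. A minimal sketch (in Python; the sample lines and the 0xFFFF threshold are my assumptions, not anything from the thread) of spotting pegged counters in perfquery-style output:

```python
import re

# Sample perfquery-style lines (taken from the output above).
SAMPLE = """\
SymbolErrorCounter:..............65535
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
"""

def parse_counters(text):
    """Parse 'Name:......value' lines into a {name: int} dict."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\.*\s*(\d+)$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def saturated(counters, limit=0xFFFF):
    """Return names of counters pegged at (or above) the 16-bit maximum."""
    return [name for name, v in counters.items() if v >= limit]

counters = parse_counters(SAMPLE)
print(saturated(counters))
```

A pegged counter like this only tells you that at least 65535 symbol errors occurred since the last successful reset; it says nothing about the current rate.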

The card's State is shown as DOWN, but the Physical State cycles through 2: Polling, 4: PortConfigurationTraining, 3: Disabled and 16: <unknown>. 
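
For reference, the Physical State values the tools print come from the 4-bit PortPhysicalState field of PortInfo, so a reported value of 16 is outside the encodable range and is likely a driver/tool artifact rather than a real wire state. A quick sketch of the standard encodings (my own table, assembled from the IBTA spec rather than from this thread):

```python
# PortInfo PortPhysicalState encodings (IBTA spec; 4-bit field, so 0-15).
PHYS_STATES = {
    1: "Sleep",
    2: "Polling",
    3: "Disabled",
    4: "PortConfigurationTraining",
    5: "LinkUp",
    6: "LinkErrorRecovery",
    7: "PhyTest",
}

def describe_phys_state(value: int) -> str:
    """Render 'value: Name' the way ibstat-style output does."""
    return f"{value}: {PHYS_STATES.get(value, '<unknown>')}"

# The sequence observed on the card above:
for v in (2, 4, 3, 16):
    print(describe_phys_state(v))
```

A healthy link should settle at 5: LinkUp; cycling through Polling and Training indefinitely is the signature of a link that keeps failing physical negotiation.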

OpenSM logs show the following: 

May 28 12:31:06 097874 [26274700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:1 GID:ff12:601b:ffff::202 
May 28 12:31:06 098319 [27276700] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: ff12:601b:ffff::16 from port 0x001175000079669a (arh-cloud2 HCA-1) 
May 28 12:31:08 936842 [2DA83700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:67 (Mcast group deleted) from LID:1 GID:ff12:601b:ffff::1:ff79:669a 
May 28 12:31:15 955330 [2A27C700] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:2 TID:0x0000008000000026 
May 28 12:31:15 955471 [2A27C700] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:2 GID:fe80::6:6a00:f000:24d 
May 28 12:31:15 960718 [24A71700] 0x02 -> log_notice: Reporting Generic Notice type:3 num:65 (GID out of service) from LID:1 GID:fe80::11:7500:79:669a 
May 28 12:31:15 960860 [24A71700] 0x02 -> drop_mgr_remove_port: Removed port with GUID:0x001175000079669a LID range [3, 3] of node:arh-cloud2 HCA-1 
May 28 12:31:15 960904 [24A71700] 0x01 -> osm_prtn_make_partitions: Partition configuration /etc/opensm/partitions.conf is not accessible (No such file or directory) 

These seem to be entries corresponding to the server going down for the reboot. I do not see anything on the OpenSM side indicating that the server (arh-cloud2) is trying to negotiate the link. 

Andrei 
----- Original Message -----

From: "Hal Rosenstock" <hal.rosenstock at gmail.com> 
To: "Andrei Mikhailovsky" <andrei at arhont.com> 
Cc: "John Valdes" <valdes at anl.gov>, users at lists.openfabrics.org 
Sent: Sunday, 26 May, 2013 2:47:55 PM 
Subject: Re: [Users] HP BLc QLogic 4X QDR IB Switch oddness 

On Sun, May 26, 2013 at 9:09 AM, Andrei Mikhailovsky < andrei at arhont.com > wrote: 

From: "John Valdes" < valdes at anl.gov > 
To: "Andrei Mikhailovsky" < andrei at arhont.com > 
Cc: users at lists.openfabrics.org 
Sent: Saturday, 25 May, 2013 2:19:40 AM 

Subject: Re: [Users] HP BLc QLogic 4X QDR IB Switch oddness 


Andrei Mikhailovsky wrote: 
> John Valdes wrote: 
> > What's the physical topology of the IB network between the blade 
> > servers and the switch? 
> 
> AM: I am not really sure. The servers do have the IB mezzanine card, and from what I've read it is a PCIe card. I am unsure how the blade servers are connected to the switch; I guess it's an internal HP/QLogic interconnect. 

Found some docs on Intel's website at: 
http://www.intel.com/p/en_US/support/highlights/network/ts-fbs12100 
From that, it looks like the topology is very simple; the switch 
installs in a slot in the blade chassis, and it looks like it has 16 
internal (through the backplane of the chassis) IB connections, one 
to each blade server in the chassis, plus 16 external QSFP ports. 


AM: yeah, that pretty much sums up the switch 

I was thinking maybe there was something odd in the topology that 
was causing the subnet manager to fail to negotiate link properly w/ 
the blade servers. It doesn't sound like that's the case. 


AM: I do not see any logs on the SM side when the port state changes. The only log entries I see are when the port becomes Active; there are no errors before that. 

By port state, do you mean port state or port physical state? Note that there is some relationship between the two, but the port physical state can change without the port state changing. In the opensm log, you should see trap 128 when the link (port) state changes. If the port/link is constantly (re)negotiating and never gets to LinkUp (port physical state)/Init (port state), you won't see this in the log. If the port state is truly changing, you should see this trap in the opensm log. 
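
Hal's distinction can be summarized in a small model (my own illustrative sketch, not OpenSM code): trap 128 is generated on *logical* port state transitions (Down/Init/Armed/Active), so a link that keeps retraining at the physical layer without ever reaching LinkUp leaves the logical state at Down and produces no trap in the opensm log:

```python
# Logical port states (trap 128 tracks transitions between these).
PORT_STATES = {1: "Down", 2: "Initialize", 3: "Armed", 4: "Active"}
# Physical states (can churn without affecting the logical state).
PHYS_STATES = {2: "Polling", 3: "Disabled",
               4: "PortConfigurationTraining", 5: "LinkUp"}

def expect_trap_128(old_port_state: int, new_port_state: int) -> bool:
    """Trap 128 is reported only on a logical port state change."""
    return old_port_state != new_port_state

# Physical layer retraining while the port stays Down: no trap expected.
print(expect_trap_128(1, 1))
# Port finally reaches LinkUp and moves Down -> Init: trap expected.
print(expect_trap_128(1, 2))
```

This matches what Andrei observes: a port cycling through Polling/Training never changes logical state, so OpenSM stays silent.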

-- Hal 

