[Users] PortXmitWait?

Hal Rosenstock hal.rosenstock at gmail.com
Mon Mar 17 12:07:23 PDT 2014


Status 0x1c on SwitchInfo means that the switch SMA is rejecting a Set
request from the SM. Can you reconfigure the SM not to use MulticastFDBTop
by changing/adding the following in the config file:

# Use SwitchInfo:MulticastFDBTop if advertised in PortInfo:CapabilityMask
use_mfttop FALSE

and then SIGHUP'ing OpenSM so it picks up the change. I'm hoping that will remove these errors from the log.
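
Something like this (assuming the stock config path; adjust for your install):

echo "use_mfttop FALSE" >> /etc/opensm/opensm.conf   # or edit the existing line
kill -HUP $(pidof opensm)                            # OpenSM resweeps on SIGHUP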

What do smpquery -D 0,1,33,30,28 pi 0 and smpquery -D 0,1,33,30,28 si report?
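
The CapMask line in the pi output is the interesting part here (whether the
MulticastFDBTop capability is advertised; how it is decoded depends on your
diags version). A quick filter, using the path from your log:

smpquery -D 0,1,33,30,28 pi 0 | grep -i capmask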

Which switch type is it, so I can correlate it with the FW version?

I'm not sure what effect those "sibling" links have on the fat tree
routing.



On Thu, Mar 13, 2014 at 10:09 AM, Florent Parent <florent.parent at calculquebec.ca> wrote:

>
>
>
> On Wed, Mar 12, 2014 at 11:50 PM, Hal Rosenstock <hal.rosenstock at gmail.com> wrote:
>
>> Since you didn't mention PortXmitDiscards, does that mean they are 0?
>> Assuming so, PortXmitWait indicates there is some congestion, but it has
>> not risen to the level of dropping packets. It's the rate of increase of
>> the XmitWait counter that matters, rather than the absolute number, so if
>> you want to chase this, focus on the most congested ports.
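>>
>> For example, to get a feel for the rate on a suspect port (LID 44 and
>> port 1 here are just placeholders):
>>
>> perfquery 44 1 | grep XmitWait    # sample once
>> sleep 10
>> perfquery 44 1 | grep XmitWait    # sample again; the delta is the rate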
>>
>
> Yes, most are 0. Two or three ports have XmitDiscards, but those point to
> nodes in maintenance with known issues.
>
>
>>
>> Since the old tool didn't report XmitWait counters, it's hard to know
>> whether this is the same as before or not unless you did this manually.
>>
>> Was the routing previously fat tree?
>>
>
> Yes
> Here's a PDF of the physical topology:
> https://dl.dropboxusercontent.com/u/2292440/CQ-UL_IB_topology.pdf
>
>
>> Are there any other fat tree related log messages in the OpenSM log?
>>
>
> Nothing specific to Fat Tree. Some links going up or down (node
> maintenance). But there are a lot of MAD errors from a SwitchInfo request:
>
> Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
> Received MAD with error status = 0x1C
>                         SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
>                         Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28
>
> 80 of these messages occur periodically, filling the logs. smpquery on the
> paths shows that these all point to the Sun QNEM switches (80 I4 chips). I
> did find a reference in the Linux RDMA list about this:
> http://permalink.gmane.org/gmane.linux.drivers.rdma/7988. I assume that
> the switch is not reporting its capabilities correctly. Can this have an
> impact?
>
>> Is there any fat tree configuration of compute and/or I/O nodes?
>>
>
> We're specifying the root_guid and cn_guid files in opensm.conf:
> root_guid_file /etc/rdma/guids.txt
> cn_guid_file /etc/rdma/cn-guids.txt
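> (Both files are plain lists of port GUIDs, one hex GUID per line, along
> the lines of 0x0002c902004858f0 -- that value is just a made-up example.)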
>
> We are not using the I/O nodes configuration
>
>
>> Any idea what the traffic pattern is? Are you running MPI?
>>
>
> We have Lustre file systems over IB and MPI jobs sharing the same IB
> network. When I gathered the counters, most of the compute nodes were busy.
>
> Thanks
> Florent
>
>
>>
>> -- Hal
>>
>>
>> On Wed, Mar 12, 2014 at 8:17 PM, Florent Parent <florent.parent at calculquebec.ca> wrote:
>>
>>>
>>> Hello IB users,
>>>
>>> We recently migrated our opensm from 3.2.6 to 3.3.17. In this upgrade,
>>> we moved to CentOS 6.5 with the stock RDMA stack and
>>> infiniband-diags 1.5.12-5. Routing is FatTree:
>>> General fabric topology info
>>> ============================
>>> - FatTree rank (roots to leaf switches): 3
>>> - FatTree max switch rank: 2
>>> - Fabric has 966 CAs, 966 CA ports (603 of them CNs), 186 switches
>>> - Fabric has 36 switches at rank 0 (roots)
>>> - Fabric has 64 switches at rank 1
>>> - Fabric has 86 switches at rank 2 (86 of them leafs)
>>>
>>> Now to the question: ibqueryerrors 1.5.12 is reporting high PortXmitWait
>>> values throughout the fabric. We did not see this counter before (it was
>>> not reported by the older ibqueryerrors.pl).
>>>
>>> To give an idea of the scale of the counters, here's a capture of
>>> ibqueryerrors --data on one specific I4 switch, 10 seconds after clearing
>>> the counters (-k -K):
>>>
>>> GUID 0x21283a83b30050 port 4:  PortXmitWait == 2932676  PortXmitData == 90419517 (344.923MB)  PortRcvData == 1526963011 (5.688GB)
>>> GUID 0x21283a83b30050 port 5:  PortXmitWait == 3110105  PortXmitData == 509580912 (1.898GB)  PortRcvData == 13622 (53.211KB)
>>> GUID 0x21283a83b30050 port 6:  PortXmitWait == 8696397  PortXmitData == 480870802 (1.791GB)  PortRcvData == 17067 (66.668KB)
>>> GUID 0x21283a83b30050 port 7:  PortXmitWait == 1129568  PortXmitData == 126483825 (482.497MB)  PortRcvData == 24973385 (95.266MB)
>>> GUID 0x21283a83b30050 port 8:  PortXmitWait == 29021  PortXmitData == 19444902 (74.176MB)  PortRcvData == 84447725 (322.143MB)
>>> GUID 0x21283a83b30050 port 9:  PortXmitWait == 4945130  PortXmitData == 161911244 (617.642MB)  PortRcvData == 27161 (106.098KB)
>>> GUID 0x21283a83b30050 port 10:  PortXmitWait == 16795  PortXmitData == 35572510 (135.698MB)  PortRcvData == 681174731 (2.538GB)
>>> ... (this goes on for every active port)
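>>>
>>> For reference, the capture sequence was essentially:
>>>
>>> ibqueryerrors -k -K     # read and clear the port counters
>>> sleep 10
>>> ibqueryerrors --data    # include data counters in the report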
>>>
>>> We are not observing any failures, so I suspect I just need help
>>> interpreting these numbers. Do I need to be worried?
>>>
>>> Cheers,
>>> Florent
>>>
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>>
>>>
>>
>