[Users] PortXmitWait?

Florent Parent florent.parent at calculquebec.ca
Thu Mar 13 07:09:24 PDT 2014


On Wed, Mar 12, 2014 at 11:50 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:

> By the fact that you didn't mention PortXmitDiscards, does it mean that
> these are 0 ? Assuming so, PortXmitWait is indicating there is some
> congestion but it has not risen to the level of dropping packets. It's the
> rate of increase of the XmitWait counter that's important rather than the
> absolute number so if you want to chase this, the focus should be on the
> ports most congested.
>

Yes, most are 0. 2-3 ports have XmitDiscards, but these are pointing to
nodes in maintenance with known issues.
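
To act on the "rate of increase" advice, here is a rough Python sketch that ranks ports by XmitWait ticks per second. It assumes the ibqueryerrors --data output has been captured as text with one port per line (unwrapped), and that the time since the counters were cleared is known (10 seconds in my capture). The function name xmit_wait_rates and the regular expression are my own, not part of infiniband-diags.

```python
import re

# Matches lines in the shape shown in this thread, e.g.
# "GUID 0x21283a83b30050 port 6:  PortXmitWait == 8696397  PortXmitData == ..."
LINE_RE = re.compile(
    r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):.*?PortXmitWait\s*==\s*(\d+)"
)


def xmit_wait_rates(text, interval_s):
    """Return [(guid, port, ticks_per_second)], most congested port first."""
    rates = []
    for m in LINE_RE.finditer(text):
        guid, port, wait = m.group(1), int(m.group(2)), int(m.group(3))
        rates.append((guid, port, wait / interval_s))
    return sorted(rates, key=lambda r: r[2], reverse=True)


# Three of the ports from the capture quoted further down in this message.
sample = """\
GUID 0x21283a83b30050 port 4:  PortXmitWait == 2932676  PortXmitData == 90419517 (344.923MB)
GUID 0x21283a83b30050 port 6:  PortXmitWait == 8696397  PortXmitData == 480870802 (1.791GB)
GUID 0x21283a83b30050 port 8:  PortXmitWait == 29021  PortXmitData == 19444902 (74.176MB)
"""

ranked = xmit_wait_rates(sample, interval_s=10)
for guid, port, rate in ranked:
    print(f"{guid} port {port}: {rate:.0f} XmitWait ticks/s")
```

Running this over two or more captures taken at known intervals, rather than a single one, would give a better picture of which ports stay congested.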


>
> Since the old tool didn't report XmitWait counters, it's hard to know
> whether this is the same as before or not unless you did this manually.
>
> Was the routing previously fat tree ?
>

Yes
Here's a PDF of the physical topology:
https://dl.dropboxusercontent.com/u/2292440/CQ-UL_IB_topology.pdf


> Are there any other fat tree related log messages in the OpenSM log ?
>

Nothing specific to Fat Tree. Some links going up or down (node
maintenance). But there are a lot of MAD errors from a SwitchInfo request:

Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
                        SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x73c86e46
                        Initial path: 0,1,33,30,28 Return path: 0,10,32,13,28

80 of these messages occur periodically, filling the logs. smpquery on the
paths shows that these all point to the Sun QNEM switches (80 I4 chips). I
did find a reference in the linux RDMA list about this:
http://permalink.gmane.org/gmane.linux.drivers.rdma/7988. I assume that the
switch is not reporting its capabilities correctly. Can this have an impact?
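
For anyone wanting to repeat the check, here is a hedged Python sketch of how the ERR 3111 entries could be reduced to a list of distinct initial paths, with the status word decoded. The bit layout (bit 0 busy, bit 1 redirect, bits 2-4 an invalid-field code) is my reading of the common MAD status field in the InfiniBand specification, so treat the decoding as an assumption. The helper names are hypothetical, and the printed smpquery line only shows the kind of directed-route query (-D) I ran by hand.

```python
import re

# Invalid-field codes carried in bits 2-4 of the MAD common status field
# (assumed layout; other values are treated as reserved).
INVALID_FIELD_CODES = {
    0: "no invalid fields",
    1: "bad version",
    2: "method not supported",
    3: "method/attribute combination not supported",
    7: "invalid value in attribute or attribute modifier",
}


def decode_mad_status(status):
    """Split a MAD status word into its busy, redirect and invalid-field parts."""
    return {
        "busy": bool(status & 0x1),
        "redirect": bool(status & 0x2),
        "invalid_field": INVALID_FIELD_CODES.get((status >> 2) & 0x7, "reserved"),
    }


# DOTALL lets one match span the wrapped lines of a single log entry.
ENTRY_RE = re.compile(
    r"error status = (0x[0-9A-Fa-f]+).*?Initial path:\s*([\d,]+)", re.DOTALL
)


def failing_paths(log_text):
    """Map each distinct initial directed route to the MAD status seen on it."""
    return {m.group(2): int(m.group(1), 16) for m in ENTRY_RE.finditer(log_text)}


log = """\
Mar 13 09:50:04 909147 [4FAFC700] 0x01 -> log_rcv_cb_error: ERR 3111:
Received MAD with error status = 0x1C
                        SubnGetResp(SwitchInfo), attr_mod 0x0, TID
0x73c86e46
                        Initial path: 0,1,33,30,28 Return path:
0,10,32,13,28
"""

paths = failing_paths(log)
for path, status in paths.items():
    print(path, decode_mad_status(status))
    # Directed-route query one could run by hand to identify the switch.
    print(f"smpquery -D {path} nodedesc")
```

Under that reading, 0x1C has neither the busy nor the redirect bit set and carries invalid-field code 7, which would be consistent with the switch rejecting a field in the SwitchInfo request.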

> Is there any fat tree configuration of compute and/or I/O nodes ?
>

We're specifying the root_guid and cn_guid files in opensm.conf:
root_guid_file /etc/rdma/guids.txt
cn_guid_file /etc/rdma/cn-guids.txt

We are not using the I/O node configuration.


> Any idea on what is the traffic pattern ? Are you running MPI ?
>

We have Lustre file systems over IB and MPI jobs sharing the same IB
network. When I gathered the counters, most of the compute nodes were busy.

Thanks
Florent


>
> -- Hal
>
>
> On Wed, Mar 12, 2014 at 8:17 PM, Florent Parent <
> florent.parent at calculquebec.ca> wrote:
>
>>
>> Hello IB users,
>>
>> We recently migrated our opensm from 3.2.6 to 3.3.17. In this upgrade, we
>> moved to CentOS6.5 with the stock RDMA and infiniband-diags_1.5.12-5., and
>> running opensm 3.3.17. Routing is FatTree:
>> General fabric topology info
>> ============================
>> - FatTree rank (roots to leaf switches): 3
>> - FatTree max switch rank: 2
>> - Fabric has 966 CAs, 966 CA ports (603 of them CNs), 186 switches
>> - Fabric has 36 switches at rank 0 (roots)
>> - Fabric has 64 switches at rank 1
>> - Fabric has 86 switches at rank 2 (86 of them leafs)
>>
>> Now to the question: ibqueryerrors 1.5.12 is reporting high PortXmitWait
>> values throughout the fabric. We did not see this counter before (it was
>> not reported by the older ibqueryerrors.pl)
>>
>> To give an idea of the scale of the counters, here's a capture of
>> ibqueryerrors --data on one specific I4 switch, 10 seconds after clearing
>> the counters (-k -K):
>>
>> GUID 0x21283a83b30050 port 4:  PortXmitWait == 2932676  PortXmitData == 90419517 (344.923MB)  PortRcvData == 1526963011 (5.688GB)
>> GUID 0x21283a83b30050 port 5:  PortXmitWait == 3110105  PortXmitData == 509580912 (1.898GB)  PortRcvData == 13622 (53.211KB)
>> GUID 0x21283a83b30050 port 6:  PortXmitWait == 8696397  PortXmitData == 480870802 (1.791GB)  PortRcvData == 17067 (66.668KB)
>> GUID 0x21283a83b30050 port 7:  PortXmitWait == 1129568  PortXmitData == 126483825 (482.497MB)  PortRcvData == 24973385 (95.266MB)
>> GUID 0x21283a83b30050 port 8:  PortXmitWait == 29021  PortXmitData == 19444902 (74.176MB)  PortRcvData == 84447725 (322.143MB)
>> GUID 0x21283a83b30050 port 9:  PortXmitWait == 4945130  PortXmitData == 161911244 (617.642MB)  PortRcvData == 27161 (106.098KB)
>> GUID 0x21283a83b30050 port 10:  PortXmitWait == 16795  PortXmitData == 35572510 (135.698MB)  PortRcvData == 681174731 (2.538GB)
>> ... (this goes on for every active port)
>>
>> We are not observing any failures, so I suspect that I need help to
>> interpret these numbers. Do I need to be worried?
>>
>> Cheers,
>> Florent
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
>>
>>
>