[Users] PortXmitWait?

Hal Rosenstock hal.rosenstock at gmail.com
Thu Mar 13 04:36:59 PDT 2014


Some causes of congestion are: a slow receiver, many-to-one communication,
and a "poor" fat tree topology.

On the last item, are all links in the subnet the same speed and width? How
many links are used going up the fat tree to the next rank?
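
As a rough uniformity check, something like the following (a sketch: it
assumes iblinkinfo from infiniband-diags is available and that every link
should come up 4X at 10.0 Gbps per lane; adjust the patterns to whatever
your fabric should be running):

    # print only link lines that are NOT 4X / 10.0 Gbps
    iblinkinfo | grep "==(" | grep -v "4X 10.0"

Any line this prints is a degraded or mismatched link worth a look.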

Are all end nodes connected to rank 2, or are any connected to a higher rank?

Are there any "combined" nodes? By this I mean a device which is more than
just a single switch or CA. If so, what are they, and where do they live in
the topology?
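
One quick way to take inventory (a sketch; it relies on ibnetdiscover's
topology output, where each node record begins with its type):

    # count node records by type and compare against expectations
    ibnetdiscover | egrep "^(Switch|Ca)" | awk '{print $1}' | sort | uniq -c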


On Wed, Mar 12, 2014 at 11:50 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:

> Since you didn't mention PortXmitDiscards, does that mean they are 0?
> Assuming so, PortXmitWait indicates there is some congestion, but it has
> not risen to the level of dropping packets. It's the rate of increase of
> the XmitWait counter that matters rather than the absolute number, so if
> you want to chase this, focus on the most congested ports.
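>
> A quick way to measure that rate and rank the ports (a sketch: the -k,
> -K, and --data options are the ones from your capture below; the awk
> field positions assume that output format and may need adjusting):
>
>   ibqueryerrors -k -K      # clear error and performance counters
>   sleep 10                 # let the counters accumulate
>   # top 10 ports by PortXmitWait growth over the interval
>   ibqueryerrors --data | grep PortXmitWait | \
>     awk '{print $7, $2, $3, $4}' | sort -rn | head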
>
> Since the old tool didn't report XmitWait counters, it's hard to know
> whether this is the same as before unless you were checking the counters
> manually.
>
> Was the routing previously fat tree? Are there any other fat-tree-related
> log messages in the OpenSM log? Is there any fat tree configuration of
> compute and/or I/O nodes?
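>
> For reference, the ftree-related pieces of an opensm.conf look roughly
> like this (the file paths below are examples, not your configuration):
>
>   routing_engine ftree
>   # optional: pin the tree roots and mark compute/I/O nodes
>   root_guid_file /etc/opensm/root_guids.conf
>   cn_guid_file /etc/opensm/cn_guids.conf
>   io_guid_file /etc/opensm/io_guids.conf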
>
> Any idea what the traffic pattern is? Are you running MPI?
>
> -- Hal
>
>
> On Wed, Mar 12, 2014 at 8:17 PM, Florent Parent <
> florent.parent at calculquebec.ca> wrote:
>
>>
>> Hello IB users,
>>
>> We recently migrated our opensm from 3.2.6 to 3.3.17. In this upgrade, we
>> moved to CentOS 6.5 with the stock RDMA stack and infiniband-diags
>> 1.5.12-5. Routing is FatTree:
>> General fabric topology info
>> ============================
>> - FatTree rank (roots to leaf switches): 3
>> - FatTree max switch rank: 2
>> - Fabric has 966 CAs, 966 CA ports (603 of them CNs), 186 switches
>> - Fabric has 36 switches at rank 0 (roots)
>> - Fabric has 64 switches at rank 1
>> - Fabric has 86 switches at rank 2 (86 of them leafs)
>>
>> Now to the question: ibqueryerrors 1.5.12 is reporting high PortXmitWait
>> values throughout the fabric. We did not see this counter before (it was
>> not reported by the older ibqueryerrors.pl).
>>
>> To give an idea of the scale of the counters, here's a capture of
>> ibqueryerrors --data on one specific I4 switch, 10 seconds after clearing
>> the counters (-k -K):
>>
>> GUID 0x21283a83b30050 port 4:  PortXmitWait == 2932676  PortXmitData ==
>> 90419517 (344.923MB)  PortRcvData == 1526963011 (5.688GB)
>> GUID 0x21283a83b30050 port 5:  PortXmitWait == 3110105  PortXmitData ==
>> 509580912 (1.898GB)  PortRcvData == 13622 (53.211KB)
>> GUID 0x21283a83b30050 port 6:  PortXmitWait == 8696397  PortXmitData ==
>> 480870802 (1.791GB)  PortRcvData == 17067 (66.668KB)
>> GUID 0x21283a83b30050 port 7:  PortXmitWait == 1129568  PortXmitData ==
>> 126483825 (482.497MB)  PortRcvData == 24973385 (95.266MB)
>> GUID 0x21283a83b30050 port 8:  PortXmitWait == 29021  PortXmitData ==
>> 19444902 (74.176MB)  PortRcvData == 84447725 (322.143MB)
>> GUID 0x21283a83b30050 port 9:  PortXmitWait == 4945130  PortXmitData ==
>> 161911244 (617.642MB)  PortRcvData == 27161 (106.098KB)
>> GUID 0x21283a83b30050 port 10:  PortXmitWait == 16795  PortXmitData ==
>> 35572510 (135.698MB)  PortRcvData == 681174731 (2.538GB)
>> ... (this goes on for every active port)
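>>
>> (If I understand the units correctly, the data counters are in 32-bit
>> words, which matches the figures in parentheses, e.g. for port 4 above:
>>
>>   90419517 words * 4 bytes/word = 361678068 bytes ~= 344.923 MB
>>
>> while PortXmitWait counts ticks during which the port had data queued
>> but could not transmit, so it has no direct byte interpretation.)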
>>
>> We are not observing any failures, so I suspect I need help interpreting
>> these numbers. Do I need to be worried?
>>
>> Cheers,
>> Florent