[Users] PortXmitWait?
Hal Rosenstock
hal.rosenstock at gmail.com
Thu Mar 13 04:36:59 PDT 2014
Some causes of congestion are: a slow receiver, many-to-one communication,
and a "poor" fat tree topology.
On the last item, are all links in the subnet the same speed and width?
(A quick way to survey this is sketched below.) How many links are used
going up the fat tree to the next rank?
Are all end nodes connected to rank 2, or are any connected to a higher rank?
Are there any "combined" nodes ? By this I mean, some device which is more
than just single switch or CA. If so, what are they and where do they live
in the topology ?
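
If it helps, here is a rough, untested sketch of one way to survey link
widths and speeds using iblinkinfo from infiniband-diags. The output
format differs between versions, so the regex here is an assumption that
may need adjusting for your release:

import re
import subprocess
from collections import Counter

# iblinkinfo typically describes links like "( 4X 10.0 Gbps Active/ LinkUp)"
out = subprocess.check_output(["iblinkinfo"]).decode()
combos = Counter(re.findall(r"(\d+X\s+\d+(?:\.\d+)?\s+Gbps)", out))

# Anything other than a single width/speed combination is worth a look.
for combo, count in combos.most_common():
    print("%6d links at %s" % (count, combo))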
On Wed, Mar 12, 2014 at 11:50 PM, Hal Rosenstock
<hal.rosenstock at gmail.com> wrote:
> Since you didn't mention PortXmitDiscards, does that mean they are 0?
> Assuming so, PortXmitWait indicates there is some congestion, but it has
> not risen to the level of dropping packets. It's the rate of increase of
> the XmitWait counter that matters rather than the absolute number, so if
> you want to chase this, the focus should be on the most congested ports.
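>
> A rough, untested sketch of one way to measure that rate (it assumes the
> "GUID ... port N: PortXmitWait == ..." line format from the capture
> quoted below, so the regex may need adjusting for your ibqueryerrors
> version):
>
> import re
> import subprocess
> import time
>
> # Matches lines like: GUID 0x21283a83b30050 port 4: PortXmitWait == 2932676
> PAT = re.compile(r"GUID (0x[0-9a-fA-F]+) port (\d+):[^\n]*?PortXmitWait == (\d+)")
>
> def snapshot():
>     out = subprocess.check_output(["ibqueryerrors", "--data"]).decode()
>     return dict(((g, p), int(w)) for g, p, w in PAT.findall(out))
>
> INTERVAL = 10  # seconds between samples
> first = snapshot()
> time.sleep(INTERVAL)
> second = snapshot()
>
> # Rank ports by XmitWait growth per second; the fastest-growing ports
> # are the ones worth chasing.
> ranked = sorted(second, key=lambda k: second[k] - first.get(k, 0), reverse=True)
> for key in ranked[:20]:
>     rate = (second[key] - first.get(key, 0)) / float(INTERVAL)
>     print("%s port %s: %.0f XmitWait/sec" % (key[0], key[1], rate))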
>
> Since the old tool didn't report XmitWait counters, it's hard to know
> whether this is the same as before unless you were checking it manually.
>
> Was the routing previously fat tree? Are there any other fat-tree-related
> log messages in the OpenSM log? Is there any fat tree configuration of
> compute and/or I/O nodes?
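>
> For reference, the ftree engine takes those hints in opensm.conf along
> these lines (a sketch only; the file paths are examples, not your actual
> configuration):
>
> # opensm.conf excerpt -- illustrative values
> routing_engine ftree
> # Root switch GUIDs, one per line (omit to let ftree auto-detect roots)
> root_guid_file /etc/opensm/root_guids.conf
> # Compute node (CN) port GUIDs
> cn_guid_file /etc/opensm/cn_guids.conf
> # I/O node port GUIDs
> io_guid_file /etc/opensm/io_guids.conf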
>
> Any idea what the traffic pattern is? Are you running MPI?
>
> -- Hal
>
>
> On Wed, Mar 12, 2014 at 8:17 PM, Florent Parent
> <florent.parent at calculquebec.ca> wrote:
>
>>
>> Hello IB users,
>>
>> We recently migrated our opensm from 3.2.6 to 3.3.17. As part of this
>> upgrade, we moved to CentOS 6.5 with the stock RDMA stack and
>> infiniband-diags 1.5.12-5. Routing is FatTree:
>> General fabric topology info
>> ============================
>> - FatTree rank (roots to leaf switches): 3
>> - FatTree max switch rank: 2
>> - Fabric has 966 CAs, 966 CA ports (603 of them CNs), 186 switches
>> - Fabric has 36 switches at rank 0 (roots)
>> - Fabric has 64 switches at rank 1
>> - Fabric has 86 switches at rank 2 (86 of them leafs)
>>
>> Now to the question: ibqueryerrors 1.5.12 is reporting high PortXmitWait
>> values throughout the fabric. We did not see this counter before (it was
>> not reported by the older ibqueryerrors.pl).
>>
>> To give an idea of the scale of the counters, here's a capture of
>> ibqueryerrors --data on one specific I4 switch, 10 seconds after clearing
>> the counters (-k -K):
>>
>> GUID 0x21283a83b30050 port 4: PortXmitWait == 2932676 PortXmitData ==
>> 90419517 (344.923MB) PortRcvData == 1526963011 (5.688GB)
>> GUID 0x21283a83b30050 port 5: PortXmitWait == 3110105 PortXmitData ==
>> 509580912 (1.898GB) PortRcvData == 13622 (53.211KB)
>> GUID 0x21283a83b30050 port 6: PortXmitWait == 8696397 PortXmitData ==
>> 480870802 (1.791GB) PortRcvData == 17067 (66.668KB)
>> GUID 0x21283a83b30050 port 7: PortXmitWait == 1129568 PortXmitData ==
>> 126483825 (482.497MB) PortRcvData == 24973385 (95.266MB)
>> GUID 0x21283a83b30050 port 8: PortXmitWait == 29021 PortXmitData ==
>> 19444902 (74.176MB) PortRcvData == 84447725 (322.143MB)
>> GUID 0x21283a83b30050 port 9: PortXmitWait == 4945130 PortXmitData ==
>> 161911244 (617.642MB) PortRcvData == 27161 (106.098KB)
>> GUID 0x21283a83b30050 port 10: PortXmitWait == 16795 PortXmitData ==
>> 35572510 (135.698MB) PortRcvData == 681174731 (2.538GB)
>> ... (this goes on for every active port)
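>>
>> As a sanity check on units: PortXmitData and PortRcvData count 4-byte
>> words, which is how the byte figures in parentheses are derived. A
>> quick check in Python, using port 4 above:
>>
>> # PortXmitData counts 4-byte words, so bytes = words * 4
>> xmit_words = 90419517  # PortXmitData for port 4 above
>> print(xmit_words * 4 / float(2 ** 20))  # ~344.923 MiB, as reported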
>>
>> We are not observing any failures, so I suspect I need help interpreting
>> these numbers. Do I need to be worried?
>>
>> Cheers,
>> Florent
>>
>>