<div dir="ltr"><div><br></div>Hello IB users,<br><div><div><br></div><div>We recently migrated our opensm from 3.2.6 to 3.3.17. As part of this upgrade, we moved to CentOS 6.5 with the stock RDMA and infiniband-diags_1.5.12-5., and we are now running opensm 3.3.17. Routing is FatTree:</div>
<div><div>General fabric topology info</div><div>============================</div><div>- FatTree rank (roots to leaf switches): 3</div><div>- FatTree max switch rank: 2</div><div>- Fabric has 966 CAs, 966 CA ports (603 of them CNs), 186 switches</div>
<div>- Fabric has 36 switches at rank 0 (roots)</div><div>- Fabric has 64 switches at rank 1</div><div>- Fabric has 86 switches at rank 2 (86 of them leafs)</div></div><div><br></div><div>Now to the question: ibqueryerrors 1.5.12 is reporting high PortXmitWait values throughout the fabric. We did not see this counter before, because the older <a href="http://ibqueryerrors.pl">ibqueryerrors.pl</a> did not report it.</div>
<div><br></div><div>To give an idea of the scale of the counters, here's a capture of ibqueryerrors --data on one specific I4 switch, 10 seconds after clearing the counters (-k -K):</div><div><br></div><div>GUID 0x21283a83b30050 port 4: PortXmitWait == 2932676 PortXmitData == 90419517 (344.923MB) PortRcvData == 1526963011 (5.688GB)</div>
<div>GUID 0x21283a83b30050 port 5: PortXmitWait == 3110105 PortXmitData == 509580912 (1.898GB) PortRcvData == 13622 (53.211KB)</div><div>GUID 0x21283a83b30050 port 6: PortXmitWait == 8696397 PortXmitData == 480870802 (1.791GB) PortRcvData == 17067 (66.668KB)</div>
<div>GUID 0x21283a83b30050 port 7: PortXmitWait == 1129568 PortXmitData == 126483825 (482.497MB) PortRcvData == 24973385 (95.266MB)</div><div>GUID 0x21283a83b30050 port 8: PortXmitWait == 29021 PortXmitData == 19444902 (74.176MB) PortRcvData == 84447725 (322.143MB)</div>
<div>GUID 0x21283a83b30050 port 9: PortXmitWait == 4945130 PortXmitData == 161911244 (617.642MB) PortRcvData == 27161 (106.098KB)</div><div>GUID 0x21283a83b30050 port 10: PortXmitWait == 16795 PortXmitData == 35572510 (135.698MB) PortRcvData == 681174731 (2.538GB)</div>
<div>... (this goes on for every active port)</div><div><br></div><div>We are not observing any failures, so I suspect I simply need help interpreting these numbers. Do I need to be worried?</div><div><br></div><div>Cheers,</div>
<div>Florent</div></div><div><br></div></div>
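<div>P.S. To make the capture above easier to compare across ports, here is a minimal Python sketch that parses the ibqueryerrors --data lines and ranks ports by PortXmitWait per second. The function names (parse_ports, rank_by_wait) are hypothetical, the 10-second interval is the one used for the capture, and the factor of 4 bytes per PortXmitData unit is inferred from the sizes ibqueryerrors prints next to each counter. PortXmitWait is counted in device-dependent ticks, so the rates are only meant for comparing ports with each other, not as absolute times.</div>

```python
import re

# Matches the counter lines printed by "ibqueryerrors --data" in the capture.
LINE_RE = re.compile(
    r"GUID (?P<guid>0x[0-9a-fA-F]+) port (?P<port>\d+): "
    r"PortXmitWait == (?P<wait>\d+) "
    r"PortXmitData == (?P<xmit>\d+)"
)

# Two lines copied from the capture, used as sample input.
SAMPLE = """\
GUID 0x21283a83b30050 port 4: PortXmitWait == 2932676 PortXmitData == 90419517 (344.923MB) PortRcvData == 1526963011 (5.688GB)
GUID 0x21283a83b30050 port 6: PortXmitWait == 8696397 PortXmitData == 480870802 (1.791GB) PortRcvData == 17067 (66.668KB)
"""


def parse_ports(text):
    """Extract GUID, port, PortXmitWait and transmitted MiB from each line."""
    ports = []
    for m in LINE_RE.finditer(text):
        # Assumption: PortXmitData counts 4-byte units, as the printed sizes imply.
        xmit_bytes = int(m.group("xmit")) * 4
        ports.append({
            "guid": m.group("guid"),
            "port": int(m.group("port")),
            "xmit_wait": int(m.group("wait")),
            "xmit_mib": xmit_bytes / 2**20,
        })
    return ports


def rank_by_wait(ports, interval_s):
    """Return ports sorted by PortXmitWait ticks per second, highest first."""
    ranked = []
    for p in ports:
        entry = dict(p)
        entry["wait_per_s"] = p["xmit_wait"] / interval_s
        ranked.append(entry)
    ranked.sort(key=lambda e: e["wait_per_s"], reverse=True)
    return ranked


if __name__ == "__main__":
    # 10.0 is the time between clearing the counters and the capture.
    for e in rank_by_wait(parse_ports(SAMPLE), 10.0):
        print(f'{e["guid"]} port {e["port"]}: '
              f'{e["wait_per_s"]:.1f} wait ticks/s, {e["xmit_mib"]:.3f} MiB sent')
```

<div>Feeding it the full ibqueryerrors output instead of SAMPLE should list the ports that accumulate wait ticks fastest at the top.</div>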