[ewg] PortRcvErrors in back-to-back IB connections
pavan.tc
pavan.tc at gmail.com
Mon Aug 19 00:10:26 PDT 2013
>
> > perfquery output before ib_send_bw test:
> >
> > # Port counters: Lid 2 port 1
> > PortSelect:......................1
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............15814
> > LinkErrorRecoveryCounter:........255
> > LinkDownedCounter:...............0
> > PortRcvErrors:...................5403
> > PortRcvRemotePhysicalErrors:.....0
> > PortRcvSwitchRelayErrors:........0
> > PortXmitDiscards:................0
> > PortXmitConstraintErrors:........0
> > PortRcvConstraintErrors:.........0
> > CounterSelect2:..................0x00
> > LocalLinkIntegrityErrors:........0
> > ExcessiveBufferOverrunErrors:....0
> > VL15Dropped:.....................0
> > PortXmitData:....................2925583200
> > PortRcvData:.....................145715607
> > PortXmitPkts:....................10975597
> > PortRcvPkts:.....................8191613
> > PortXmitWait:....................7570
> >
> >
> > Run ib_send_bw test:
> > [root@vsanqa7 ~]# ib_send_bw
> > ------------------------------------------------------------------
> > Send BW Test
> > Number of qps : 1
> > Connection type : RC
> > RX depth : 600
> > CQ Moderation : 50
> > Mtu : 2048B
> > Link type : IB
> > Max inline data : 0B
> > rdma_cm QPs : OFF
> > Data ex. method : Ethernet
> > ------------------------------------------------------------------
> > local address: LID 0x02 QPN 0xde1b PSN 000000
> > remote address: LID 0x01 QPN 0x64004a PSN 000000
> > ------------------------------------------------------------------
> > #bytes #iterations BW peak[MB/sec] BW average[MB/sec]
> > 65536 1000 -nan 42.71
> >
> > Which is too low
> >
> > Perfquery after ib_send_bw test:
> >
> > # Port counters: Lid 2 port 1
> > PortSelect:......................1
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............20750
>
> Are symbol errors increasing ?
>
>
Yes.
From the outputs above:
Before the ib_send_bw test, the symbol error counter reads as below:
> SymbolErrorCounter:..............15814
Post test, the following is the counter value:
> SymbolErrorCounter:..............20750
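
As a rough illustration of the before/after comparison above, here is a minimal Python sketch that parses perfquery-style lines and computes per-counter deltas. The helper names (parse_counters, counter_deltas) are hypothetical and not part of any InfiniBand tool; the sample text is copied from the outputs quoted in this thread, and hex fields such as CounterSelect are skipped on purpose.

```python
import re

# Matches perfquery lines such as "SymbolErrorCounter:..............15814",
# with or without leading "> " quote markers from the mail thread.
# Lines carrying hex values (e.g. CounterSelect) do not match and are skipped.
LINE_RE = re.compile(r"^[>\s]*(\w+):\.*(\d+)\s*$")


def parse_counters(text):
    """Return {counter_name: int} for every decimal counter line in text."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def counter_deltas(before, after):
    """Return after - before for counters present in both snapshots."""
    return {name: after[name] - before[name]
            for name in after if name in before}


# Sample lines taken from the perfquery outputs quoted in this thread.
before = parse_counters("""
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............15814
> > LinkErrorRecoveryCounter:........255
> > PortRcvErrors:...................5403
""")
after = parse_counters("""
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............20750
""")

deltas = counter_deltas(before, after)
print(deltas)
```

With the values quoted here, the symbol error counter grows by 20750 - 15814 = 4936 over the course of the ib_send_bw run.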
> LinkErrorRecoveryCounter:........255
>
> Could it be that your link goes through error recovery as indicated by
> this counter being max'd out ?
>
> Can you clear this counter and see if it increments ?
>
I will try this the next time I hit the issue.
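
For context on why the clear matters: a counter that is already at its maximum cannot show further increments until it is reset. Below is a minimal Python sketch of a saturation check. The field widths are an assumption based on my reading of the IBA PortCounters attribute (8 bits for LinkErrorRecoveryCounter and LinkDownedCounter, 16 bits for SymbolErrorCounter and PortRcvErrors), and the function name is hypothetical.

```python
# Counter widths in bits. These are assumptions taken from the IBA
# PortCounters attribute layout, not values reported by perfquery itself.
ASSUMED_WIDTH_BITS = {
    "SymbolErrorCounter": 16,
    "LinkErrorRecoveryCounter": 8,
    "LinkDownedCounter": 8,
    "PortRcvErrors": 16,
}


def saturated_counters(counters):
    """Return names of counters sitting at their assumed maximum value."""
    result = []
    for name, value in counters.items():
        width = ASSUMED_WIDTH_BITS.get(name)
        if width is not None and value >= (1 << width) - 1:
            result.append(name)
    return result


# Values from the perfquery output taken before the ib_send_bw test.
snapshot = {
    "SymbolErrorCounter": 15814,
    "LinkErrorRecoveryCounter": 255,
    "LinkDownedCounter": 0,
    "PortRcvErrors": 5403,
}

stuck = saturated_counters(snapshot)
print(stuck)
```

Under those assumed widths, only LinkErrorRecoveryCounter is pinned at its ceiling in the snapshot above, so it is the one that needs a clear before any new recovery events become visible.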
[...]
> I suspect the link is retraining due to minor errors over threshold or
> major errors.
>
> Can you try some other known good cable ?
>
Will do that and will report if we continue to see issues. But the fact
that the problems disappear every time I reload the modules suggests it
might be some software state that is getting messed up, but I am only
guessing. Also, it is not just one pair of systems that is seeing this
problem. We have witnessed it between at least 3 pairs of systems, which
reduces the likelihood of this being a cable problem.
Pavan