[ewg] PortRcvErrors in back-to-back IB connections

pavan.tc pavan.tc at gmail.com
Mon Aug 19 00:10:26 PDT 2013


>
> > perfquery output before ib_send_bw test:
> >
> > # Port counters: Lid 2 port 1
> > PortSelect:......................1
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............15814
> > LinkErrorRecoveryCounter:........255
> > LinkDownedCounter:...............0
> > PortRcvErrors:...................5403
> > PortRcvRemotePhysicalErrors:.....0
> > PortRcvSwitchRelayErrors:........0
> > PortXmitDiscards:................0
> > PortXmitConstraintErrors:........0
> > PortRcvConstraintErrors:.........0
> > CounterSelect2:..................0x00
> > LocalLinkIntegrityErrors:........0
> > ExcessiveBufferOverrunErrors:....0
> > VL15Dropped:.....................0
> > PortXmitData:....................2925583200
> > PortRcvData:.....................145715607
> > PortXmitPkts:....................10975597
> > PortRcvPkts:.....................8191613
> > PortXmitWait:....................7570
> >
> >
> > Run ib_send_bw test:
> > [root@vsanqa7 ~]# ib_send_bw
> > ------------------------------------------------------------------
> >                     Send BW Test
> >  Number of qps   : 1
> >  Connection type : RC
> >  RX depth        : 600
> >  CQ Moderation   : 50
> >  Mtu             : 2048B
> >  Link type       : IB
> >  Max inline data : 0B
> >  rdma_cm QPs   : OFF
> >  Data ex. method : Ethernet
> > ------------------------------------------------------------------
> >  local address: LID 0x02 QPN 0xde1b PSN 000000
> >  remote address: LID 0x01 QPN 0x64004a PSN 000000
> > ------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
> >  65536      1000           -nan               42.71
> >
> > Which is too low
> >
> > Perfquery after ib_send_bw test:
> >
> > # Port counters: Lid 2 port 1
> > PortSelect:......................1
> > CounterSelect:...................0x1400
> > SymbolErrorCounter:..............20750
>
> Are symbol errors increasing ?
>
>
Yes.

From the outputs above:

Before the ib_send_bw test, the symbol error counter reads as below:
> SymbolErrorCounter:..............15814

Post test, the following is the counter value:
> SymbolErrorCounter:..............20750
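
To make the before/after comparison less error-prone, here is a minimal
sketch that parses two perfquery dumps and prints only the counters that
changed. The helper names (parse_perfquery, counter_deltas) are hypothetical
and not part of infiniband-diags; the parsing assumes the
"Name:....value" line layout shown in the outputs above.

```python
import re

def parse_perfquery(text):
    """Parse 'Name:....value' lines of perfquery output into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        # Drop mail quote markers ("> > ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        m = re.match(r"^(\w+):\.+(\w+)$", line)
        if m:
            # Base 0 accepts both decimal (15814) and hex (0x1400) values.
            counters[m.group(1)] = int(m.group(2), 0)
    return counters

def counter_deltas(before, after):
    """Return {name: after - before} for counters present in both that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }

if __name__ == "__main__":
    before = parse_perfquery("> > SymbolErrorCounter:..............15814")
    after = parse_perfquery("> > SymbolErrorCounter:..............20750")
    for name, delta in counter_deltas(before, after).items():
        print(f"{name}: +{delta}")
```

For the two values quoted above, the symbol error counter grew by 4936
during the test run.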


> LinkErrorRecoveryCounter:........255
>
> Could it be that your link goes through error recovery as indicated by
> this counter being max'd out ?
>
> Can you clear this counter and see if it increments ?
>

I will try this the next time I hit the issue.
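
A counter that is pinned at its maximum cannot show further increments,
which is why it has to be cleared first. As a rough sketch, the check below
flags counters that have saturated. The bit widths in COUNTER_BITS are my
assumption, based on my reading of the PortCounters attribute layout, and
the function name is hypothetical.

```python
# Assumed field widths (in bits) for a few PortCounters fields.
COUNTER_BITS = {
    "SymbolErrorCounter": 16,
    "LinkErrorRecoveryCounter": 8,
    "LinkDownedCounter": 8,
    "PortRcvErrors": 16,
}

def saturated_counters(counters):
    """Return the sorted names of counters that sit at their maximum value."""
    return sorted(
        name
        for name, value in counters.items()
        if name in COUNTER_BITS and value >= (1 << COUNTER_BITS[name]) - 1
    )

if __name__ == "__main__":
    # Values taken from the perfquery output before the ib_send_bw test.
    sample = {
        "SymbolErrorCounter": 15814,
        "LinkErrorRecoveryCounter": 255,
        "LinkDownedCounter": 0,
        "PortRcvErrors": 5403,
    }
    print(saturated_counters(sample))
```

With the pre-test values, only LinkErrorRecoveryCounter is reported, which
matches the observation that it is max'd out at 255.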

[...]

> I suspect the link is retraining due to minor errors over threshold or
> major errors.
>
> Can you try some other known good cable ?
>

Will do that and report back if we continue to see issues. However, the fact
that the problems disappear every time I reload the modules suggests that some
software state may be getting corrupted, though I am only guessing. Also, it is
not just one pair of systems that sees this problem: we have witnessed it
between at least 3 pairs of systems, which reduces the likelihood of this
being a cable problem.

Pavan