[ofa-general] ibcheckerrors give error 5691 within OFED 1.3.1

Hal Rosenstock hal.rosenstock at gmail.com
Wed Sep 17 07:46:54 PDT 2008


Hi,

On Wed, Sep 17, 2008 at 4:25 AM, Wen Hao Wang <wangwhao at cn.ibm.com> wrote:
> Hi all:
>
> I had one IB cluster with eight IBM HS21 blades, running a mix of RHEL 5.2
> Server and SLES 10 SP2. All of them were connected to one IB switch. opensm
> was running as subnet manager on one blade. Command ibcheckerrors finished
> smoothly. Last week I got another eight IBM LS21 blades connected to another
> IB switch. But after I connected the two switches and turned on all the IB
> adapters on the new blades, ibcheckerrors gave this error message:
>
> [root@gaia-07 ~]# ibcheckerrors
> #warn: counter RcvErrors = 5691 (threshold 10) lid 3 port 1
> Error check on lid 3 (gaia-07 HCA-1) port 1: FAILED
>
> ## Summary: 19 nodes checked, 0 bad nodes found
> ## 46 ports checked, 1 ports have errors beyond threshold
> [root@gaia-07 ~]# ibv_devinfo
> hca_id: mlx4_0
>         fw_ver:                 2.3.000
>         node_guid:              0002:c903:0001:3370
>         sys_image_guid:         0002:c903:0001:3373
>         vendor_id:              0x02c9
>         vendor_part_id:         25418
>         hw_ver:                 0xA0
>         board_id:               IBM08A0000001
>         phys_port_cnt:          2
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         15
>                         port_lid:       3
>                         port_lmc:       0x00
>
>                 port:   2
>                         state:          PORT_DOWN (1)
>                         max_mtu:        2048 (4)
>                         active_mtu:     2048 (4)
>                         sm_lid:         0
>                         port_lid:       0
>                         port_lmc:       0x00
> [root@gaia-07 ~]# ibcheckport 3 1
> [root@gaia-07 ~]# echo $?
> 0
>
> I have disabled the embedded subnet managers on both IB switches. The issue
> persists even after I move the subnet manager to another machine. ib0 on
> gaia-07 can communicate with the other machines. All installed IB adapters
> are ConnectX 4x SDR. Both switches are Topspin switches. Can anyone give
> some advice about this issue? Thanks in advance!

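"counter RcvErrors = 5691" reports the value of the PortRcvErrors
counter of the PortCounters attribute, which here is above the
ibcheckerrors threshold of 10. Per IBA section 16.1.3.5, it counts
packets received on the port that contain an error, and it includes: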
• Local physical errors (ICRC, VCRC, LPCRC, and all physical
errors that cause entry into the BAD PACKET or BAD PACKET
DISCARD states of the packet receiver state machine)
• Malformed data packet errors (LVer, length, VL)
• Malformed link packet errors (operand, length, VL)
• Packets discarded due to buffer overrun

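The counter is cumulative, so those errors may have occurred when you
plugged in the additional nodes. You might want to clear the counters
first (for example with ibclearerrors), then rerun ibcheckerrors after
a while and see whether the value keeps increasing or stays stable.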
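
If it helps to compare two ibcheckerrors runs mechanically, here is a
minimal Python sketch. It assumes the "#warn:" line format shown in your
output above; the function names parse_warnings and classify are made up
for illustration and are not part of any OFED tool. Note that a counter
that stays under the threshold produces no "#warn:" line at all, so the
sketch treats a counter missing from the first run as 0.

```python
import re

# Matches warning lines in the format shown in the report, e.g.
# "#warn: counter RcvErrors = 5691 (threshold 10) lid 3 port 1"
WARN_RE = re.compile(
    r"#warn: counter (?P<name>\w+) = (?P<value>\d+)\s+"
    r"\(threshold (?P<threshold>\d+)\)\s+lid (?P<lid>\d+) port (?P<port>\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter_name): value} for every #warn line."""
    counters = {}
    for line in text.splitlines():
        m = WARN_RE.search(line)  # search() tolerates a leading "> " quote
        if m:
            key = (int(m["lid"]), int(m["port"]), m["name"])
            counters[key] = int(m["value"])
    return counters


def classify(before, after):
    """Label each counter reported in the later run.

    Assumption: a counter absent from the earlier run was below the
    threshold and is treated as 0 for the comparison.
    """
    result = {}
    for key, new_value in after.items():
        old_value = before.get(key, 0)
        result[key] = "increasing" if new_value > old_value else "stable"
    return result
```

Feed it the saved text of two runs taken some time apart, e.g.
classify(parse_warnings(run1), parse_warnings(run2)). A counter that
is "increasing" between runs points at an ongoing link problem rather
than a one-time event from plugging in the new nodes.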

-- Hal

>
> Wen Hao Wang
> Email: wangwhao at cn.ibm.com