[Users] IPoIB failing
Michael Robbert
mrobbert at mines.edu
Thu Apr 24 16:35:31 PDT 2014
I have a host that is running CentOS 6.4 with kernel
2.6.32-358.el6.x86_64 and the OFED stack that shipped with that kernel.
It has a Mellanox ConnectX-3 HCA which is configured with IPoIB. The
only thing this host is doing is routing IP packets to/from the IPoIB
interface to a bonded interface containing two 10Gbps Ethernet ports
(Solarflare NIC if that matters). The node has been running fine for
some time now while I've been preparing the systems on either side.
Recently we put this into production allowing real users to run jobs and
3 times in the past week the IPoIB interface has become unresponsive
around the time that we see this message:
Apr 24 15:39:09 ibrtr-ct2 kernel: ib0: failed send event (status=1, wrid=13 vend_err 69)
In each case the status is always 1, the wrid varies, and the vend_err
is always 69. No other errors are seen, and lower-level IB functions
still appear to work fine: ibstatus reports the port as active, ibhosts
sees the host, and ibchecknet doesn't show any problems. I have been
able to fix the problem by taking the interface down and bringing it
back up with ifdown and ifup.
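To compare the occurrences across log files, the fields of the message
quoted above could be pulled out with a small script. This is only a
rough sketch: the pattern is derived from the single example line in
this post, the names SEND_EVENT_RE and parse_send_event are made up for
illustration, and vend_err is kept as text because the post does not
say which base it is printed in.

```python
import re

# Pattern built from the one example line in this post; other kernel
# versions may format the message differently (assumption).
SEND_EVENT_RE = re.compile(
    r"(?P<iface>\w+): failed send event "
    r"\(status=(?P<status>\d+),\s*wrid=(?P<wrid>\d+)\s+"
    r"vend_err\s+(?P<vend_err>[0-9a-fA-F]+)\)"
)


def parse_send_event(line):
    """Return the fields of a 'failed send event' message, or None.

    status and wrid are converted to int; vend_err stays a string
    because its number base is not stated in the post.
    """
    match = SEND_EVENT_RE.search(line)
    if match is None:
        return None
    return {
        "iface": match.group("iface"),
        "status": int(match.group("status")),
        "wrid": int(match.group("wrid")),
        "vend_err": match.group("vend_err"),
    }


if __name__ == "__main__":
    example = (
        "Apr 24 15:39:09 ibrtr-ct2 kernel: ib0: failed send event "
        "(status=1, wrid=13 vend_err 69)"
    )
    print(parse_send_event(example))
```

Feeding each line of /var/log/messages through parse_send_event and
collecting the non-None results would show whether status and vend_err
really stay constant while wrid varies.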
Has anybody seen these symptoms, or does anybody know specifically what
the error message means? Any thoughts on where the problem lies or how
to find that out?
My next planned step is to upgrade the kernel to a later CentOS 6.4
kernel. If that doesn't help I may try replacing the HCA with an old
InfiniHost III card. We have another one of these boxes in another
building that still has its old InfiniHost III card. It isn't seeing
this problem, even though it sees the other side of all of this traffic.
Thanks,
Mike Robbert