[Users] IPoIB failing

Coulter, Susan K skc at lanl.gov
Mon May 5 15:23:59 PDT 2014


We have many nodes performing this same task (basically acting as a router/media converter) that are running on ConnectX3 cards.

Some of them are CentOS 6.2, with a slightly older kernel.
Some of them are RHEL with the same kernel you have.

None of them are exhibiting this behavior.

Not sure if it helps, other than to say that ConnectX3 and that kernel should work together just fine.
The version of CentOS should not matter, as the error is coming from the kernel.

Since the fix is to down and up the interface, it seems that the kernel modules may be getting loaded out of order, or something along those lines.

Once you down/up the interface, does it ever happen again on the same node?
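
If it helps answer that from the logs, here is a rough sketch that tallies the "failed send event" messages per host and interface. It assumes standard syslog lines shaped exactly like the one quoted below; the regex and the function names (parse_send_errors, count_by_node) are mine, made up for illustration, and other kernels may format the message differently.

```python
import re
from collections import Counter

# Pattern modeled on the single log line quoted in this thread (assumption:
# other kernel versions or syslog setups may format the message differently).
SEND_ERR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<iface>ib\d+): failed send event "
    r"\(status=(?P<status>\d+), wrid=(?P<wrid>\d+) vend_err (?P<vend_err>\w+)\)"
)


def parse_send_errors(lines):
    """Return one dict per matching log line; fields are kept as strings."""
    events = []
    for line in lines:
        match = SEND_ERR.search(line)
        if match:
            events.append(match.groupdict())
    return events


def count_by_node(events):
    """Count events per (host, interface) pair."""
    return Counter((e["host"], e["iface"]) for e in events)


if __name__ == "__main__":
    # In practice, pass an open log file instead of this sample list.
    sample = [
        "Apr 24 15:39:09 ibrtr-ct2 kernel: ib0: failed send event "
        "(status=1, wrid=13 vend_err 69)",
    ]
    for (host, iface), n in count_by_node(parse_send_errors(sample)).items():
        print(f"{host} {iface}: {n} failed send event(s)")
```

Comparing the timestamps of those events against the times the interface was bounced should show whether a node fails again after a down/up.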


On Apr 24, 2014, at 5:35 PM, Michael Robbert <mrobbert at mines.edu> wrote:

> I have a host that is running CentOS 6.4 with kernel 2.6.32-358.el6.x86_64 and the OFED stack that shipped with that kernel. It has a Mellanox ConnectX-3 HCA which is configured with IPoIB. The only thing this host is doing is routing IP packets between the IPoIB interface and a bonded interface containing two 10Gbps Ethernet ports (Solarflare NIC, if that matters). The node has been running fine for some time now while I've been preparing the systems on either side. Recently we put this into production, allowing real users to run jobs, and 3 times in the past week the IPoIB interface has become unresponsive around the time that we see this message:
> 
> Apr 24 15:39:09 ibrtr-ct2 kernel: ib0: failed send event (status=1, wrid=13 vend_err 69)
> 
> In each case the status is 1, the wrid varies, and the vend_err is 69. No other errors are seen, and it appears that lower-level IB functions still work fine: ibstatus is active and ibhosts sees the host. Also, ibchecknet doesn't show any problems. I have been able to fix the problem by downing the interface and bringing it back up with ifdown and ifup.
> 
> Has anybody seen these symptoms, or does anybody know specifically what the error message means? Any thoughts on where the problem lies or how to find that out?
> My next planned step is to upgrade the kernel to a later CentOS 6.4 kernel. If that doesn't help I may try replacing the HCA with an old Infinihost III card. We have another one of these boxes in another building that still has its old Infinihost III card and isn't seeing this problem and it is seeing the other side of all of this traffic.
> 
> Thanks,
> Mike Robbert
> 
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/users

====================================

Susan Coulter
HPC-3 Network/Infrastructure
505-667-8425
Increase the Peace...
An eye for an eye leaves the whole world blind
====================================



