[Users] IPoIB failing

Tue May 6 08:12:51 PDT 2014

Susan,
We discovered that the problem is related to datagram mode. When we 
switched this host and all the other hosts on that fabric to connected 
mode, the problem stopped. I had chosen datagram mode because I couldn't 
produce any throughput tests showing connected mode to be faster, and I 
had read somewhere that connected mode had some stability issues. 
Datagram mode seemed to be the OS default, so I assumed it was set that 
way for a reason. I am guessing that you are using connected mode. If 
so, why did you choose it? I'm also curious what kind of throughput I 
should expect.
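
For anyone wanting to check which mode an interface is in, here is a 
minimal sketch. It assumes the standard Linux IPoIB sysfs attribute 
/sys/class/net/<iface>/mode, which reads "datagram" or "connected". The 
function names and the sysfs_root parameter are my own, added only so 
the code can be tried against a scratch directory. Writing the attribute 
needs root, does not persist across reboots, and the interface may need 
to be bounced afterwards.

```python
from pathlib import Path

VALID_MODES = ("datagram", "connected")


def get_ipoib_mode(iface="ib0", sysfs_root="/sys/class/net"):
    """Return the IPoIB transport mode reported for iface."""
    # The kernel exposes the mode as a one-line text attribute.
    return (Path(sysfs_root) / iface / "mode").read_text().strip()


def set_ipoib_mode(mode, iface="ib0", sysfs_root="/sys/class/net"):
    """Write a new transport mode for iface (needs root on a real host)."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s" % (VALID_MODES,))
    (Path(sysfs_root) / iface / "mode").write_text(mode + "\n")


if __name__ == "__main__":
    print("ib0 mode:", get_ipoib_mode("ib0"))
```

Running it as a script on the router just prints the current mode of 
ib0; calling set_ipoib_mode("connected") as root is the hypothetical 
equivalent of the switch described above.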

Thanks,
Mike

On 5/5/14, 4:23 PM, Coulter, Susan K wrote:
>
> We have many nodes performing this same task (basically acting as a router/media converter) that are running on ConnectX3 cards.
>
> Some of them are CentOS 6.2, with a slightly older kernel.
> Some of them are RHEL with the same kernel you have.
>
> None of them are exhibiting this behavior.
>
> Not sure if it helps, other than to say that ConnectX3 and that kernel should work together just fine.
> The version of CentOS should not matter, as the error is coming from the kernel.
>
> Since the fix is to down and up the interface - it seems that the kernel modules may be getting inserted out of order or something.
>
> Once you down/up the interface, does it ever happen again on the same node?
>
>
> On Apr 24, 2014, at 5:35 PM, Michael Robbert <mrobbert at mines.edu> wrote:
>
>> I have a host that is running CentOS 6.4 with kernel 2.6.32-358.el6.x86_64 and the OFED stack that shipped with that kernel. It has a Mellanox ConnectX-3 HCA which is configured with IPoIB. The only thing this host is doing is routing IP packets between the IPoIB interface and a bonded interface containing two 10Gbps Ethernet ports (Solarflare NIC, if that matters). The node had been running fine for some time while I was preparing the systems on either side. Recently we put it into production, allowing real users to run jobs, and 3 times in the past week the IPoIB interface has become unresponsive around the time that we see this message:
>>
>> Apr 24 15:39:09 ibrtr-ct2 kernel: ib0: failed send event (status=1, wrid=13 vend_err 69)
>>
>> In each case the status is always 1, the wrid varies, and the vend_err is always 69. No other errors are seen, and lower-level IB functions appear to still work fine: ibstatus shows the port as active, ibhosts sees the host, and ibchecknet doesn't show any problems. I have been able to fix the problem by taking the interface down and bringing it back up with ifdown and ifup.
>>
>> Has anybody seen these symptoms, or does anybody know specifically what the error message means? Any thoughts on where the problem lies or how to find that out?
>> My next planned step is to upgrade to a later CentOS 6.4 kernel. If that doesn't help, I may try replacing the HCA with an old InfiniHost III card. We have another one of these boxes in another building that still has its old InfiniHost III card, and it isn't seeing this problem even though it sees the other side of all of this traffic.
>>
>> Thanks,
>> Mike Robbert
>>
>
> ====================================
>
> Susan Coulter
> HPC-3 Network/Infrastructure
> 505-667-8425
> Increase the Peace...
> An eye for an eye leaves the whole world blind
> ====================================
>
