[Users] IPoIB on CentOS 6.5

Trey Dockendorf treydock at tamu.edu
Wed Mar 18 09:39:56 PDT 2015


The issue I faced was caused by a faulty IB fabric switch.  Our Voltaire
switch would periodically crash, which took down the subnet manager, since
the SM is hosted on the switch (Voltaire GridDirector).  The switch began
doing this once every few weeks, so I stopped using IB on the non-compute
servers that host virtual machine storage, as I did not want the
unreliable switch to bring down our VMs (this includes our login nodes).

The symptom that made it clear the switch was the cause of the IPoIB
issues was that a rebooted system would fail to get an LID, and its ibstat
output would just show the port state as "Initializing" until the switch
was rebooted.
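
A minimal sketch of how that symptom could be checked for automatically,
assuming Python and the usual ibstat text layout (per-port "State:" and
"Base lid:" lines).  The function name and the sample text are
hypothetical illustrations, not output captured from a real host.

```python
import re

# Hypothetical sample in the usual ibstat layout; not captured from a real host.
SAMPLE_IBSTAT = """\
CA 'mlx4_0'
\tCA type: MT4099
\tNumber of ports: 2
\tPort 1:
\t\tState: Initializing
\t\tPhysical state: LinkUp
\t\tBase lid: 0
\t\tSM lid: 0
\tPort 2:
\t\tState: Active
\t\tPhysical state: LinkUp
\t\tBase lid: 12
\t\tSM lid: 1
"""


def ports_waiting_for_sm(ibstat_text):
    """Return (ca, port) pairs that are stuck in Initializing with no LID."""
    stuck = []
    ca = port = state = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # New adapter section: reset per-port tracking.
            ca, port, state = m.group(1), None, None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port, state = int(m.group(1)), None
            continue
        if line.startswith("State:"):
            # "Physical state:" does not match because of the prefix check.
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Base lid:") and port is not None:
            lid = int(line.split(":", 1)[1].strip())
            if state == "Initializing" and lid == 0:
                stuck.append((ca, port))
    return stuck


if __name__ == "__main__":
    for ca, port in ports_waiting_for_sm(SAMPLE_IBSTAT):
        print(f"{ca} port {port}: Initializing, no LID assigned (is the SM up?)")
```

On a live node the text could come from something like
subprocess.run(["ibstat"], capture_output=True, text=True).stdout instead
of the sample string.  A port that stays in this state after a reboot
suggests no subnet manager is answering, which is what we saw.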

The other issue that coincided with this was that one of the fabric boards
on our switch failed, leaving some of the chassis blades "orphaned" and
unable to communicate with systems on the blades managed by the working
fabric boards.  This issue is somewhat specific to our switch, as we have
all systems run directly to the core IB switch.

Good luck,
- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock at tamu.edu
Jabber: treydock at tamu.edu

On Tue, Mar 17, 2015 at 9:54 AM, Mehmet Soysal <mehmet.soysal at kit.edu>
wrote:

> Hi,
> did you solve the problem?
> We have had a similar issue since an upgrade to RHEL 6.5 or higher.
>
> On our nodes IPoIB no longer works after an opensm failover
> occurs.
> We have several nodes from different vendors. All Red Hat machines are
> affected;
> SUSE machines work fine after an opensm failover.
>
> We did not notice the issue at first, because after a reboot IPoIB works
> fine and then suddenly stops working on all nodes. Everything else keeps
> working fine,
> like MPI communication or Lustre. But if a client needs to reconnect to a
> Lustre server,
> due to a Lustre failover, this is initially done over IP (IPoIB).
> It took a long time until we pinned the issue down to an opensm
> failover.
>
> Our RHEL nodes also have ConnectX-3 cards.
> Updating to RHEL 6.6 does not solve this issue.
> We opened a case with Red Hat for it and are waiting for a fix or a solution.
>
>
>
> best regards
> M.Soysal
>
>
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/users
>