[Users] IPoIB on CentOS 6.5
Peter Kjellström
cap at nsc.liu.se
Thu Mar 19 02:04:40 PDT 2015
On Wed, 18 Mar 2015 21:08:17 +0000
"Foraker, Jim" <foraker1 at llnl.gov> wrote:
> Does "known broken"
By "known broken" I meant
1) several sites, including ours, had to back off to an older or patched
version to get sane IPoIB behaviour
And
2) We raised a case with Red Hat and they've been working on a fix. Our
support case number for this is 01321081, and I suspect bz1159925 is
also about this. Related work on linux-rdma:
[PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html
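As a rough aid for checking a node against the kernel ranges named in the quoted message further down (6.5 z-stream 431-x with x > 37, and all released 6.6 -504 kernels), something like the sketch below could be used. The function name `ipoib_kernel_status` is hypothetical, and reading "x" as the first number after "431." in the `uname -r` string is an assumption, not something confirmed in this thread. "not-listed" only means the release is outside those two ranges, not that it is known good.

```shell
# Hypothetical helper: classify a kernel release string against the
# ranges reported in this thread. Interpreting "431-x, x > 37" as the
# first numeric field after "2.6.32-431." is an assumption.
ipoib_kernel_status() {
    release="${1:-$(uname -r)}"
    case "$release" in
        2.6.32-504.*)
            # all released 6.6 (-504) kernels were reported broken
            echo "affected" ;;
        2.6.32-431.*)
            zs="${release#2.6.32-431.}"   # e.g. "40.1.el6.x86_64"
            zs="${zs%%.*}"                # e.g. "40"
            case "$zs" in
                ''|*[!0-9]*)
                    # no numeric z-stream field (e.g. the base 6.5 kernel)
                    echo "not-listed" ;;
                *)
                    if [ "$zs" -gt 37 ]; then
                        echo "affected"
                    else
                        echo "not-listed"
                    fi ;;
            esac ;;
        *)
            echo "not-listed" ;;
    esac
}
```

Called without an argument it classifies the running kernel, so it can be pushed out with pdsh in the same way as the commands in the example below.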
> mean Mehmet's case where IPoIB dies after an
> opensm failover, or broken in other ways?
And the failure mode is essentially that islands of connectivity form
as the SM is restarted (a secondary symptom is that the ib_ipoib module
cannot be unloaded once broken / after sm restart).
Here's a step-by-step walkthrough that shows the problem on one of our systems
(written by a colleague):
--- begin example
IPoIB does not handle subnet manager restarts.
I will show this using an example from yesterday:
n[464-472] ran CentOS 6.5
n[564-572] ran CentOS 6.6
The IPoIB interface ib0 was down on all nodes, and we had just restarted
OpenSM.
Step 1: Bring up IPoIB on 7 nodes running 6.5 and 7 nodes running 6.6:
# pdsh -w "n[564-570],n[464-470]" ifup ib0
Step 2: Verify connectivity
All nodes can ping all other nodes:
# pdsh -w "n[564-570],n[464-470]" coping -o -e "ni[564-570],ni[464-470]"|pshbak -c
----------------
n[464-470,564-570]
----------------
2014-12-11 15:03:54 ni[464-470,564-570] initially up
Step 3: Restart OpenSM
Step 4: Verify connectivity again:
Still OK:
# pdsh -w "n[564-570],n[464-470]" coping -o -e "ni[564-570],ni[464-470]"|pshbak -c
----------------
n[464-470,564-570]
----------------
2014-12-11 15:07:01 ni[464-470,564-570] initially up
Step 5: Start IPoIB on 4 additional nodes (two 6.5 and two 6.6):
# pdsh -w "n[571-572,471,472]" ifup ib0
Step 6: Verify connectivity:
Broken:
* 6.6 nodes started in Step 1 can still ping all nodes from Step 1,
but not the nodes started in Step 5.
* 6.5 nodes started in Step 1 can ping everything.
* Nodes from Step 5 can ping each other, but only the 6.5 nodes from
Step 1, not the 6.6.
[root at trio yum.repos.d]# pdsh -w "n[564-572],n[464-472]" coping -o -e "ni[564-572],ni[464-472]"|sort|pshbak -c
----------------
n[564-570]
----------------
2014-12-11 15:08:24 ni[464-470,564-570] initially up
2014-12-11 15:08:24 ni[471-472,571-572] initially DOWN
----------------
n[464-470]
----------------
2014-12-11 15:08:24 ni[464-472,564-572] initially up
----------------
n[471-472,571-572]
----------------
2014-12-11 15:08:24 ni[464-472,571-572] initially up
2014-12-11 15:08:24 ni[564-570] initially DOWN
---- end example
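The `coping` tool in the example is not described in this mail. For sites without it, a minimal stand-in for the per-node check could look like the sketch below. The function name `check_peers` and the `PING_CMD` variable are hypothetical; the default probe assumes the Linux iputils `ping` with `-c` (count) and `-W` (timeout in seconds). It does not expand bracketed host ranges such as `ni[464-470]`, so peers have to be listed one by one.

```shell
# Hypothetical stand-in for the site-local "coping" check used above.
# PING_CMD is an assumption so the probe can be swapped out or stubbed.
PING_CMD="${PING_CMD:-ping -c 1 -W 1}"

# Probe each peer once and print one "up:" line and one "DOWN:" line.
check_peers() {
    up=""
    down=""
    for peer in "$@"; do
        if $PING_CMD "$peer" >/dev/null 2>&1; then
            up="$up $peer"
        else
            down="$down $peer"
        fi
    done
    [ -n "$up" ] && echo "up:$up"
    [ -n "$down" ] && echo "DOWN:$down"
    return 0
}
```

Because every node prints the same two-line summary format, running it on all nodes and collating identical output (as the example does) should make islands of connectivity show up as separate groups.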
> The only issue we've seen
> with IPoIB in RHEL 6.6 has been a bug with QIB hardware and
> kernel-based RDMA (Lustre, SRP). Is there a RHEL bugzilla bug open on
> the issue(s)?
For bug id and possible bz see beginning of my e-mail.
Is there a bz for the QIB bug you mentioned? (We've seen this too
and switched to ofed-3.12-1 on the system that required lnet on qib.)
/Peter
> Jim
>
>
> On 3/18/15, 10:30 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
>
> >On Tue, 17 Mar 2015 15:54:02 +0100
> >Mehmet Soysal <mehmet.soysal at kit.edu> wrote:
> >
> >> Hi,
> >> did you solve the problem?
> >> We have had a similar issue since an upgrade to RHEL 6.5 or higher.
> >>
> >> On our nodes ipoib no longer works after an opensm failover
> >> occurs.
> >
> >Actually IPoIB is known broken in rhel6 (6.5 zstream 431-x, x > 37
> >and for 6.6 all released -504). Redhat knows this and is working on
> >a fix (there may be a candidate fix kernel to request). Meanwhile
> >we've rebuilt latest -504 with the ipoib from 6.5 (which works fine
> >for us).
> >
> >If you're interested in our -504 pkgs with old/working ipoib contact
> >me offlist.
> >
> >Since last week you also get the additional complication of the verbs
> >CVE to take into account when picking a working setup...
> >
> >/Peter K
> >
>
>