[Users] IPoIB on CentOS 6.5
ira.weiny
ira.weiny at intel.com
Fri Mar 20 19:59:00 PDT 2015
On Fri, Mar 20, 2015 at 02:28:27PM +0100, Mehmet Soysal wrote:
> Hi,
> we also think this is pretty serious.
> We could not find anything obvious in the logs.
> The new SM lid is assigned properly.
>
> Here is an example:
> Nodes fhbn[001-002]-i are freshly rebooted.
>
> fhbn002:~# ping fhbn001-i
> PING fhbn001-i.localdomain (172.26.24.1) 56(84) bytes of data.
> 64 bytes from fhbn001-i.localdomain (172.26.24.1): icmp_seq=1 ttl=64
> time=2.20 ms
>
> fhbn002:~# tcpdump -i ib0
> 14:08:17.930987 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
> 14:08:17.931169 ARP, Reply fhbn001-i.localdomain is-at
> 80:00:00:48:fe:80:00:00:00:00:00:00:00:1e:67:03:00:4f:db:97, length 56
>
> fhbn002:~# ibstat | grep SM
> SM lid: 1
>
> fhbn002:~# cat /sys/kernel/debug/ipoib/ib0_mcg
> GID: ff12:401b:ffff:0:0:0:0:1
> created: 4363417473
> queuelen: 0
> complete: yes
> send_only: no
>
> GID: ff12:401b:ffff:0:0:0:ffff:ffff
> created: 4363417473
> queuelen: 0
> complete: yes
> send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:1
> created: 4363417473
> queuelen: 0
> complete: yes
> send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:16
> created: 4363417525
> queuelen: 2
> complete: no
> send_only: yes
>
> GID: ff12:601b:ffff:0:0:1:ff1a:1802
> created: 4363418568
> queuelen: 1
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff4f:f33f
> created: 4363417523
> queuelen: 0
> complete: yes
> send_only: no
>
>
> Everything looks fine.
> Now we stop the primary OpenSM, start OpenSM on the backup server,
> and wait until the ARP cache is cleared.
>
> fhbn002:~# ping fhbn001-i
> From fhbn002-i.localdomain (172.26.24.2) icmp_seq=2 Destination Host
> Unreachable
> From fhbn002-i.localdomain (172.26.24.2) icmp_seq=3 Destination Host
> Unreachable
>
> fhbn002:~# tcpdump -i ib0
> 14:16:59.272994 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
> 14:17:00.272985 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
> 14:17:01.272986 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
>
> fhbn002:~# ibstat | grep SM
> SM lid: 156
>
> fhbn002:~# cat /sys/kernel/debug/ipoib/ib0_mcg
> GID: ff12:401b:ffff:0:0:0:0:1
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:1
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff1a:1802
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff4f:f33f
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
>
> Do I read it correctly that after an OpenSM switchover
> an IB client does not get or join multicast groups?
Yes.
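For what it's worth, that state is easy to spot mechanically: in the second ib0_mcg dump above, every group is stuck at "complete: no". A small sketch (the function name is mine; it assumes the debugfs layout shown in the dumps):

```shell
#!/bin/sh
# Sketch: flag IPoIB multicast groups whose join never completed.
# Assumes the /sys/kernel/debug/ipoib/<ifname>_mcg layout shown above;
# the function name is made up for illustration.
count_stuck_mcg() {
    awk '
        /^GID:/        { gid = $2 }                     # remember current group
        /complete: no/ { print "stuck: " gid; n++ }     # join never finished
        END            { print n + 0, "incomplete group(s)" }
    ' "${1:-/sys/kernel/debug/ipoib/ib0_mcg}"
}
```

On a healthy node, once joins settle, this prints "0 incomplete group(s)"; after the failover above it would list every group.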
> Switching back to primary opensm does not change anything.
>
> Only our Redhat 6.5 clients (or newer) are affected by this.
> The only solution is to reboot the clients (a power cycle, because the
> ib_ipoib module can't be unloaded).
Looks like RHEL has some issues which I know Doug is working on. I can't say
for sure which patches are in the kernels you are running, so I can't speak to
the specifics.
What I would recommend is trying the current patch series I referenced earlier
in this thread:
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23114.html
Ira
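As an aside, the 20-byte link-layer address in the ARP reply near the top of the thread is the IPoIB hardware address: a 4-octet field carrying the flags and QPN, followed by the port's 16-octet GID (per RFC 4391). A quick sketch to split one (helper name is mine):

```shell
#!/bin/sh
# Sketch: split an IPoIB 20-octet hardware address (RFC 4391) into its
# 4-octet flags+QPN field and the 16-octet port GID.
decode_ipoib_hwaddr() {
    printf 'qpn-field: %s\n' "$(printf '%s' "$1" | cut -d: -f1-4)"
    printf 'gid:       %s\n' "$(printf '%s' "$1" | cut -d: -f5-20)"
}
```

The GID half should match the target port's GID as reported by tools like ibstat.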
>
> best regards
> M.Soysal
>
>
>
>
> On 19.03.2015 22:25, Weiny, Ira wrote:
> >>Hi,
> >>thats good to hear, that this issue is put on high priority.
> >>Our Redhat case is 01368360.
> >>
> >>Our problem with IPoIB is slightly different from what Peter explained.
> >>I did not notice any islands being formed.
> >>After an OpenSM failover, none of the clients can use IPoIB any more, and
> >>unloading ib_ipoib is also not possible.
> >This seems pretty serious. Any idea why?
> >
> >>What I noticed is that ARP requests are not answered after a failover.
> >>If a node still has a valid ARP cache entry for another IB node, it can
> >>still ping it.
> >>After clearing the cache, the client does not get any ARP answers for
> >>that node.
> >>
> >>Hope that Redhat fixes this issue soon.
> >>
> >Has the failover completed? Did the SM LID get properly reassigned?
> >
> >For a node on which ARP is failing, are the mcast groups joined?
> >
> >Besides the opensm log and the saquery tool, IPoIB has some debugfs
> >entries which can help here.
> >
> >[root at phcppriv12 oib_utils]# cat /sys/kernel/debug/ipoib/ib0_mcg
> >GID: ff12:401b:ffff:0:0:0:0:1
> > created: 6034115225
> > queuelen: 0
> > complete: yes
> > send_only: no
> >
> >...
> >
> >
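One way to combine those sources: reduce both the debugfs view and the SM's view (e.g. from saquery) to plain one-GID-per-line lists, then diff them. The saquery extraction step is left out here since its output format varies; the helper name is mine:

```shell
#!/bin/sh
# Sketch: report multicast GIDs the node believes it joined (first file)
# that are absent from the SM's view (second file). Both files are assumed
# to already be plain one-GID-per-line lists.
missing_on_sm() {
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"   # lines unique to the local list
    rm -f "$a" "$b"
}
```

Any GID this prints is a join the node thinks it holds but the SM never saw, which matches the Client Reregister theory below.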
> >This sounds like an issue with Client Reregister/mcast join after the
> >failover.
> >
> >If possible, it would be nice if those experiencing IPoIB issues like this
> >could try Doug's latest patch series, which I believe fixes various mcast
> >join issues with IPoIB.
> >
> >https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23114.html
> >
> >Ira
> >
> >
> >>
> >>best regards
> >>M.Soysal
> >>
> >>
> >>
> >>On 19.03.2015 17:17, Foraker, Jim wrote:
> >>>Peter,
> >>> Thanks. I've told our RedHat folks that the IPoIB issue is a
> >>>high priority for us. Our bug for the qib kernel RDMA issue is
> >>>1188417, which was closed as a duplicate of
> >>>https://bugzilla.redhat.com/show_bug.cgi?id=1171803.
> >>>
> >>> Jim
> >>>
> >>>On 3/19/15, 2:04 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
> >>>
> >>>>On Wed, 18 Mar 2015 21:08:17 +0000
> >>>>"Foraker, Jim" <foraker1 at llnl.gov> wrote:
> >>>>
> >>>>> Does "known broken"
> >>>>By "known broken" I meant
> >>>>1) several sites, including ours, had to back off to older or patched
> >>>>versions to get sanity for IPoIB, and
> >>>>2) we opened a case with Redhat and they've been working on a fix. Our
> >>>>support case nr for this is 01321081 and I suspect also bz1159925.
> >>>>
> >>>>There is also work on linux-rdma:
> >>>> [PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
> >>>>
> >>>>https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html
> >>>>
> >>>>>mean Mehmet's case where IPoIB dies after an OpenSM failover, or
> >>>>>broken in other ways?
> >>>>And the failure mode is essentially that islands of connectivity form
> >>>>as the SM is restarted (a secondary symptom is that the ib_ipoib
> >>>>module cannot be unloaded once broken / after sm restart).
> >>>>
> >>>>Here's a step-by-step example that shows the problem on one of our
> >>>>systems (written by a colleague):
> >>>>
> >>>>--- begin example
> >>>>
> >>>>IPoIB does not handle subnet manager restarts.
> >>>>
> >>>>I will show this using an example from yesterday:
> >>>>
> >>>> n[464-472] ran CentOS 6.5
> >>>> n[564-572] ran CentOS 6.6
> >>>>
> >>>>The IPoIB interface ib0 was down on all nodes, and we had just
> >>>>restarted OpenSM.
> >>>>
> >>>>Step 1: Bring up IPoIB on 7 nodes running 6.5 and 7 nodes running 6.6:
> >>>>
> >>>> # pdsh -w "n[564-570],n[464-470]" ifup ib0
> >>>>
> >>>>Step 2: Verify connectivity
> >>>>
> >>>> All nodes can ping all other nodes:
> >>>>
> >>>> # pdsh -w "n[564-570],n[464-470]" coping -o -e
> >>>>"ni[564-570],ni[464-470]"|pshbak -c
> >>>> ----------------
> >>>> n[464-470,564-570]
> >>>> ----------------
> >>>> 2014-12-11 15:03:54 ni[464-470,564-570] initially up
> >>>>
> >>>>Step 3: Restart OpenSM
> >>>>
> >>>>Step 4: Verify connectivity again:
> >>>>
> >>>> Still OK:
> >>>> # pdsh -w "n[564-570],n[464-470]" coping -o -e
> >>>>"ni[564-570],ni[464-470]"|pshbak -c
> >>>> ----------------
> >>>> n[464-470,564-570]
> >>>> ----------------
> >>>> 2014-12-11 15:07:01 ni[464-470,564-570] initially up
> >>>>
> >>>>Step 5: Start IPoIB on 4 additional nodes (two 6.5 and two 6.6):
> >>>>
> >>>> # pdsh -w "n[571-572,471,472]" ifup ib0
> >>>>
> >>>>Step 6: Verify connectivity:
> >>>>
> >>>> Broken:
> >>>> * 6.6 nodes started in Step 1 can still ping all nodes from Step 1,
> >>>> but not the nodes started in Step 5.
> >>>> * 6.5 nodes started in Step 1 can ping everything.
> >>>> * Nodes from Step 5 can ping each other, but only the 6.5 nodes from
> >>>> Step 1, not the 6.6.
> >>>>
> >>>> [root at trio yum.repos.d]# pdsh -w "n[564-572],n[464-472]" coping
> >>>>-o -e "ni[564-572],ni[464-472]"|sort|pshbak -c
> >>>> ----------------
> >>>> n[564-570]
> >>>> ----------------
> >>>> 2014-12-11 15:08:24 ni[464-470,564-570] initially up
> >>>> 2014-12-11 15:08:24 ni[471-472,571-572] initially DOWN
> >>>> ----------------
> >>>> n[464-470]
> >>>> ----------------
> >>>> 2014-12-11 15:08:24 ni[464-472,564-572] initially up
> >>>> ----------------
> >>>> n[471-472,571-572]
> >>>> ----------------
> >>>> 2014-12-11 15:08:24 ni[464-472,571-572] initially up
> >>>> 2014-12-11 15:08:24 ni[564-570] initially DOWN
> >>>>
> >>>>---- end example
> >>>>
> >>>>> The only issue we've seen
> >>>>>with IPoIB in RHEL 6.6 has been a bug with QIB hardware and
> >>>>>kernel-based RDMA (Lustre, SRP). Is there a RHEL bugzilla bug open
> >>>>>on the issue(s)?
> >>>>For bug id and possible bz see beginning of my e-mail.
> >>>>
> >>>>Is there a bz for the QIB bug you mentioned? (We've seen this too and
> >>>>switched to ofed-3.12-1 on the system that required lnet on qib.)
> >>>>
> >>>>/Peter
> >>>>
> >>>>> Jim
> >>>>>
> >>>>>
> >>>>>On 3/18/15, 10:30 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
> >>>>>
> >>>>>>On Tue, 17 Mar 2015 15:54:02 +0100
> >>>>>>Mehmet Soysal <mehmet.soysal at kit.edu> wrote:
> >>>>>>
> >>>>>>>Hi,
> >>>>>>>did you solve the problem?
> >>>>>>>We have had a similar issue since an upgrade to RHEL 6.5 or higher.
> >>>>>>>
> >>>>>>>On our nodes IPoIB no longer works after an OpenSM failover occurs.
> >>>>>>Actually, IPoIB is known broken in rhel6 (6.5 zstream 431-x with
> >>>>>>x > 37, and for 6.6 all released -504 kernels). Redhat knows this and
> >>>>>>is working on a fix (there may be a candidate fix kernel to request).
> >>>>>>Meanwhile we've rebuilt the latest -504 with the ipoib from 6.5
> >>>>>>(which works fine for us).
> >>>>>>
> >>>>>>If you're interested in our -504 pkgs with old/working ipoib
> >>>>>>contact me offlist.
> >>>>>>
> >>>>>>Since last week you also get the additional complication of the
> >>>>>>verbs CVE to take into account when picking a working setup...
> >>>>>>
> >>>>>>/Peter K
> >>>>>>_______________________________________________
> >>>>>>Users mailing list
> >>>>>>Users at lists.openfabrics.org
> >>>>>>http://lists.openfabrics.org/mailman/listinfo/users
> >>>>>>
> >>--
> >>----------------------------------------------------------------------------
> >>Mehmet Soysal
> >>Scientific Computing and Services (SCS)
> >>
> >>Karlsruher Institut für Technologie (KIT) Steinbuch Centre for Computing
> >>(SCC)
> >>Zirkel 2, Gebäude 20.21, Raum 206
> >>D-76131 Karlsruhe
> >>Tel. : +49 721 608-46347
> >>Fax : +49 721 32550
> >>Email: Mehmet.Soysal at kit.edu
> >>WWW : http://www.scc.kit.edu
> >>
> >>KIT - Universität des Landes Baden-Württemberg und nationales
> >>Forschungszentrum in der Helmholtz-Gemeinschaft
> >>
>
>