[Users] IPoIB on CentOS 6.5
ira.weiny
ira.weiny at intel.com
Fri Mar 20 19:59:00 PDT 2015
On Fri, Mar 20, 2015 at 02:28:27PM +0100, Mehmet Soysal wrote:
> Hi,
> we also think this is pretty serious.
> We could not find anything obvious in the logs.
> The new SM lid is assigned properly.
>
> Here is an example:
> Nodes fhbn[001-002]-i are freshly rebooted.
>
> fhbn002:~# ping fhbn001-i
> PING fhbn001-i.localdomain (172.26.24.1) 56(84) bytes of data.
> 64 bytes from fhbn001-i.localdomain (172.26.24.1): icmp_seq=1 ttl=64
> time=2.20 ms
>
> fhbn002:~# tcpdump -i ib0
> 14:08:17.930987 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
> 14:08:17.931169 ARP, Reply fhbn001-i.localdomain is-at
> 80:00:00:48:fe:80:00:00:00:00:00:00:00:1e:67:03:00:4f:db:97, length 56
>
> fhbn002:~# ibstat | grep SM
> SM lid: 1
>
> fhbn002:~# cat /sys/kernel/debug/ipoib/ib0_mcg
> GID: ff12:401b:ffff:0:0:0:0:1
> created: 4363417473
> queuelen: 0
> complete: yes
> send_only: no
>
> GID: ff12:401b:ffff:0:0:0:ffff:ffff
> created: 4363417473
> queuelen: 0
> complete: yes
> send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:1
> created: 4363417473
> queuelen: 0
> complete: yes
> send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:16
> created: 4363417525
> queuelen: 2
> complete: no
> send_only: yes
>
> GID: ff12:601b:ffff:0:0:1:ff1a:1802
> created: 4363418568
> queuelen: 1
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff4f:f33f
> created: 4363417523
> queuelen: 0
> complete: yes
> send_only: no
>
>
> Everything looks fine.
> Now we stop the primary OpenSM, start OpenSM on the backup server,
> and wait until the ARP cache is cleared.
>
> fhbn002:~# ping fhbn001-i
> From fhbn002-i.localdomain (172.26.24.2) icmp_seq=2 Destination Host
> Unreachable
> From fhbn002-i.localdomain (172.26.24.2) icmp_seq=3 Destination Host
> Unreachable
>
> fhbn002:~# tcpdump -i ib0
> 14:16:59.272994 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
> 14:17:00.272985 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
> 14:17:01.272986 ARP, Request who-has fhbn001-i.localdomain tell
> fhbn002-i.localdomain, length 56
>
> fhbn002:~# ibstat | grep SM
> SM lid: 156
>
> fhbn002:~# cat /sys/kernel/debug/ipoib/ib0_mcg
> GID: ff12:401b:ffff:0:0:0:0:1
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:0:0:1
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff1a:1802
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
> GID: ff12:601b:ffff:0:0:1:ff4f:f33f
> created: 4379686326
> queuelen: 0
> complete: no
> send_only: no
>
>
> Do I read it correctly that after an OpenSM switchover
> an IB client does not get or join multicast groups?
Yes.
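For what it's worth, that state is easy to spot mechanically: in the second ib0_mcg dump above, every group is stuck at "complete: no". A small sketch (the function name is mine; it assumes the debugfs layout shown in the dumps):

```shell
#!/bin/sh
# Sketch: flag IPoIB multicast groups whose join never completed.
# Assumes the /sys/kernel/debug/ipoib/<ifname>_mcg layout shown above;
# the function name is made up for illustration.
count_stuck_mcg() {
    awk '
        /^GID:/        { gid = $2 }                     # remember current group
        /complete: no/ { print "stuck: " gid; n++ }     # join never finished
        END            { print n + 0, "incomplete group(s)" }
    ' "${1:-/sys/kernel/debug/ipoib/ib0_mcg}"
}
```

On a healthy node, once joins settle, this prints "0 incomplete group(s)"; after the failover above it would list every group.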
> Switching back to primary opensm does not change anything.
>
> Only our Redhat 6.5 clients (or newer) are affected by this.
> The only solution is to reboot the clients (a power cycle, because the
> ib_ipoib module can't be unloaded).
Looks like RHEL has some issues which I know Doug is working on. I can't say
for sure which patches are in the kernels you are running, so I can't speak to
the specifics.
What I would recommend is trying the current patch series I referenced earlier
in this thread:
https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23114.html
Ira
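As an aside, the 20-byte link-layer address in the ARP reply near the top of the thread is the IPoIB hardware address: a 4-octet field carrying the flags and QPN, followed by the port's 16-octet GID (per RFC 4391). A quick sketch to split one (helper name is mine):

```shell
#!/bin/sh
# Sketch: split an IPoIB 20-octet hardware address (RFC 4391) into its
# 4-octet flags+QPN field and the 16-octet port GID.
decode_ipoib_hwaddr() {
    printf 'qpn-field: %s\n' "$(printf '%s' "$1" | cut -d: -f1-4)"
    printf 'gid:       %s\n' "$(printf '%s' "$1" | cut -d: -f5-20)"
}
```

The GID half should match the target port's GID as reported by tools like ibstat.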
>
> best regards
> M.Soysal
>
>
>
>
> On 19.03.2015 22:25, Weiny, Ira wrote:
> >>Hi,
> >>thats good to hear, that this issue is put on high priority.
> >>Our Redhat case is 01368360.
> >>
> >>Our problem with IPoIB is slightly different from what Peter explained.
> >>I did not notice any islands being formed.
> >>After an OpenSM failover, none of the clients can use IPoIB any more, and
> >>unloading ib_ipoib is also not possible.
> >This seems pretty serious. Any idea why?
> >
> >>What I noticed is that ARP requests are not answered after a failover.
> >>If a node still has a valid ARP cache entry for another IB node, it can
> >>still ping it.
> >>After clearing the cache, the client does not get any ARP answers for
> >>that node.
> >>
> >>Hope that Redhat fixes this issue soon.
> >>
> >Has the failover completed? Did the SM LID get properly reassigned?
> >
> >For a node on which ARP is failing, are the mcast groups joined?
> >
> >Besides the opensm log and the saquery tool, IPoIB has some debugfs
> >entries which can help here.
> >
> >[root at phcppriv12 oib_utils]# cat /sys/kernel/debug/ipoib/ib0_mcg
> >GID: ff12:401b:ffff:0:0:0:0:1
> > created: 6034115225
> > queuelen: 0
> > complete: yes
> > send_only: no
> >
> >...
> >
> >
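One way to combine those sources: reduce both the debugfs view and the SM's view (e.g. from saquery) to plain one-GID-per-line lists, then diff them. The saquery extraction step is left out here since its output format varies; the helper name is mine:

```shell
#!/bin/sh
# Sketch: report multicast GIDs the node believes it joined (first file)
# that are absent from the SM's view (second file). Both files are assumed
# to already be plain one-GID-per-line lists.
missing_on_sm() {
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b"   # lines unique to the local list
    rm -f "$a" "$b"
}
```

Any GID this prints is a join the node thinks it holds but the SM never saw, which matches the Client Reregister theory below.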
> >This sounds like an issue with Client Reregister/mcast join after the
> >failover.
> >
> >If possible, it would be nice if those experiencing IPoIB issues like this
> >could try Doug's latest patch series, which I believe fixes various mcast
> >join issues with IPoIB.
> >
> >https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23114.html
> >
> >Ira
> >
> >
> >>
> >>best regards
> >>M.Soysal
> >>
> >>
> >>
> >>On 19.03.2015 17:17, Foraker, Jim wrote:
> >>>Peter,
> >>> Thanks. I've told our RedHat folks that the IPoIB issue is a
> >>>high priority for us. Our bug for the qib kernel RDMA issue is
> >>>1188417, which was closed as a duplicate of
> >>>https://bugzilla.redhat.com/show_bug.cgi?id=1171803.
> >>>
> >>> Jim
> >>>
> >>>On 3/19/15, 2:04 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
> >>>
> >>>>On Wed, 18 Mar 2015 21:08:17 +0000
> >>>>"Foraker, Jim" <foraker1 at llnl.gov> wrote:
> >>>>
> >>>>> Does "known broken"
> >>>>By "known broken" I meant
> >>>>1) several sites, including ours, had to back off to older or patched
> >>>>versions to get sanity for IPoIB, and
> >>>>2) we opened a case with Redhat and they've been working on a fix. Our
> >>>>support case nr for this is 01321081 and I suspect also bz1159925.
> >>>>
> >>>>There is also work on linux-rdma:
> >>>> [PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
> >>>>
> >>>>https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html
> >>>>
> >>>>>mean Mehmet's case where IPoIB dies after an OpenSM failover, or
> >>>>>broken in other ways?
> >>>>And the failure mode is essentially that islands of connectivity form
> >>>>as the SM is restarted (a secondary symptom is that the ib_ipoib
> >>>>module cannot be unloaded once broken / after sm restart).
> >>>>
> >>>>Here's a step-by-step example that shows the problem on one of our
> >>>>systems (written by a colleague):
> >>>>
> >>>>--- begin example
> >>>>
> >>>>IPoIB does not handle subnet manager restarts.
> >>>>
> >>>>I will show this using an example from yesterday:
> >>>>
> >>>> n[464-472] ran CentOS 6.5
> >>>> n[564-572] ran CentOS 6.6
> >>>>
> >>>>The IPoIB interface ib0 was down on all nodes, and we had just
> >>>>restarted OpenSM.
> >>>>
> >>>>Step 1: Bring up IPoIB on 7 nodes running 6.5 and 7 nodes running 6.6:
> >>>>
> >>>> # pdsh -w "n[564-570],n[464-470]" ifup ib0
> >>>>
> >>>>Step 2: Verify connectivity
> >>>>
> >>>> All nodes can ping all other nodes:
> >>>>
> >>>> # pdsh -w "n[564-570],n[464-470]" coping -o -e
> >>>>"ni[564-570],ni[464-470]"|pshbak -c
> >>>> ----------------
> >>>> n[464-470,564-570]
> >>>> ----------------
> >>>> 2014-12-11 15:03:54 ni[464-470,564-570] initially up
> >>>>
> >>>>Step 3: Restart OpenSM
> >>>>
> >>>>Step 4: Verify connectivity again:
> >>>>
> >>>> Still OK:
> >>>> # pdsh -w "n[564-570],n[464-470]" coping -o -e
> >>>>"ni[564-570],ni[464-470]"|pshbak -c
> >>>> ----------------
> >>>> n[464-470,564-570]
> >>>> ----------------
> >>>> 2014-12-11 15:07:01 ni[464-470,564-570] initially up
> >>>>
> >>>>Step 5: Start IPoIB on 4 additional nodes (two 6.5 and two 6.6):
> >>>>
> >>>> # pdsh -w "n[571-572,471,472]" ifup ib0
> >>>>
> >>>>Step 6: Verify connectivity:
> >>>>
> >>>> Broken:
> >>>> * 6.6 nodes started in Step 1 can still ping all nodes from Step 1,
> >>>> but not the nodes started in Step 5.
> >>>> * 6.5 nodes started in Step 1 can ping everything.
> >>>> * Nodes from Step 5 can ping each other, but only the 6.5 nodes from
> >>>> Step 1, not the 6.6.
> >>>>
> >>>> [root at trio yum.repos.d]# pdsh -w "n[564-572],n[464-472]" coping
> >>>>-o -e "ni[564-572],ni[464-472]"|sort|pshbak -c
> >>>> ----------------
> >>>> n[564-570]
> >>>> ----------------
> >>>> 2014-12-11 15:08:24 ni[464-470,564-570] initially up
> >>>> 2014-12-11 15:08:24 ni[471-472,571-572] initially DOWN
> >>>> ----------------
> >>>> n[464-470]
> >>>> ----------------
> >>>> 2014-12-11 15:08:24 ni[464-472,564-572] initially up
> >>>> ----------------
> >>>> n[471-472,571-572]
> >>>> ----------------
> >>>> 2014-12-11 15:08:24 ni[464-472,571-572] initially up
> >>>> 2014-12-11 15:08:24 ni[564-570] initially DOWN
> >>>>
> >>>>---- end example
> >>>>
> >>>>> The only issue we've seen
> >>>>>with IPoIB in RHEL 6.6 has been a bug with QIB hardware and
> >>>>>kernel-based RDMA (Lustre, SRP). Is there a RHEL bugzilla bug open
> >>>>>on the issue(s)?
> >>>>For bug id and possible bz see beginning of my e-mail.
> >>>>
> >>>>Is there a bz for the QIB bug you mentioned? (We've seen this too and
> >>>>switched to ofed-3.12-1 on the system that required lnet on qib.)
> >>>>
> >>>>/Peter
> >>>>
> >>>>> Jim
> >>>>>
> >>>>>
> >>>>>On 3/18/15, 10:30 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
> >>>>>
> >>>>>>On Tue, 17 Mar 2015 15:54:02 +0100
> >>>>>>Mehmet Soysal <mehmet.soysal at kit.edu> wrote:
> >>>>>>
> >>>>>>>Hi,
> >>>>>>>did you solve the problem?
> >>>>>>>We have had a similar issue since an upgrade to RHEL 6.5 or higher.
> >>>>>>>
> >>>>>>>On our nodes IPoIB no longer works after an OpenSM failover occurs.
> >>>>>>Actually, IPoIB is known broken in rhel6 (6.5 zstream 431-x with
> >>>>>>x > 37, and for 6.6 all released -504 kernels). Redhat knows this and
> >>>>>>is working on a fix (there may be a candidate fix kernel to request).
> >>>>>>Meanwhile we've rebuilt the latest -504 with the ipoib from 6.5
> >>>>>>(which works fine for us).
> >>>>>>
> >>>>>>If you're interested in our -504 pkgs with old/working ipoib
> >>>>>>contact me offlist.
> >>>>>>
> >>>>>>Since last week you also get the additional complication of the
> >>>>>>verbs CVE to take into account when picking a working setup...
> >>>>>>
> >>>>>>/Peter K
> >>>>>>_______________________________________________
> >>>>>>Users mailing list
> >>>>>>Users at lists.openfabrics.org
> >>>>>>http://lists.openfabrics.org/mailman/listinfo/users
> >>>>>>
> >>--
> >>----------------------------------------------------------------------------
> >>Mehmet Soysal
> >>Scientific Computing and Services (SCS)
> >>
> >>Karlsruher Institut für Technologie (KIT) Steinbuch Centre for Computing
> >>(SCC)
> >>Zirkel 2, Gebäude 20.21, Raum 206
> >>D-76131 Karlsruhe
> >>Tel. : +49 721 608-46347
> >>Fax : +49 721 32550
> >>Email: Mehmet.Soysal at kit.edu
> >>WWW : http://www.scc.kit.edu
> >>
> >>KIT - Universität des Landes Baden-Württemberg und nationales
> >>Forschungszentrum in der Helmholtz-Gemeinschaft
> >>
>
>