[Users] IPoIB on CentOS 6.5

Fri Mar 20 06:28:27 PDT 2015

Hi,
we also think this is pretty serious.
We could not find anything obvious in the logs.
The new SM lid is assigned properly.

Here is an example:
Nodes fhbn[001-002]-i are freshly rebooted.

fhbn002:~# ping fhbn001-i
PING fhbn001-i.localdomain (172.26.24.1) 56(84) bytes of data.
64 bytes from fhbn001-i.localdomain (172.26.24.1): icmp_seq=1 ttl=64 time=2.20 ms

fhbn002:~# tcpdump -i ib0
14:08:17.930987 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56
14:08:17.931169 ARP, Reply fhbn001-i.localdomain is-at 80:00:00:48:fe:80:00:00:00:00:00:00:00:1e:67:03:00:4f:db:97, length 56

fhbn002:~# ibstat | grep SM
         SM lid: 1

fhbn002:~# cat /sys/kernel/debug/ipoib/ib0_mcg
GID: ff12:401b:ffff:0:0:0:0:1
   created: 4363417473
   queuelen:         0
   complete:       yes
   send_only:       no

GID: ff12:401b:ffff:0:0:0:ffff:ffff
   created: 4363417473
   queuelen:         0
   complete:       yes
   send_only:       no

GID: ff12:601b:ffff:0:0:0:0:1
   created: 4363417473
   queuelen:         0
   complete:       yes
   send_only:       no

GID: ff12:601b:ffff:0:0:0:0:16
   created: 4363417525
   queuelen:         2
   complete:        no
   send_only:      yes

GID: ff12:601b:ffff:0:0:1:ff1a:1802
   created: 4363418568
   queuelen:         1
   complete:        no
   send_only:       no

GID: ff12:601b:ffff:0:0:1:ff4f:f33f
   created: 4363417523
   queuelen:         0
   complete:       yes
   send_only:       no

Everything looks fine.
Now we stop the primary OpenSM, start OpenSM on the backup server,
and wait until the ARP cache has been cleared.

fhbn002:~# ping fhbn001-i
From fhbn002-i.localdomain (172.26.24.2) icmp_seq=2 Destination Host Unreachable
From fhbn002-i.localdomain (172.26.24.2) icmp_seq=3 Destination Host Unreachable

fhbn002:~# tcpdump -i ib0
14:16:59.272994 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56
14:17:00.272985 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56
14:17:01.272986 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56
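
For anyone who wants to check a longer capture for the same pattern, here is a minimal Python sketch that reports ARP targets that were requested but never answered. It is only an illustration: the helper name unanswered() is made up, and the regular expressions assume the "ARP, Request who-has X tell Y," and "ARP, Reply X is-at MAC," text seen in the tcpdump output in this mail, with one packet per line (unwrapped).

```python
import re

# Patterns match the tcpdump text shown in this mail (assumed layout).
REQ = re.compile(r"ARP, Request who-has (\S+) tell (\S+?),")
REP = re.compile(r"ARP, Reply (\S+) is-at (\S+?),")


def unanswered(lines):
    """Return ARP targets that were asked for but never seen in a reply."""
    asked, answered = [], set()
    for line in lines:
        m = REQ.search(line)
        if m:
            if m.group(1) not in asked:
                asked.append(m.group(1))
            continue
        m = REP.search(line)
        if m:
            answered.add(m.group(1))
    return [target for target in asked if target not in answered]


GOOD = [
    "14:08:17.930987 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56",
    "14:08:17.931169 ARP, Reply fhbn001-i.localdomain is-at 80:00:00:48:fe:80:00:00:00:00:00:00:00:1e:67:03:00:4f:db:97, length 56",
]
BAD = [
    "14:16:59.272994 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56",
    "14:17:00.272985 ARP, Request who-has fhbn001-i.localdomain tell fhbn002-i.localdomain, length 56",
]

if __name__ == "__main__":
    print("before failover:", unanswered(GOOD))
    print("after failover: ", unanswered(BAD))
```

The two sample lists are the captures from before and after the switchover; only the second one should report fhbn001-i.localdomain as unanswered.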

fhbn002:~# ibstat  | grep SM
         SM lid: 156

fhbn002:~# cat /sys/kernel/debug/ipoib/ib0_mcg
GID: ff12:401b:ffff:0:0:0:0:1
   created: 4379686326
   queuelen:         0
   complete:        no
   send_only:       no

GID: ff12:601b:ffff:0:0:0:0:1
   created: 4379686326
   queuelen:         0
   complete:        no
   send_only:       no

GID: ff12:601b:ffff:0:0:1:ff1a:1802
   created: 4379686326
   queuelen:         0
   complete:        no
   send_only:       no

GID: ff12:601b:ffff:0:0:1:ff4f:f33f
   created: 4379686326
   queuelen:         0
   complete:        no
   send_only:       no

Do I read this correctly: after an OpenSM switchover,
an IB client no longer joins its multicast groups?
Switching back to the primary OpenSM does not change anything.
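
To make the before/after comparison less error-prone, here is a minimal Python sketch that parses the text of /sys/kernel/debug/ipoib/ib0_mcg and lists the groups whose join has not completed. The function names parse_mcg() and incomplete_groups() are made up for illustration, and the parser assumes nothing beyond the "GID:" plus "key: value" layout visible in the dumps above; I have not checked it against other kernel versions.

```python
def parse_mcg(text):
    """Parse ipoib debugfs mcg text into a list of dicts, one per group."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only; the GID value itself contains colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "GID":
            groups.append({"GID": value})
        elif groups:
            groups[-1][key] = value
    return groups


def incomplete_groups(text):
    """Return the GIDs of groups whose 'complete' field is not 'yes'."""
    return [g["GID"] for g in parse_mcg(text) if g.get("complete") != "yes"]


# Two entries copied from the dumps in this mail.
SAMPLE = """\
GID: ff12:401b:ffff:0:0:0:0:1
   created: 4379686326
   queuelen:         0
   complete:        no
   send_only:       no

GID: ff12:601b:ffff:0:0:1:ff4f:f33f
   created: 4363417523
   queuelen:         0
   complete:       yes
   send_only:       no
"""

if __name__ == "__main__":
    for gid in incomplete_groups(SAMPLE):
        print("not joined:", gid)
```

On a node, the same function can be fed the contents of the debugfs file directly (read as root); in the post-failover dump above it would list all four groups.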

Only our Red Hat 6.5 (or newer) clients are affected by this.
The only solution is to reboot the clients (a power cycle, because the
ipoib module can't be unloaded).

best regards
M.Soysal

On 19.03.2015 22:25, Weiny, Ira wrote:
>> Hi,
>> it's good to hear that this issue has been given high priority.
>> Our Red Hat case is 01368360.
>>
>> Our problem with IPoIB is slightly different from what Peter explained.
>> I did not notice any islands being formed.
>> After an opensm failover, none of the clients can use IPoIB any more, and
>> unloading ib_ipoib is also not possible.
> This seems pretty serious.  Any idea why?
>   
>> What I noticed is that ARP requests are not answered after a failover.
>> If a node still has a valid ARP cache entry for another IB node, it can still ping it.
>> After clearing the cache, the client does not get any ARP replies for that
>> node.
>>
>> I hope Red Hat fixes this issue soon.
>>
> Has the failover completed?  Did the SM Lid get properly reassigned?
>
> For a node on which ARP is failing, are the mcast groups joined?
>
> Besides the opensm log and the saquery tools, IPoIB has some debugfs entries which can help here.
>
> [root@phcppriv12 oib_utils]# cat /sys/kernel/debug/ipoib/ib0_mcg
> GID: ff12:401b:ffff:0:0:0:0:1
>    created: 6034115225
>    queuelen:         0
>    complete:       yes
>    send_only:       no
>
> ...
>
>
> This sounds like an issue with Client Reregister/mcast join after the failover.
>
> If possible, it would be nice if those experiencing IPoIB issues like this could try Doug's latest patch series, which I believe fixes various mcast join issues with IPoIB.
>
> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg23114.html
>
> Ira
>
>
>>
>> best regards
>> M.Soysal
>>
>>
>>
>> On 19.03.2015 17:17, Foraker, Jim wrote:
>>> Peter,
>>>        Thanks.  I've told our RedHat folks that the IPoIB issue is a
>>> high priority for us.  Our bug for the qib kernel RDMA issue is
>>> 1188417, which was closed as a duplicate of
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1171803.
>>>
>>>        Jim
>>>
>>> On 3/19/15, 2:04 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
>>>
>>>> On Wed, 18 Mar 2015 21:08:17 +0000
>>>> "Foraker, Jim" <foraker1 at llnl.gov> wrote:
>>>>
>>>>>        Does "known broken"
>>>> By "known broken" I meant
>>>> 1) several sites, including ours, had to back off to an older or patched
>>>> version to get a sane IPoIB, and
>>>> 2) we opened a case with Red Hat and they've been working on a fix. Our
>>>> support case number for this is 01321081, and I suspect also bz1159925.
>>>>
>>>> work on linux-rdma:
>>>>    [PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
>>>>
>>>> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html
>>>>
>>>>> mean Mehmet's case where IPoIB dies after an opensm failover, or
>>>>> broken in other ways?
>>>> And the failure mode is essentially that islands of connectivity form
>>>> as the SM is restarted (a secondary symptom is that the ib_ipoib
>>>> module cannot be unloaded once broken / after sm restart).
>>>>
>>>> Here's a step-by-step way that shows the problem on one of our systems
>>>> (written by a colleague):
>>>>
>>>> --- begin example
>>>>
>>>> IPoIB does not handle subnet manager restarts.
>>>>
>>>> I will show this using an example from yesterday:
>>>>
>>>>    n[464-472] ran CentOS 6.5
>>>>    n[564-572] ran CentOS 6.6
>>>>
>>>> The IPoIB interface ib0 was down on all nodes, and we had just
>>>> restarted OpenSM.
>>>>
>>>> Step 1: Bring up IPoIB on 7 nodes running 6.5 and 7 nodes running 6.6:
>>>>
>>>>      # pdsh -w "n[564-570],n[464-470]" ifup ib0
>>>>
>>>> Step 2: Verify connectivity
>>>>
>>>>    All nodes can ping all other nodes:
>>>>
>>>>      # pdsh -w "n[564-570],n[464-470]" coping -o -e "ni[564-570],ni[464-470]"|pshbak -c
>>>>      ----------------
>>>>      n[464-470,564-570]
>>>>      ----------------
>>>>      2014-12-11 15:03:54  ni[464-470,564-570]  initially up
>>>>
>>>> Step 3: Restart OpenSM
>>>>
>>>> Step 4: Verify connectivity again:
>>>>
>>>>    Still OK:
>>>>      # pdsh -w "n[564-570],n[464-470]" coping -o -e "ni[564-570],ni[464-470]"|pshbak -c
>>>>      ----------------
>>>>      n[464-470,564-570]
>>>>      ----------------
>>>>      2014-12-11 15:07:01  ni[464-470,564-570]  initially up
>>>>
>>>> Step 5: Start IPoIB on 4 additional nodes (two 6.5 and two 6.6):
>>>>
>>>>      # pdsh -w "n[571-572,471,472]" ifup ib0
>>>>
>>>> Step 6: Verify connectivity:
>>>>
>>>>    Broken:
>>>>    * 6.6 nodes started in Step 1 can still ping all nodes from Step 1,
>>>>      but not the nodes started in Step 5.
>>>>    * 6.5 nodes started in Step 1 can ping everything.
>>>>    * Nodes from Step 5 can ping each other, but only the 6.5 nodes from
>>>>      Step 1, not the 6.6.
>>>>
>>>>      [root@trio yum.repos.d]# pdsh -w "n[564-572],n[464-472]" coping -o -e "ni[564-572],ni[464-472]"|sort|pshbak -c
>>>>      ----------------
>>>>      n[564-570]
>>>>      ----------------
>>>>      2014-12-11 15:08:24  ni[464-470,564-570]  initially up
>>>>      2014-12-11 15:08:24  ni[471-472,571-572]  initially DOWN
>>>>      ----------------
>>>>      n[464-470]
>>>>      ----------------
>>>>      2014-12-11 15:08:24  ni[464-472,564-572]  initially up
>>>>      ----------------
>>>>      n[471-472,571-572]
>>>>      ----------------
>>>>      2014-12-11 15:08:24  ni[464-472,571-572]  initially up
>>>>      2014-12-11 15:08:24  ni[564-570]  initially DOWN
>>>>
>>>> ---- end example
>>>>
>>>>>    The only issue we've seen
>>>>> with IPoIB in RHEL 6.6 has been a bug with QIB hardware and
>>>>> kernel-based RDMA (Lustre, SRP). Is there a RHEL bugzilla bug open
>>>>> on the issue(s)?
>>>> For bug id and possible bz see beginning of my e-mail.
>>>>
>>>> Is there a bz for the QIB bug you mentioned? (we've seen this too and
>>>> switched to ofed-3.12-1 on that system that required lnet on qib).
>>>>
>>>> /Peter
>>>>
>>>>>        Jim
>>>>>
>>>>>
>>>>> On 3/18/15, 10:30 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
>>>>>
>>>>>> On Tue, 17 Mar 2015 15:54:02 +0100
>>>>>> Mehmet Soysal <mehmet.soysal at kit.edu> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> did you solve the problem?
>>>>>>> We have had a similar issue since an upgrade to RHEL 6.5 or higher.
>>>>>>>
>>>>>>> On our nodes IPoIB no longer works after an opensm failover
>>>>>>> occurs.
>>>>>> Actually IPoIB is known to be broken in RHEL 6 (6.5 z-stream 431-x
>>>>>> with x > 37, and for 6.6 all released -504 kernels). Red Hat knows
>>>>>> this and is working on a fix (there may be a candidate fix kernel
>>>>>> to request). Meanwhile we've rebuilt the latest -504 with the ipoib
>>>>>> from 6.5 (which works fine for us).
>>>>>>
>>>>>> If you're interested in our -504 pkgs with old/working ipoib
>>>>>> contact me offlist.
>>>>>>
>>>>>> Since last week you also get the additional complication of the
>>>>>> verbs CVE to take into account when picking a working setup...
>>>>>>
>>>>>> /Peter K
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> Users at lists.openfabrics.org
>>>>>> http://lists.openfabrics.org/mailman/listinfo/users
>>>>>>

-- 
----------------------------------------------------------------------------
Mehmet Soysal
Scientific Computing and Services (SCS)

Karlsruher Institut für Technologie (KIT)
Steinbuch Centre for Computing (SCC)
Zirkel 2, Gebäude 20.21, Raum 206
D-76131 Karlsruhe
Tel. : +49 721 608-46347
Fax  : +49 721 32550
Email: Mehmet.Soysal at kit.edu
WWW : http://www.scc.kit.edu

KIT - Universität des Landes Baden-Württemberg und
nationales Forschungszentrum in der Helmholtz-Gemeinschaft