[Users] IPoIB on CentOS 6.5

Foraker, Jim foraker1 at llnl.gov
Thu Mar 19 09:17:09 PDT 2015


Peter,
     Thanks.  I’ve told our RedHat folks that the IPoIB issue is a high
priority for us.  Our bug for the qib kernel RDMA issue is 1188417, which
was closed as a duplicate of
https://bugzilla.redhat.com/show_bug.cgi?id=1171803.

     Jim

On 3/19/15, 2:04 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:

>On Wed, 18 Mar 2015 21:08:17 +0000
>"Foraker, Jim" <foraker1 at llnl.gov> wrote:
>
>>      Does "known broken"
>
>By "known broken" I meant
> 1) several sites, including ours, had to back off to an older or patched
> version to get sane IPoIB behavior
>and
> 2) we opened a case with Red Hat and they've been working on a fix. Our
> support case number for this is 01321081, and I suspect bz1159925 as well.
>
> work on linux-rdma:
>  [PATCH V3 FIX For-3.19 0/3] IB/ipoib: Fix multicast join flow
>  https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg22511.html
>
>> mean Mehmet's case where IPoIB dies after an
>> opensm failover, or broken in other ways?
>
>And the failure mode is essentially that islands of connectivity form
>as the SM is restarted (a secondary symptom is that the ib_ipoib module
>cannot be unloaded once broken / after sm restart).
>
>Here's a step-by-step walkthrough that shows the problem on one of our
>systems (written by a colleague):
>
>--- begin example
>
>IPoIB does not handle subnet manager restarts.
>
>I will show this using an example from yesterday:
>
>  n[464-472] ran CentOS 6.5
>  n[564-572] ran CentOS 6.6
>
>The IPoIB interface ib0 was down on all nodes, and we had just restarted
>OpenSM.
>
>Step 1: Bring up IPoIB on 7 nodes running 6.5 and 7 nodes running 6.6:
>
>    # pdsh -w "n[564-570],n[464-470]" ifup ib0
>
>Step 2: Verify connectivity
>
>  All nodes can ping all other nodes:
>  
>    # pdsh -w "n[564-570],n[464-470]" coping -o -e "ni[564-570],ni[464-470]"|pshbak -c
>    ----------------
>    n[464-470,564-570]
>    ----------------
>    2014-12-11 15:03:54  ni[464-470,564-570]  initially up
>
>Step 3: Restart OpenSM
>
>Step 4: Verify connectivity again:
>
>  Still OK:
>    # pdsh -w "n[564-570],n[464-470]" coping -o -e "ni[564-570],ni[464-470]"|pshbak -c
>    ----------------
>    n[464-470,564-570]
>    ----------------
>    2014-12-11 15:07:01  ni[464-470,564-570]  initially up
>
>Step 5: Start IPoIB on 4 additional nodes (two 6.5 and two 6.6):
>
>    # pdsh -w "n[571-572,471,472]" ifup ib0
>
>Step 6: Verify connectivity:
>
>  Broken:
>  * 6.6 nodes started in Step 1 can still ping all nodes from Step 1,
>    but not the nodes started in Step 5.
>  * 6.5 nodes started in Step 1 can ping everything.
>  * Nodes from Step 5 can ping each other, but only the 6.5 nodes from
>    Step 1, not the 6.6.
>
>    [root@trio yum.repos.d]# pdsh -w "n[564-572],n[464-472]" coping -o -e "ni[564-572],ni[464-472]"|sort|pshbak -c
>    ----------------
>    n[564-570]
>    ----------------
>    2014-12-11 15:08:24  ni[464-470,564-570]  initially up
>    2014-12-11 15:08:24  ni[471-472,571-572]  initially DOWN
>    ----------------
>    n[464-470]
>    ----------------
>    2014-12-11 15:08:24  ni[464-472,564-572]  initially up
>    ----------------
>    n[471-472,571-572]
>    ----------------
>    2014-12-11 15:08:24  ni[464-472,571-572]  initially up
>    2014-12-11 15:08:24  ni[564-570]  initially DOWN
>
>---- end example
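
The "islands" in Step 6 can be recovered by grouping nodes that report the same set of reachable peers. Below is a minimal Python sketch of that grouping, not the tooling used in the example above. The helper names `expand` and `find_islands` are hypothetical, the reachability data is transcribed by hand from the Step 6 output, and the plain n### names stand in for the ni### IPoIB hostnames.

```python
from collections import defaultdict


def expand(prefix, ranges):
    """Expand ("n", [(464, 466)]) into {"n464", "n465", "n466"}."""
    return {f"{prefix}{i}" for lo, hi in ranges for i in range(lo, hi + 1)}


def find_islands(reachable):
    """Group source nodes whose set of reachable peers is identical.

    `reachable` maps a node name to the set of peers it could ping.
    Returns a dict: frozenset(peers) -> set of nodes that see exactly those peers.
    """
    groups = defaultdict(set)
    for node, peers in reachable.items():
        groups[frozenset(peers)].add(node)
    return dict(groups)


# Node sets from the example (hand-transcribed, illustrative only).
old65 = expand("n", [(464, 470)])             # 6.5 nodes started in Step 1
old66 = expand("n", [(564, 570)])             # 6.6 nodes started in Step 1
late = expand("n", [(471, 472), (571, 572)])  # nodes started in Step 5
everyone = old65 | old66 | late

# Reachability as reported in Step 6.
reachable = {}
for node in old65:
    reachable[node] = everyone       # 6.5 nodes from Step 1 reach everything
for node in old66:
    reachable[node] = old65 | old66  # 6.6 nodes from Step 1 miss the Step 5 nodes
for node in late:
    reachable[node] = old65 | late   # Step 5 nodes miss the Step 1 6.6 nodes

islands = find_islands(reachable)
for peers, nodes in islands.items():
    print(f"{len(nodes)} nodes each reach {len(peers)} peers: {sorted(nodes)}")
```

A healthy fabric yields a single group in which every node reaches every peer. More than one group indicates the partial connectivity described above.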
>
>>  The only issue we¹ve seen
>> with IPoIB in RHEL 6.6 has been a bug with QIB hardware and
>> kernel-based RDMA (Lustre, SRP). Is there a RHEL bugzilla bug open on
>> the issue(s)?
>
>For bug id and possible bz see beginning of my e-mail.
>
>Is there a bz for the QIB bug you mentioned? (We've seen this too
>and switched to ofed-3.12-1 on the system that required lnet on qib.)
>
>/Peter 
> 
>>      Jim
>> 
>> 
>> On 3/18/15, 10:30 AM, "Peter Kjellström" <cap at nsc.liu.se> wrote:
>> 
>> >On Tue, 17 Mar 2015 15:54:02 +0100
>> >Mehmet Soysal <mehmet.soysal at kit.edu> wrote:
>> >
>> >> Hi,
>> >> did you solve the problem?
>> >> We have had a similar issue since an upgrade to RHEL 6.5 or higher.
>> >> 
>> >> On our nodes ipoib no longer works after an opensm failover
>> >> occurs.
>> >
>> >Actually IPoIB is known broken in rhel6 (6.5 zstream 431-x, x > 37
>> >and for 6.6 all released -504). Redhat knows this and is working on
>> >a fix (there may be a candidate fix kernel to request). Meanwhile
>> >we've rebuilt latest -504 with the ipoib from 6.5 (which works fine
>> >for us).
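
The kernel ranges named above can be checked against the output of `uname -r`. This is a rough Python sketch under two assumptions: that "431-x, x > 37" refers to the number following 431 in the release string, and that RHEL 6 kernels use the 2.6.32 base. The function name `ipoib_known_broken` and the sample release strings are illustrative, not taken from this thread. A locally rebuilt -504 kernel carrying the older ipoib would still be flagged, since only the version string is inspected.

```python
import re


def ipoib_known_broken(release):
    """Return True if `release` falls in the ranges described in the thread:
    2.6.32-431.x with x > 37, or any 2.6.32-504 build.

    False only means "not in those ranges"; it says nothing about
    whether other kernels work.
    """
    m = re.match(r"2\.6\.32-(\d+)(?:\.(\d+))?", release)
    if not m:
        return False
    series = int(m.group(1))
    # A release string without a z-stream number (e.g. "-431.el6") counts as x = 0.
    zstream = int(m.group(2)) if m.group(2) else 0
    if series == 431:
        return zstream > 37
    return series == 504


# Illustrative release strings (assumed, not from the thread).
for rel in ("2.6.32-431.el6.x86_64",
            "2.6.32-431.40.1.el6.x86_64",
            "2.6.32-504.8.1.el6.x86_64"):
    print(rel, ipoib_known_broken(rel))
```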
>> >
>> >If you're interested in our -504 pkgs with old/working ipoib contact
>> >me offlist.
>> >
>> >Since last week you also get the additional complication of the verbs
>> >CVE to take into account when picking a working setup...
>> >
>> >/Peter K
>> >_______________________________________________
>> >Users mailing list
>> >Users at lists.openfabrics.org
>> >http://lists.openfabrics.org/mailman/listinfo/users
>> >
>> 
>> 
>
>




