[ofa-general] [Bug 465] IPoIB CM HA fails after several hours of failures

Michael S. Tsirkin mst at dev.mellanox.co.il
Tue Mar 27 01:59:00 PDT 2007


Please do not reply to this message.
I am copying the general list on this bug report so that
we can start discussion by mail.
I am then going to reply copying the bugzilla reflector
so that "reply all" will get tracked in bugzilla.

Subject: [Bug 465] New: IPoIB CM HA fails after several hours of failures
Date: Sun, 18 Mar 2007 08:45:48 +0200
From: bugzilla-daemon at lists.openfabrics.org

https://bugs.openfabrics.org/show_bug.cgi?id=465

           Summary: IPoIB CM HA fails after several hours of failures
           Product: OpenFabrics Linux
           Version: 1.2beta1
          Platform: X86-64
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: IPoIB
        AssignedTo: mst at mellanox.co.il
        ReportedBy: sweitzen at cisco.com
                CC: tziporet at mellanox.co.il


I've been trying IPoIB CM HA for a few weeks, and can't get it to run
overnight.  I've tried both SLES10 (LionCub DDR) and RHEL4 (LionMini SDR and
LionMini DDR).

I run netperf 2.4.1 with large socket buffers:

netperf241 -H 192.168.2.46 -D -l 36000 --  -s 349520 -S 349520 -m 65536

While netperf is running, I start flipping IB ports once every 10 seconds.

After a few hours, I sometimes see netperf throughput drop to almost zero:

Interim result: 1911.72 10^6bits/s over 2.52 seconds
Interim result: 4823.63 10^6bits/s over 1.00 seconds
Interim result: 4816.90 10^6bits/s over 1.00 seconds
Interim result: 4820.21 10^6bits/s over 1.00 seconds
Interim result: 4816.85 10^6bits/s over 1.00 seconds
Interim result: 4818.13 10^6bits/s over 1.00 seconds
Interim result:  324.99 10^6bits/s over 14.83 seconds
Interim result: 4811.39 10^6bits/s over 1.00 seconds
Interim result: 4817.64 10^6bits/s over 1.00 seconds
Interim result: 4812.06 10^6bits/s over 1.00 seconds
Interim result: 4809.26 10^6bits/s over 1.00 seconds
Interim result: 4817.21 10^6bits/s over 1.00 seconds
Interim result:   85.80 10^6bits/s over 56.14 seconds
Interim result: 1910.76 10^6bits/s over 2.52 seconds
Interim result: 4813.64 10^6bits/s over 1.00 seconds
Interim result: 4813.03 10^6bits/s over 1.00 seconds
Interim result: 4807.23 10^6bits/s over 1.00 seconds
Interim result: 4810.83 10^6bits/s over 1.00 seconds
Interim result: 4813.61 10^6bits/s over 1.00 seconds
Interim result:  272.39 10^6bits/s over 17.67 seconds
Interim result: 4816.57 10^6bits/s over 1.00 seconds
Interim result: 4810.02 10^6bits/s over 1.00 seconds
Interim result: 4809.88 10^6bits/s over 1.00 seconds
Interim result:   17.63 10^6bits/s over 278.01 seconds
Interim result:    0.21 10^6bits/s over 30.58 seconds
Interim result:    0.33 10^6bits/s over 14.20 seconds
Interim result:    0.45 10^6bits/s over 13.90 seconds
Interim result:    0.11 10^6bits/s over 56.20 seconds
Interim result:    0.34 10^6bits/s over 13.95 seconds
Interim result:    0.89 10^6bits/s over 14.21 seconds
Interim result:    0.11 10^6bits/s over 55.17 seconds
Interim result:    0.08 10^6bits/s over 56.20 seconds
Interim result:    0.20 10^6bits/s over 32.14 seconds
Interim result:    1.00 10^6bits/s over 6.30 seconds
Interim result:    0.37 10^6bits/s over 17.03 seconds
Interim result:    1.74 10^6bits/s over 7.25 seconds
Interim result:    0.02 10^6bits/s over 345.16 seconds
Interim result:    0.10 10^6bits/s over 112.83 seconds
Interim result:    0.45 10^6bits/s over 13.91 seconds
Interim result:    0.68 10^6bits/s over 6.91 seconds
Interim result:    0.06 10^6bits/s over 112.48 seconds
Interim result:    0.10 10^6bits/s over 60.32 seconds
Interim result:    0.43 10^6bits/s over 14.55 seconds
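
To tell the short dips that follow each port flip from the sustained collapse at the end of the log above, the interim lines can be scanned automatically. The following is a minimal sketch, not part of the original report: the function names, the 10 x 10^6 bits/s threshold, and the run length of 3 are assumptions for illustration only. It assumes the "Interim result:" line format shown above.

```python
import re

# Matches netperf -D interim lines in the format shown in this report, e.g.:
#   Interim result: 4823.63 10^6bits/s over 1.00 seconds
INTERIM_RE = re.compile(
    r"Interim result:\s*([0-9.]+)\s+10\^6bits/s over\s+([0-9.]+)\s+seconds"
)


def parse_interim(lines):
    """Return a list of (mbits_per_sec, interval_seconds) tuples."""
    results = []
    for line in lines:
        match = INTERIM_RE.search(line)
        if match:
            results.append((float(match.group(1)), float(match.group(2))))
    return results


def first_sustained_stall(results, threshold_mbps=10.0, min_run=3):
    """Return the index where the first sustained stall begins, or None.

    A stall is `min_run` consecutive intervals below `threshold_mbps`.
    Both defaults are illustrative assumptions, not values from the report.
    """
    run = 0
    for i, (mbps, _seconds) in enumerate(results):
        if mbps < threshold_mbps:
            run += 1
            if run >= min_run:
                return i - min_run + 1
        else:
            run = 0  # a healthy interval ends the current low run
    return None
```

Passing the lines of a saved netperf log to parse_interim and the result to first_sustained_stall gives the position of the first interval in the stalled run. Isolated low intervals, such as the 324.99 and 85.80 readings above, do not count as a stall under these assumed settings.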

Other times netperf hangs or fails.

Restarting netperf with the same options never works.  Sometimes I can restart
netperf with default socket buffer sizes.

----- End forwarded message -----

-- 
MST


