[ofa-general] GPFS node loses IB-connection

Koen Segers koen.segers at vrt.be
Tue May 22 11:17:25 PDT 2007


On Tue, 2007-05-22 at 08:34 -0700, Scott Weitzenkamp (sweitzen) wrote:
> What server model and CPU model do you have?

cat /proc/cpuinfo
processor       : 7
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8218
stepping        : 2
cpu MHz         : 2600.202
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy
bogomips        : 5200.54
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

>  
> This could be https://bugs.openfabrics.org//show_bug.cgi?id=229.  Try
> setting RENICE_IB_MAD=yes in /etc/infiniband/openibd.conf, then reboot
> or run /etc/init.d/openibd restart, and see if that helps.

AHA, this is interesting. I'll do it tomorrow!
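This is a minimal sketch of the change Scott suggests, written as a POSIX shell function. It assumes that openibd.conf uses plain KEY=value lines and that GNU sed (with `-i`) is available. The function name `set_renice_ib_mad` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: make sure RENICE_IB_MAD=yes is set in openibd.conf.
# Assumes simple KEY=value lines and GNU sed (for in-place editing with -i).
set_renice_ib_mad() {
    conf="${1:-/etc/infiniband/openibd.conf}"
    if grep -q '^RENICE_IB_MAD=' "$conf"; then
        # Key already present: overwrite whatever value it has.
        sed -i 's/^RENICE_IB_MAD=.*/RENICE_IB_MAD=yes/' "$conf"
    else
        # Key missing: append it at the end of the file.
        echo 'RENICE_IB_MAD=yes' >> "$conf"
    fi
}
```

On the node, you would run `set_renice_ib_mad` followed by `/etc/init.d/openibd restart`, or reboot, as Scott notes.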

>  
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>
>         ______________________________________________________________
>         From: general-bounces at lists.openfabrics.org
>         [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
>         SEGERS Koen
>         Sent: Tuesday, May 22, 2007 6:44 AM
>         To: Ami Perlmutter; Shirley Ma
>         Cc: general-bounces at lists.openfabrics.org;
>         general at lists.openfabrics.org
>         Subject: RE: [ofa-general] GPFS node loses IB-connection
>
>         I did the iperf tests on servers with OFED-1.2-RC3.
>
>         It also gives the same result. Actually, it is even worse:
>         when the interface dies, it goes into the PORT_INIT state but
>         does not return to PORT_ACTIVE, at least not within 10
>         minutes.
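To measure how long a port stays in INIT, you could poll the port state that the kernel IB stack exposes in sysfs. The sketch below assumes the HCA appears as `mthca0` and the link is on port 1. Both are assumptions, so check `ls /sys/class/infiniband` on the node. The helper `wait_port_active` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll an IB port's sysfs state file until it reads ACTIVE.
# Usage: wait_port_active STATE_FILE TIMEOUT_SECONDS
# The file normally holds one line such as "4: ACTIVE" or "2: INIT".
wait_port_active() {
    state_file="$1"
    timeout="$2"
    elapsed=0
    state=""
    while [ "$elapsed" -lt "$timeout" ]; do
        # Second field is the state name, e.g. ACTIVE or INIT.
        state=$(awk '{print $2}' "$state_file")
        if [ "$state" = "ACTIVE" ]; then
            echo "ACTIVE after ${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "still ${state:-unknown} after ${timeout}s"
    return 1
}
```

For example, `wait_port_active /sys/class/infiniband/mthca0/ports/1/state 600` would wait up to the 10 minutes mentioned above (the device path is an assumption).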
>
>         Here is the test script I ran:
>
>         ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5001 &
>         ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5002 &
>         ssh 10.224.158.114 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 5003 &
>
>         ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6001 &
>         ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6002 &
>         ssh 10.224.158.115 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 6003 &
>
>         ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7001 &
>         ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7002 &
>         ssh 10.224.158.116 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 7003 &
>
>         ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8001 &
>         ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8002 &
>         ssh 10.224.158.117 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -s -p 8003 &
>
>         sleep 5
>
>         for i in 14 15 16 17
>         do
>                 ssh 10.224.158.111 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))001 -t 120 -d -P 5 &
>                 ssh 10.224.158.112 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))002 -t 120 -d -P 5 &
>                 ssh 10.224.158.113 LD_PRELOAD=libsdp.so SIMPLE_LIBSDP=OK iperf -c 192.168.2.$i -p $((i-9))003 -t 120 -d -P 5 &
>         done
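The client loop derives each server port from the last octet of the target address. For 192.168.2.$i the port prefix is $((i-9)), so .14 maps to 5001-5003 and .17 maps to 8001-8003. These match the iperf servers started on 10.224.158.114 through .117. From that pairing, 192.168.2.x appears to be the IPoIB address of 10.224.158.11x. That is an inference; the mail does not state it. The small sketch below prints the resulting client-to-server pairs. The name `print_port_map` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list which client host targets which server address/port
# in the iperf loop above. Port prefix is (last octet - 9): .14 -> 5, .17 -> 8.
print_port_map() {
    for i in 14 15 16 17; do
        base=$((i - 9))
        # Clients 10.224.158.111..113 use port suffixes 001..003 respectively.
        for c in 1 2 3; do
            echo "10.224.158.11$c -> 192.168.2.$i:${base}00$c"
        done
    done
}
```

Running `print_port_map` prints 12 lines, starting with `10.224.158.111 -> 192.168.2.14:5001`.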
>
>         Any ideas?
>
>         Regards,
>
>         Koen
>
>         ______________________________________________________________
>         From: general-bounces at lists.openfabrics.org
>         [mailto:general-bounces at lists.openfabrics.org] On Behalf Of
>         SEGERS Koen
>         Sent: Tuesday, May 22, 2007 10:55 AM
>         To: Ami Perlmutter; Shirley Ma
>         Cc: general-bounces at lists.openfabrics.org;
>         general at lists.openfabrics.org
>         Subject: RE: [ofa-general] GPFS node loses IB-connection
>
>         GPFS keeps its connection constantly open.
>
>         We did some more tests with iperf:
>
>         If we don’t run bidirectional tests (iperf -d), all
>         connections keep running smoothly. If we add bidirectional
>         tests, the connections become unstable, especially when this
>         is done on multiple nodes. Is this normal?
>
>         The failed iperf tests give the same error in the switch log:
>
>         May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM OUT_OF_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
>         May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71
>         May 22 08:14:59 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering removed ports
>         May 22 08:15:00 topspin-120sc ib_sm.x[621]: %IB-6-INFO: Program switch port state to down, node=00:05:ad:00:00:0b:a2:cc, port= 6, due to non-responding CA
>         May 22 08:15:00 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port down - port=1/6, type=ib4xTXP
>         May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: in portTblFindEntry() - IfIndex=70(1/6)
>         May 22 08:15:00 topspin-120sc diag_mgr.x[508]: %DIAG-6-INFO: cannot find entry - IfIndex=70(1/6)
>         May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by discovering new ports
>         May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
>         May 22 08:15:04 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Generate SM IN_SERVICE trap for GID=fe:80:00:00:00:00:00:00:00:05:ad:00:00:08:a8:71
>         May 22 08:15:05 topspin-120sc port_mgr.x[497]: %PORT-6-INFO: port up - port=1/6, type=ib4xTXP
>         May 22 08:15:07 topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:08:a8:71
>         May 22 08:15:08 topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
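The port_mgr lines in the switch log bracket the outage: port 1/6 goes down at 08:15:00 and comes back up at 08:15:05. The sketch below uses awk to pull only those events out of a longer switch log. It assumes one message per line, as the switch writes them; the archived mail wraps them. The helper name `port_events` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print DOWN/UP events for one switch port from a log file.
# Usage: port_events LOGFILE PORT   (e.g. port_events switch.log 1/6)
# Assumes one syslog message per line, as in the port_mgr lines above.
port_events() {
    awk -v tag="port=$2," '
        # The trailing comma keeps "1/1" from also matching "1/10".
        index($0, "port down - " tag) { print $1, $2, $3, "DOWN" }
        index($0, "port up - " tag)   { print $1, $2, $3, "UP" }
    ' "$1"
}
```

Run against the excerpt above, `port_events switch.log 1/6` should print one DOWN line at 08:15:00 and one UP line at 08:15:05.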
>
>         RC3 has just been installed. Results will follow soon.
>
>         Regards,
>
>         Koen
>
>         ______________________________________________________________
>         From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
>         Sent: Tuesday, May 22, 2007 10:33 AM
>         To: Shirley Ma
>         Cc: SEGERS Koen; general-bounces at lists.openfabrics.org;
>         general at lists.openfabrics.org
>         Subject: Re: [ofa-general] GPFS node loses IB-connection
>
>         Does the application constantly open and close connections?
>
>         *** Disclaimer ***
>         
>         Vlaamse Radio- en Televisieomroep
>         Auguste Reyerslaan 52, 1043 Brussel
>         
>         nv van publiek recht
>         BTW BE 0244.142.664
>         RPR Brussel
>         http://www.vrt.be/disclaimer
>         