[ofa-general] GPFS node loses IB-connection

Scott Weitzenkamp (sweitzen) sweitzen at cisco.com
Wed May 23 06:51:55 PDT 2007


No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
openib.conf, memlock in /etc/security/limits.conf, fix /etc/hosts on
SLES10 for bug 267, etc.).
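
As a rough illustration of the first two settings, here is a minimal Python
sketch that checks whether they are present. It works on the text of the
files rather than on fixed paths. The function names are hypothetical, the
memlock values in the sample are only an assumption (the thread does not say
which limits are set), and the check is not part of OFED or the Cisco RPMs.

```python
def has_renice_ib_mad(openib_conf_text):
    """Return True if an uncommented RENICE_IB_MAD=yes line is present."""
    for line in openib_conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        setting = line.split("#", 1)[0].strip()
        if setting == "RENICE_IB_MAD=yes":
            return True
    return False


def memlock_entries(limits_conf_text):
    """Return (domain, type, value) for every memlock line in limits.conf text."""
    entries = []
    for line in limits_conf_text.splitlines():
        # limits.conf lines have four fields: <domain> <type> <item> <value>
        fields = line.split("#", 1)[0].split()
        if len(fields) == 4 and fields[2] == "memlock":
            entries.append((fields[0], fields[1], fields[3]))
    return entries


if __name__ == "__main__":
    # Sample contents; the real values on a given host may differ.
    sample_openib = "ONBOOT=yes\nRENICE_IB_MAD=yes\n"
    sample_limits = "* soft memlock unlimited\n* hard memlock unlimited\n"
    print(has_renice_ib_mad(sample_openib))
    print(memlock_entries(sample_limits))
```

To use it on a real host, pass in the contents of openib.conf and
/etc/security/limits.conf read from wherever they live on that system.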

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE] 
> Sent: Wednesday, May 23, 2007 6:48 AM
> To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> Cc: Shirley Ma; Ami Perlmutter; 
> general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> So far, all tests seem to work.
> 
> Thanks for the help!
> 
> Scott,
> Does Cisco include any other bug fixes in its RPMs?
> 
> Greetz
> 
> Koen
> 
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, May 23, 2007 0:39
> To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> general-bounces at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> It's not so much pinging every 10 seconds as expecting a response within
> 10 seconds (Clive, correct me if I'm wrong).
> 
> You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
> binary RPMs we release at
> http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
> the host be more responsive.
> 
> 
> Scott Weitzenkamp
> SQA and Release Manager
> Server Virtualization Business Unit
> Cisco Systems
>  
> 
> > -----Original Message-----
> > From: Koen Segers [mailto:koen.segers at VRT.BE] 
> > Sent: Tuesday, May 22, 2007 3:35 PM
> > To: Scott Weitzenkamp (sweitzen)
> > Cc: Shirley Ma; Ami Perlmutter; 
> > general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > 
> > If I understand it right, the switch is actually polling (=pinging) the
> > interfaces every 10s. This means that when the interface is handling
> > other traffic, the poll can fail and the port could be considered out of
> > service. My question is then: "How can the timeout be reached while
> > packets are being sent/received to/from the interface?"
> > 
> > Anyway, what timeout-value would you recommend for us? And why?
> > 
> > To recapitulate: these are the actions I'll take tomorrow
> > 1) change the MAD niceness of the servers
> > 2) change the timeout on the switches
> > 
> > Are these changes sufficient for the HCAs to keep their ports in
> > PORT_ACTIVE state?
> > 
> > Regards,
> > 
> > Koen
> > 
> > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > Yes, you can tune it.  Here's an example via the switch CLI:
> > >  
> > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > 
> > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > If an HCA is completely unresponsive for longer than the node-timeout
> > > value, then we consider that HCA out of service.
> > >  
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > >  
> > > 
> > >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > >         To: koen.segers at VRT.BE
> > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> > >         (sweitzen)
> > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >         
> > >         
> > >         
> > >         Koen,
> > >         
> > >         So it is most likely you hit the same bug as 229 (Scott
> > >         pointed out earlier). The same workaround might work for you
> > >         by renicing ib_mad as Scott suggested.
> > >         
> > >         I think this should be a tunable SM query timeout value in
> > >         the Cisco SM. Am I right, Scott?
> > >         
> > >         Thanks
> > >         Shirley Ma
> > >         
> > >         
> > >         From: Koen Segers <koen.segers at VRT.BE>
> > >         Sent: 05/22/07 11:14 AM
> > >         Reply-To: koen.segers at VRT.BE
> > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > >         general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >         
> > >         
> > >         
> > >         Hi,
> > >         
> > >         It is the Cisco SM. 
> > >         
> > >         SFS-7000P> show version
> > >         
> > >         ================================================================================
> > >                                   System Version Information
> > >         ================================================================================
> > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > >                          contact : tac at cisco.com
> > >                             name : SFS-7000P
> > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > >                          up-time : 11(d):7(h):49(m):3(s)
> > >                      last-change : none
> > >                 last-config-save : none
> > >                           action : none
> > >                           result : none
> > >                        oper-mode : normal
> > >         
> > >         There is also a command that gives the SM version, but I can't
> > >         find it right now. 
> > >         
> > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > >         > Hello Koen,
> > >         > 
> > >         > From the switch log, it looks like an SM issue to me. The node was
> > >         > kicked out of the membership. Which SM are you using in your
> > >         > fabric? 
> > >         > 
> > >         > Thanks
> > >         > Shirley Ma
> > >         > 
> > >         *** Disclaimer ***
> > >         
> > >         Vlaamse Radio- en Televisieomroep
> > >         Auguste Reyerslaan 52, 1043 Brussel
> > >         
> > >         nv van publiek recht
> > >         BTW BE 0244.142.664
> > >         RPR Brussel
> > >         http://www.vrt.be/disclaimer
> > >         
> > >         
> > >         
> > >         
> > >         
> 
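
The node-timeout behaviour described in the quoted thread (an HCA is
considered out of service once it has been completely unresponsive for longer
than node-timeout) can be modelled with a small Python sketch. This is only an
illustration of that rule, not the Cisco SM's actual algorithm; the function
name and the timestamps are made up for the example.

```python
def node_state(response_times, now, node_timeout=10.0):
    """Classify a node from the times (in seconds) at which it answered the SM.

    The node stays "active" as long as its most recent response is no older
    than node_timeout seconds; how often the SM asks does not matter here.
    """
    if not response_times:
        return "out-of-service"
    silent_for = now - max(response_times)
    return "out-of-service" if silent_for > node_timeout else "active"


if __name__ == "__main__":
    # Busy host: last answer at t=4s, checked at t=15s (silent for 11s).
    print(node_state([0.0, 4.0], now=15.0))
    # Same host with the timeout raised to 30s on the switch.
    print(node_state([0.0, 4.0], now=15.0, node_timeout=30.0))
```

With the 10-second default the first call reports the node as out of service.
Raising the timeout on the switch, or making the host answer sooner (the
renice approach), keeps it active.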


