[ofa-general] GPFS node loses IB-connection

Scott Weitzenkamp (sweitzen) sweitzen at cisco.com
Tue May 22 15:38:48 PDT 2007


It's not so much pinging every 10 seconds as expecting a response within
10 seconds (Clive, correct me if I'm wrong).
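
To illustrate the difference, here is a minimal Python sketch, assuming the
SM only tracks when each node last answered. The class and method names are
hypothetical and this is not the Cisco SM implementation:

```python
import time


class NodeTimeoutTracker:
    """Marks a node out of service only after a full timeout of silence."""

    def __init__(self, node_timeout_s=10.0, clock=time.monotonic):
        # 10 s is the default node-timeout quoted in this thread.
        self.node_timeout_s = node_timeout_s
        self._clock = clock
        self._last_response = {}

    def record_response(self, node):
        # Any answer from the node restarts its silence window.
        self._last_response[node] = self._clock()

    def is_out_of_service(self, node):
        # A node that has never answered is not judged by this sketch.
        last = self._last_response.get(node)
        if last is None:
            return False
        # Out of service only when silent for longer than the timeout.
        return self._clock() - last > self.node_timeout_s
```

In this model a busy host is fine as long as at least one response gets
through inside each window; it is only dropped when every response is
delayed past the timeout.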

You only need to do 1) or 2), not both.  Cisco configures 1) in the OFED
binary RPMs we release at
http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to have
the host be more responsive.
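
For 1), a rough Python sketch of renicing by name follows. It assumes a
Linux host where /proc/<pid>/comm is readable and where the MAD threads
show up with a name starting with "ib_mad"; the function names are
hypothetical, and the niceness value is left to the caller because this
thread does not state one:

```python
import os


def find_pids_by_name(prefix, proc_root="/proc"):
    """Return PIDs whose command name starts with `prefix` (Linux /proc layout)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited or its comm file is not readable
        if name.startswith(prefix):
            pids.append(int(entry))
    return sorted(pids)


def renice(pids, niceness):
    """Set the nice value of each PID; lowering it normally requires root."""
    for pid in pids:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

Run as root, something like renice(find_pids_by_name("ib_mad"), niceness)
with a negative niceness would give the MAD threads more CPU priority.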


Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Koen Segers [mailto:koen.segers at VRT.BE] 
> Sent: Tuesday, May 22, 2007 3:35 PM
> To: Scott Weitzenkamp (sweitzen)
> Cc: Shirley Ma; Ami Perlmutter; 
> general at lists.openfabrics.org; general-bounces at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
> 
> If I understand it right, the switch is actually polling (=pinging) the
> interfaces every 10s. This means that when the interface is handling
> other traffic, the poll can fail and the port could be considered out of
> service. My question is then: "How can the timeout be reached while
> packets are being sent/received to/from the interface?"
> 
> Anyway, what timeout-value would you recommend for us? And why?
> 
> To recapitulate: these are the actions I'll take tomorrow
> 1) change the MAD niceness of the servers
> 2) change the timeout on the switches
> 
> Are these changes sufficient for the HCAs to keep their ports in
> PORT_ACTIVE state?
> 
> Regards,
> 
> Koen
> 
> On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > Yes, you can tune it.  Here's an example via the switch CLI:
> >  
> > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > 
> > The default is 10 seconds; it can be configured up to 2000 seconds.
> > If an HCA is completely unresponsive for longer than the node-timeout
> > value, then we consider that HCA out of service.
> >  
> > Scott Weitzenkamp
> > SQA and Release Manager
> > Server Virtualization Business Unit
> > Cisco Systems
> >  
> > 
> >         ______________________________________________________________
> >         From: Shirley Ma [mailto:xma at us.ibm.com] 
> >         Sent: Tuesday, May 22, 2007 11:30 AM
> >         To: koen.segers at VRT.BE
> >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp
> >         (sweitzen)
> >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> >         
> >         
> >         
> >         Koen,
> >         
> >         So it is most likely you hit the same bug as 229 (Scott
> >         pointed out earlier). The same workaround might work for you
> >         by renicing ib_mad as Scott suggested.
> >         
> >         I think this should be a tunable SM query timeout value in
> >         the Cisco SM. Am I right, Scott?
> >         
> >         Thanks
> >         Shirley Ma
> >         
> >         
> >         From: Koen Segers <koen.segers at VRT.BE>
> >         Sent: 05/22/07 11:14 AM
> >         Please respond to: koen.segers at VRT.BE
> >         To: Shirley Ma/Beaverton/IBM at IBMUS
> >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> >         general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> >         
> >         
> >         
> >         Hi,
> >         
> >         It is the Cisco SM. 
> >         
> >         SFS-7000P> show version
> >         
> >         
> >         
> >         ================================================================================
> >                                   System Version Information
> >         ================================================================================
> >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> >                          contact : tac at cisco.com
> >                             name : SFS-7000P
> >                         location : 170 West Tasman Drive, San Jose, CA 95134
> >                          up-time : 11(d):7(h):49(m):3(s)
> >                      last-change : none
> >                 last-config-save : none
> >                           action : none
> >                           result : none
> >                        oper-mode : normal
> >         
> >         There is also a command that gives the SM version, but I can't
> >         find it right now.
> >         
> >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> >         > Hello Koen,
> >         > 
> >         > From the switch log, it looks like an SM issue to me. The node was
> >         > kicked out of the membership. Which SM are you using in your
> >         > fabric?
> >         > 
> >         > Thanks
> >         > Shirley Ma
> >         > 
> >         *** Disclaimer ***
> >         
> >         Vlaamse Radio- en Televisieomroep
> >         Auguste Reyerslaan 52, 1043 Brussel
> >         
> >         nv van publiek recht
> >         BTW BE 0244.142.664
> >         RPR Brussel
> >         http://www.vrt.be/disclaimer
> >         
> >         
> >         
> >         
> >         


