[ofa-general] GPFS node loses IB-connection

SEGERS Koen Koen.SEGERS at VRT.BE
Thu May 24 11:03:22 PDT 2007


After changing the switch timeout value, we never saw the error again. Yesterday we started a 24-hour stress test, and it completed successfully. I think we can conclude that the problem is now fixed.
 
But there are some strange messages in the switch logs:

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

 

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

 

Here xx and yy stand for GIDs such as ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9. The GID changes from one group of log entries to the next, and each one belongs to an IB port of one of the server HCAs.
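
One way to read such a GID, offered as a sketch and not as a statement about how the Topspin SM works: its layout matches the IPoIB multicast GID format from RFC 4391 (0xff, flags and scope, a signature of 0x601b for IPv6 or 0x401b for IPv4, the P_Key, then the group ID). The helper name decode_ipoib_mgid is hypothetical, and the code assumes that an IPv6 group is link-local (ff02::/16), because the GID only carries the low 80 bits of the IPv6 address.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # IPoIB signature for IPv4 groups (RFC 4391)
IPV6_SIGNATURE = 0x601B  # IPoIB signature for IPv6 groups (RFC 4391)


def decode_ipoib_mgid(mgid: str) -> dict:
    """Split a colon-separated 16-byte multicast GID into its IPoIB fields."""
    raw = bytes(int(part, 16) for part in mgid.split(":"))
    if len(raw) != 16 or raw[0] != 0xFF:
        raise ValueError("expected a 16-byte GID starting with ff")

    signature = int.from_bytes(raw[2:4], "big")
    info = {
        "flags": raw[1] >> 4,
        "scope": raw[1] & 0x0F,
        "pkey": int.from_bytes(raw[4:6], "big"),
        "protocol": None,
        "group": None,
        "solicited_node": False,
    }
    if signature == IPV6_SIGNATURE:
        info["protocol"] = "IPv6"
        # Only the low 80 bits of the IPv6 group address are in the GID.
        # Assumption: the group is link-local, so the prefix is ff02:0:0.
        addr = bytes([0xFF, 0x02, 0, 0, 0, 0]) + raw[6:16]
        info["group"] = str(ipaddress.IPv6Address(addr))
        # Solicited-node groups have the form ff02::1:ffXX:XXXX.
        info["solicited_node"] = raw[6:13] == bytes([0, 0, 0, 0, 0, 1, 0xFF])
    elif signature == IPV4_SIGNATURE:
        info["protocol"] = "IPv4"
        # The GID carries the low 28 bits of the IPv4 group address.
        low28 = int.from_bytes(raw[12:16], "big") & 0x0FFFFFFF
        info["group"] = str(ipaddress.IPv4Address(0xE0000000 | low28))
    return info


decoded = decode_ipoib_mgid("ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9")
print(decoded)
```

For the GID quoted above, this reports an IPv6 group on P_Key 0xffff whose low bits follow the solicited-node pattern. If that reading is right, the joins and leaves would come from IPv6 neighbor discovery on the IPoIB interfaces and not from application-level multicast, but that is an inference and is not confirmed anywhere in this thread.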

These entries appear every few minutes, though not at a regular interval. Is there a Cisco manual available somewhere that describes or explains these messages? Or can anyone explain what is happening, and whether it can harm a setup that doesn't use multicast?
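
To see which ports are involved and how often, the trap lines can be tallied per GID. This is a minimal sketch that assumes the message format shown above; the function name count_mc_traps is hypothetical, and the sample lines reuse the xx/yy placeholders, not real GIDs.

```python
import re
from collections import Counter

# Matches the trap lines shown in the switch log excerpt above.
TRAP_RE = re.compile(r"Generate SM (CREATE|DELETE)_MC_GROUP trap for GID=(\S+)")


def count_mc_traps(lines):
    """Count CREATE and DELETE multicast-group traps per GID."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match:
            action, gid = match.groups()
            counts[(gid, action)] += 1
    return counts


sample_log = [
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx",
    "Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy",
]

trap_counts = count_mc_traps(sample_log)
for (gid, action), n in sorted(trap_counts.items()):
    print(f"{gid} {action}: {n}")
```

In practice the list would be replaced by the lines of the exported switch log; matching DELETE and CREATE counts per GID would suggest a group that is repeatedly torn down and re-created.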

Greetz

Koen


________________________________

From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
Sent: Wed 23/05/2007 17:40
To: SEGERS Koen; Hal Rosenstock
Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection



Try 20 seconds, I'm curious whether you are barely crossing the 10-second
threshold.

Scott

> -----Original Message-----
> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> Sent: Wednesday, May 23, 2007 8:39 AM
> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> Cc: Clive Hall (clivhall);
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> What value would you recommend then?
>
> Koen
>
> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> Sent: Wednesday, May 23, 2007 17:38
> To: SEGERS Koen; Hal Rosenstock
> Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> The boot time of the host doesn't matter for this timeout.  While the
> host is booting, the IB link is down anyway.
>
> Scott
>
> > -----Original Message-----
> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > Sent: Wednesday, May 23, 2007 8:20 AM
> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > Cc: Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > After a whole day of stress testing with MAD renicing turned on, we
> > got the error once. So I think I should also raise the timeout on
> > the switch.
> >
> > It takes about 2 minutes to boot the system. Do you agree that this
> > is a good value for the timeout?
> >
> > Scott,
> > Can you explain the memlock problem to me?
> >
> > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > install it, the bug does not affect us. That is correct, isn't it?
> >
> > Greetz
> >
> > Koen
> >
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > Sent: Wednesday, May 23, 2007 16:12
> > To: Scott Weitzenkamp (sweitzen)
> > Cc: SEGERS Koen; Clive Hall (clivhall);
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > No C code changes, just a few config file changes
> > > (RENICE_IB_MAD=yes in openib.conf,
> >
> > Does the host really not respond to MAD requests for over 10
> > seconds in some cases?
> >
> > -- Hal
> >
> > >  memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > SLES10 for bug 267, etc.).
> > >
> > > Scott Weitzenkamp
> > > SQA and Release Manager
> > > Server Virtualization Business Unit
> > > Cisco Systems
> > > 
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > So far, all tests seem to work.
> > > >
> > > > Thanks for the help!
> > > >
> > > > Scott,
> > > > Does Cisco include any other bug fixes in its RPMs?
> > > >
> > > > Greetz
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, May 23, 2007 0:39
> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > general-bounces at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > It's not so much pinging every 10 seconds as expecting a response
> > > > within 10 seconds (Clive, correct me if I'm wrong).
> > > >
> > > > You only need to do 1) or 2), not both.  Cisco configures 1) in the
> > > > OFED binary RPMs we release at
> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux.  I prefer to
> > > > have the host be more responsive.
> > > >
> > > >
> > > > Scott Weitzenkamp
> > > > SQA and Release Manager
> > > > Server Virtualization Business Unit
> > > > Cisco Systems
> > > > 
> > > >
> > > > > -----Original Message-----
> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > general at lists.openfabrics.org;
> > general-bounces at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > If I understand it right, the switch is actually polling
> > > > > (= pinging) the interfaces every 10 s. This means that when an
> > > > > interface is handling other traffic, the poll can fail and the
> > > > > port could be considered out of service. My question is then:
> > > > > "How can the timeout be reached while packets are being
> > > > > sent/received to/from the interface?"
> > > > >
> > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > >
> > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > 1) change the MAD niceness of the servers
> > > > > 2) change the timeout on the switches
> > > > >
> > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > PORT_ACTIVE state?
> > > > >
> > > > > Regards,
> > > > >
> > > > > Koen
> > > > >
> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > Yes, you can tune it.  Here's an example via the switch CLI:
> > > > > > 
> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > >
> > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > value, then we consider that HCA out of service.
> > > > > > 
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > > 
> > > > > >
> > > > > >         ______________________________________________________________
> > > > > >         From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > >         Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > >         To: koen.segers at VRT.BE
> > > > > >         Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > >         general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >        
> > > > > >         Koen,
> > > > > >        
> > > > > >         So it is most likely that you hit the same bug as 229 (Scott
> > > > > >         pointed it out earlier). The same workaround, renicing ib_mad
> > > > > >         as Scott suggested, might work for you.
> > > > > >        
> > > > > >         I think this should be a tunable SM query timeout value in
> > > > > >         the Cisco SM. Am I right, Scott?
> > > > > >        
> > > > > >         Thanks
> > > > > >         Shirley Ma
> > > > > >        
> > > > > >        
> > > > > >         From: Koen Segers <koen.segers at VRT.BE>
> > > > > >         Sent: 05/22/07 11:14 AM
> > > > > >         Please respond to: koen.segers at VRT.BE
> > > > > >         To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > >         Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > >         general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > >         Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >        
> > > > > >        
> > > > > >         Hi,
> > > > > >        
> > > > > >         It is the Cisco SM.
> > > > > >        
> > > > > >         SFS-7000P> show version
> > > > > >        
> > > > > >         ================================================================================
> > > > > >                                   System Version Information
> > > > > >         ================================================================================
> > > > > >                   system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > >                          contact : tac at cisco.com
> > > > > >                             name : SFS-7000P
> > > > > >                         location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > >                          up-time : 11(d):7(h):49(m):3(s)
> > > > > >                      last-change : none
> > > > > >                 last-config-save : none
> > > > > >                           action : none
> > > > > >                           result : none
> > > > > >                        oper-mode : normal
> > > > > >        
> > > > > >         There is also a command that gives the SM version, but I can't
> > > > > >         find it right now.
> > > > > >        
> > > > > >         On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > >         > Hello Koen,
> > > > > >         >
> > > > > >         > From the switch log, it looks like an SM issue to me. The node
> > > > > >         > was kicked out of the membership. Which SM are you using in
> > > > > >         > your fabric?
> > > > > >         >
> > > > > >         > Thanks
> > > > > >         > Shirley Ma
> > > > > >         >
> > > > > >         *** Disclaimer ***
> > > > > >        
> > > > > >         Vlaamse Radio- en Televisieomroep
> > > > > >         Auguste Reyerslaan 52, 1043 Brussel
> > > > > >        
> > > > > >         nv van publiek recht
> > > > > >         BTW BE 0244.142.664
> > > > > >         RPR Brussel
> > > > > >         http://www.vrt.be/disclaimer
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > >        
> > > > > >        
> >


