[ofa-general] GPFS node loses IB-connection

Clive Hall (clivhall) clivhall at cisco.com
Thu May 24 13:37:48 PDT 2007


Those particular log messages are just informational messages.  They're
logged when multicast groups are created (when the first group member
joins) and when multicast groups are deleted (when the last group member
leaves).
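
As a rough illustration of that lifecycle, here is a toy model in Python. It is only a sketch of the bookkeeping described above, not the Cisco SM's actual code; the class and method names are made up for the example.

```python
from collections import defaultdict


class McGroupTable:
    """Toy model of multicast group bookkeeping (hypothetical, for illustration)."""

    def __init__(self):
        self.members = defaultdict(set)  # MGID -> set of member port names
        self.events = []                 # informational messages, in order

    def join(self, mgid, port):
        if not self.members[mgid]:
            # First member: the group comes into existence.
            self.events.append(("CREATE_MC_GROUP", mgid))
        self.members[mgid].add(port)

    def leave(self, mgid, port):
        group = self.members.get(mgid)
        if not group or port not in group:
            return  # unknown group or non-member: nothing to do
        group.discard(port)
        if not group:
            # Last member gone: the group is deleted.
            del self.members[mgid]
            self.events.append(("DELETE_MC_GROUP", mgid))


table = McGroupTable()
table.join("mgid-a", "port-1")   # first member    -> CREATE message
table.join("mgid-a", "port-2")   # second member   -> no message
table.leave("mgid-a", "port-1")  # one member left -> no message
table.leave("mgid-a", "port-2")  # last member     -> DELETE message
print(table.events)
```

A host that repeatedly leaves and rejoins a group in which it is the only member would therefore produce alternating DELETE/CREATE messages for that group.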
 
As Shirley said, if you're not using IPv6 anyway then those messages
aren't harmful.  Even if you are using IPv6 it will quite possibly still
be fine, although I don't know why hosts would be leaving/rejoining the
multicast groups.
 
Clive.
 



________________________________

	From: general-bounces at lists.openfabrics.org On Behalf Of Shirley Ma
	Sent: Thursday, May 24, 2007 11:16 AM
	To: SEGERS Koen
	Cc: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection
	
	

	Koen,
	
	Are you using IPv6? If not, then these messages are not harmful. If you
	don't use it, you can simply disable loading of the IPv6 module on your
	nodes at the next reboot.
	
	Thanks
	Shirley Ma
	IBM Linux Technology Center
	15300 SW Koll Parkway
	Beaverton, OR 97006-6063
	Phone(Fax): (503) 578-7638
	
	
	From: "SEGERS Koen" <Koen.SEGERS at VRT.BE>
	Sent by: general-bounces at lists.openfabrics.org
	Date: 05/24/07 11:03 AM
	To: "Scott Weitzenkamp (sweitzen)" <sweitzen at cisco.com>, "Hal Rosenstock" <halr at voltaire.com>
	Cc: general-bounces at lists.openfabrics.org, general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection

	After changing the switch timeout value, we never got the error
	again. Yesterday we started a 24-hour stress test, which was successful.
	I think we can conclude that the problem is now fixed.
	
	But there is a strange message in the logs of the switch:

	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=yy
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change
	Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change

	Here xx and yy are GIDs such as
	ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9. The GIDs differ in each
	subsequent group of log entries, and each one belongs to an IB port of
	one of the server HCAs.
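
The example GID can be taken apart with a few lines of Python. This is a sketch that assumes the GID follows the IPoIB multicast GID layout from RFC 4391 (0xff, flags/scope, a 16-bit signature, the P_Key, then the low 80 bits of the IP multicast address); the function name is invented for the example, and rebuilding the IPv6 address from the scope nibble is an assumption of mine rather than something stated in this thread.

```python
import ipaddress

# Signature values as defined for IP over InfiniBand (RFC 4391).
SIGNATURES = {0x401B: "IPv4", 0x601B: "IPv6"}


def decode_ipoib_mgid(mgid_text):
    """Split a colon-separated 16-byte multicast GID into its IPoIB fields."""
    raw = bytes(int(part, 16) for part in mgid_text.split(":"))
    if len(raw) != 16 or raw[0] != 0xFF:
        raise ValueError("not a 16-byte multicast GID")
    fields = {
        "flags": raw[1] >> 4,
        "scope": raw[1] & 0x0F,
        "protocol": SIGNATURES.get(int.from_bytes(raw[2:4], "big"), "unknown"),
        "pkey": int.from_bytes(raw[4:6], "big"),
        "group_id": raw[6:],  # low 80 bits of the IP multicast address
    }
    if fields["protocol"] == "IPv6":
        # Assumption: rebuild the IPv6 group as ff0<scope>, zero padding, group ID.
        addr = bytes([0xFF, fields["scope"]]) + bytes(4) + fields["group_id"]
        fields["ipv6_group"] = str(ipaddress.IPv6Address(addr))
    return fields


info = decode_ipoib_mgid("ff:12:60:1b:ff:ff:00:00:00:00:00:01:ff:05:87:d9")
print(info["protocol"], hex(info["pkey"]), info["ipv6_group"])
```

Under these assumptions the example GID carries the IPv6 signature and maps to a group in the ff02::1:ff00:0/104 solicited-node range, which would be consistent with the question about IPv6 raised in the reply above.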

	These log entries appear every few minutes (not at a regular
	interval). Is there a Cisco manual available somewhere that describes or
	explains these messages? Or can anyone explain what is happening, and
	whether this can harm a setup that doesn't use multicast?
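
To see which GIDs are flapping and how often, the trap messages can be tallied with a short script. This is a minimal sketch: the regular expression is based only on the sample lines quoted in this message, and real switch logs may carry a timestamp prefix or other fields that are not shown here. The function name is hypothetical.

```python
import re
from collections import Counter

# Pattern derived from the sample log lines only; adjust for real log files.
TRAP_RE = re.compile(
    r"ib_sm\.x\[\d+\]: %IB-6-INFO: Generate SM "
    r"(?P<kind>CREATE|DELETE)_MC_GROUP trap for GID=(?P<gid>\S+)"
)


def tally_traps(lines):
    """Count CREATE/DELETE multicast-group traps per GID."""
    counts = Counter()
    for line in lines:
        match = TRAP_RE.search(line)
        if match:
            counts[(match["gid"], match["kind"])] += 1
    return counts


sample = [
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=xx",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM CREATE_MC_GROUP trap for GID=xx",
    "Topspin-120sc ib_sm.x[618]: %IB-6-INFO: Configuration caused by multicast membership change",
    "Topspin-120sc ib_sm.x[632]: %IB-6-INFO: Generate SM DELETE_MC_GROUP trap for GID=yy",
]

counts = tally_traps(sample)
for (gid, kind), n in sorted(counts.items()):
    print(gid, kind, n)
```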

	Greetz 

	Koen 

	
	
________________________________

	From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	Sent: Wed 23/05/2007 17:40
	To: SEGERS Koen; Hal Rosenstock
	CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	Subject: RE: [ofa-general] GPFS node loses IB-connection
	

	Try 20 seconds. I'm curious whether you are barely crossing the
	10-second threshold.
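
A small sketch of why the margin matters, assuming the rule described in this thread (an HCA that has not responded within node-timeout seconds is considered out of service). The delay values below are invented for illustration and are not measurements from this fabric; the function name is hypothetical.

```python
def late_responses(response_delays, node_timeout):
    """Return indexes of responses that took longer than node_timeout seconds."""
    return [i for i, delay in enumerate(response_delays) if delay > node_timeout]


# Hypothetical response delays in seconds, as seen by the SM for one HCA.
delays = [0.2, 0.5, 11.0, 0.3, 12.5, 0.4]

for timeout in (10, 20):
    late = late_responses(delays, timeout)
    print(f"node-timeout={timeout}s -> {len(late)} late response(s): {late}")
```

If the host only occasionally takes slightly longer than 10 seconds to answer, a 20-second timeout would hide the problem, whereas much longer stalls would still show up.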
	
	Scott
	
	> -----Original Message-----
	> From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
	> Sent: Wednesday, May 23, 2007 8:39 AM
	> To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
	> Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
	> general at lists.openfabrics.org
	> Subject: RE: [ofa-general] GPFS node loses IB-connection
	>
	> What value would you recommend then?
	>
	> Koen
	>
	> -----Original Message-----
	> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	> Sent: Wednesday, 23 May 2007 17:38
	> To: SEGERS Koen; Hal Rosenstock
	> CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
	> general at lists.openfabrics.org
	> Subject: RE: [ofa-general] GPFS node loses IB-connection
	>
	> The boot time of the host doesn't matter for this timeout. While the
	> host is booting, the IB link is down anyway.
	>
	> Scott
	>
	> > -----Original Message-----
	> > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
	> > Sent: Wednesday, May 23, 2007 8:20 AM
	> > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
	> > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
	> > general at lists.openfabrics.org
	> > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> >
	> > After a whole day of stress testing with the MAD renicing turned on,
	> > we got the error once. So I think I should also raise the timeout on
	> > the switch.
	> >
	> > It takes about 2 minutes to boot the system. Do you agree that this
	> > is a good value for the timeout?
	> >
	> > Scott,
	> > Can you explain the memlock problem to me?
	> >
	> > I saw that the SLES10 bug is only an issue in MVAPICH. Since we
	> > didn't install MVAPICH, the bug does not affect us. This is correct,
	> > isn't it?
	> >
	> > Greetz
	> >
	> > Koen
	> >
	> > -----Original Message-----
	> > From: Hal Rosenstock [mailto:halr at voltaire.com]
	> > Sent: Wednesday, 23 May 2007 16:12
	> > To: Scott Weitzenkamp (sweitzen)
	> > CC: SEGERS Koen; Clive Hall (clivhall);
	> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
	> > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> >
	> > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
	> > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes
	> > > in openib.conf,
	> >
	> > Does the host really not respond to MAD requests for over 10 seconds
	> > in some cases?
	> >
	> > -- Hal
	> >
	> > > memlock in /etc/security/limits.conf, fix /etc/hosts on
	> > > SLES10 for bug 267, etc.).
	> > >
	> > > Scott Weitzenkamp
	> > > SQA and Release Manager
	> > > Server Virtualization Business Unit
	> > > Cisco Systems
	> > > 
	> > >
	> > > > -----Original Message-----
	> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
	> > > > Sent: Wednesday, May 23, 2007 6:48 AM
	> > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
	> > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
	> > > > general-bounces at lists.openfabrics.org
	> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > >
	> > > > So far, all tests seem to work.
	> > > >
	> > > > Thanks for the help!
	> > > >
	> > > > Scott,
	> > > > Does Cisco include any other bug fixes in its RPMs?
	> > > >
	> > > > Greetz
	> > > >
	> > > > Koen
	> > > >
	> > > > -----Original Message-----
	> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
	> > > > Sent: Wednesday, 23 May 2007 0:39
	> > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
	> > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
	> > > > general-bounces at lists.openfabrics.org
	> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > >
	> > > > It's not so much pinging every 10 seconds as expecting a response
	> > > > within 10 seconds (Clive, correct me if I'm wrong).
	> > > >
	> > > > You only need to do 1) or 2), not both. Cisco configures 1) in the
	> > > > OFED binary RPMs we release at
	> > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux . I prefer to
	> > > > have the host be more responsive.
	> > > >
	> > > >
	> > > > Scott Weitzenkamp
	> > > > SQA and Release Manager
	> > > > Server Virtualization Business Unit
	> > > > Cisco Systems
	> > > > 
	> > > >
	> > > > > -----Original Message-----
	> > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
	> > > > > Sent: Tuesday, May 22, 2007 3:35 PM
	> > > > > To: Scott Weitzenkamp (sweitzen)
	> > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
	> > > > > general-bounces at lists.openfabrics.org
	> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > >
	> > > > > If I understand it right, the switch is actually polling
	> > > > > (= pinging) the interfaces every 10 s. This means that when the
	> > > > > interface is handling other traffic, the poll can fail and the
	> > > > > port could be considered out of service. My question is then:
	> > > > > "How can the timeout be reached while packets are being
	> > > > > sent/received to/from the interface?"
	> > > > >
	> > > > > Anyway, what timeout value would you recommend for us? And why?
	> > > > >
	> > > > > To recapitulate, these are the actions I'll take tomorrow:
	> > > > > 1) change the MAD niceness of the servers
	> > > > > 2) change the timeout on the switches
	> > > > >
	> > > > > Are these changes sufficient for the HCAs to keep their ports in
	> > > > > PORT_ACTIVE state?
	> > > > >
	> > > > > Regards,
	> > > > >
	> > > > > Koen
	> > > > >
	> > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
	> > > > > > Yes, you can tune it. Here's an example via the switch CLI:
	> > > > > >
	> > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
	> > > > > >
	> > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
	> > > > > > If an HCA is completely unresponsive for longer than the
	> > > > > > node-timeout value, then we consider that HCA out of service.
	> > > > > > 
	> > > > > > Scott Weitzenkamp
	> > > > > > SQA and Release Manager
	> > > > > > Server Virtualization Business Unit
	> > > > > > Cisco Systems
	> > > > > > 
	> > > > > >
	> > > > > > 
	> > > > > > 
	> > > > > > ______________________________________________________________
	> > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
	> > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
	> > > > > > To: koen.segers at VRT.BE
	> > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
	> > > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
	> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > Koen,
	> > > > > > 
	> > > > > > So it is most likely that you hit the same bug as 229 (Scott
	> > > > > > pointed it out earlier). The same workaround, renicing ib_mad as
	> > > > > > Scott suggested, might work for you.
	> > > > > >
	> > > > > > I think this should be an SM query timeout tunable value in the
	> > > > > > Cisco SM. Am I right, Scott?
	> > > > > > 
	> > > > > > Thanks
	> > > > > > Shirley Ma
	> > > > > > 
	> > > > > > 
	> > > > > > From: Koen Segers <koen.segers at VRT.BE>
	> > > > > > Date: 05/22/07 11:14 AM
	> > > > > > Please respond to: koen.segers at VRT.BE
	> > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
	> > > > > > Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
	> > > > > > general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
	> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
	> > > > > > 
	> > > > > > 
	> > > > > > Hi,
	> > > > > > 
	> > > > > > It is the Cisco SM.
	> > > > > > 
	> > > > > > SFS-7000P> show version
	> > > > > >
	> > > > > > ================================================================================
	> > > > > > System Version Information
	> > > > > > ================================================================================
	> > > > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
	> > > > > > contact : tac at cisco.com
	> > > > > > name : SFS-7000P
	> > > > > > location : 170 West Tasman Drive, San Jose, CA 95134
	> > > > > > up-time : 11(d):7(h):49(m):3(s)
	> > > > > > last-change : none
	> > > > > > last-config-save : none
	> > > > > > action : none
	> > > > > > result : none
	> > > > > > oper-mode : normal
	> > > > > >
	> > > > > > There is also a command that gives the SM version, but I can't
	> > > > > > find it right now.
	> > > > > > 
	> > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
	> > > > > > > Hello Koen,
	> > > > > > >
	> > > > > > > From the switch log, it looks like an SM issue to me. The node
	> > > > > > > was kicked out of the membership. Which SM are you using in
	> > > > > > > your fabric?
	> > > > > > >
	> > > > > > > Thanks
	> > > > > > > Shirley Ma
	> > > > > > >
	> > > > > > *** Disclaimer ***
	> > > > > > 
	> > > > > > Vlaamse Radio- en Televisieomroep
	> > > > > > Auguste Reyerslaan 52, 1043 Brussel
	> > > > > > 
	> > > > > > nv van publiek recht
	> > > > > > BTW BE 0244.142.664
	> > > > > > RPR Brussel
	> > > > > > http://www.vrt.be/disclaimer
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > > > > 
	> > > _______________________________________________
	> > > general mailing list
	> > > general at lists.openfabrics.org
	> > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
	> > >
	> > > To unsubscribe, please visit
	> > > http://openib.org/mailman/listinfo/openib-general
	> >


	
