[ofa-general] GPFS node loses IB-connection
Ami Perlmutter
amip at dev.mellanox.co.il
Tue May 29 08:05:24 PDT 2007
Any chance of moving to rc3 (or waiting until rc4)?
On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
>
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
>
> Koen
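To check the suspected correlation, the times of the SDP messages can be pulled out of the syslog and compared with the time GPFS lost its filesystem. The following is only a minimal sketch: the function name `find_sdp_fixme` is hypothetical, it assumes the classic syslog line format seen in this thread ("May 25 10:41:19 host kernel: SDP: FIXME MID 11"), and because that format carries no year, the year must be supplied by the caller.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
SDP_FIXME = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"kernel:.*SDP: FIXME MID (?P<mid>\d+)"
)


def find_sdp_fixme(lines, year=2007):
    """Return (timestamp, host, message id) for every 'SDP: FIXME MID' line."""
    events = []
    for line in lines:
        match = SDP_FIXME.match(line)
        if match is None:
            continue  # not an SDP FIXME line
        # Classic syslog timestamps omit the year, so it is passed in.
        stamp = datetime.strptime(
            f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        events.append((stamp, match.group("host"), int(match.group("mid"))))
    return events
```

Feeding it the lines of /var/log/messages (for example `find_sdp_fixme(open("/var/log/messages"))`) gives a list of timestamps that can be laid next to the GPFS logs.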
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
>
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> >
> > Koen
> >
> > -----Original Message-----
> > From: general-bounces at lists.openfabrics.org
> > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: amip at dev.mellanox.co.il
> > CC: general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > One of the machines has 2 dropped packets:
> >
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> > inet addr:192.168.2.1 Bcast:192.168.4.255 Mask:255.255.255.0
> > inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> > UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
> > RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> > collisions:0 txqueuelen:128
> > RX bytes:148363444 (141.4 Mb) TX bytes:6715076 (6.4 Mb)
> >
> > Can this be related?
> >
> > Does anyone know how this is possible with SDP? I thought SDP used RC.
> > I'm also curious how gpfs reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
> >
> > Koen
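As far as I know, the interface counters carry no per-drop timestamp, so one way to approximate when the drops happen is to sample the counter periodically and note when it increases. A minimal sketch, assuming a Linux host that exposes /sys/class/net/ib0/statistics/tx_dropped; the function names are hypothetical, and these counters describe the IPoIB interface only, not SDP traffic.

```python
import time
from pathlib import Path


def read_tx_dropped(iface="ib0"):
    """Read the TX dropped counter of a network interface from sysfs."""
    path = Path(f"/sys/class/net/{iface}/statistics/tx_dropped")
    return int(path.read_text())


def drop_events(samples):
    """Given (timestamp, counter) samples in order, return (timestamp, delta)
    for every sample at which the counter increased."""
    events = []
    previous = None
    for stamp, count in samples:
        if previous is not None and count > previous:
            events.append((stamp, count - previous))
        previous = count
    return events


def watch(iface="ib0", interval=1.0, read=read_tx_dropped):
    """Poll the counter forever and print a timestamp whenever it grows.
    The timestamp is only accurate to within one polling interval."""
    previous = read(iface)
    while True:
        time.sleep(interval)
        current = read(iface)
        if current > previous:
            now = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{now} {iface} tx_dropped +{current - previous}")
        previous = current
```

Running `watch("ib0")` during a stress test prints one line per interval in which packets were dropped, which can then be compared with the syslog and GPFS log times.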
> >
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > if this is an actual resize request then there is no problem when it is
> > dropped.
> > since you are running rc1, no resize requests should be sent, so this
> > means there is a problem since data could be dropped. do you notice lost
> > data?
> >
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > > this message was added only a few days ago.
> > >
> > > How can you be so sure that this doesn't pose any problems?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is a resize request, which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > >
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > >
> > > > On Saturday we ran a different stress test.
> > > > This is what we see in the /var/log/messages:
> > > >
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > >
> > > > There were errors from that time on. Can someone explain to me what this
> > > > message means?
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > >
> > > > Scott
> > > >
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > > general at lists.openfabrics.org
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > What value would you recommend then?
> > > > >
> > > > > Koen
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > > host is booting, the IB link is down anyway.
> > > > >
> > > > > Scott
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > > Cc: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > > > > > general at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > > After a whole day of stress testing with the MAD renicing turned on, we
> > > > > > > got the error once. So I think I should also raise the timeout on the
> > > > > > > switch.
> > > > > > >
> > > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > > > good value for the timeout?
> > > > > > >
> > > > > > > Scott,
> > > > > > > Can you explain the memlock problem to me?
> > > > > > >
> > > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > > > install this, the bug does not affect us. This is correct, isn't it?
> > > > > >
> > > > > > Greetz
> > > > > >
> > > > > > Koen
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > > > > general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > > > openib.conf,
> > > > > > >
> > > > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > > > some cases?
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > >
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > > So far, all tests seem to work.
> > > > > > > >
> > > > > > > > Thanks for the help!
> > > > > > > >
> > > > > > > > Scott,
> > > > > > > > > Are there more bug fixes that Cisco applies in its RPMs?
> > > > > > > >
> > > > > > > > Greetz
> > > > > > > >
> > > > > > > > Koen
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > > >
> > > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) in the OFED
> > > > > > > > > binary RPMs we release at
> > > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have
> > > > > > > > > the host be more responsive.
> > > > > > > >
> > > > > > > >
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > > Cc: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >
> > > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > > interfaces every 10s. This means that when the interface is handling
> > > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > > >
> > > > > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > > > > >
> > > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > > 2) change the timeout on the switches
> > > > > > > > > >
> > > > > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > > > > PORT_ACTIVE state?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Koen
> > > > > > > > >
> > > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > > > > > > >
> > > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > > > > >
> > > > > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > > > > If an HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > ______________________________________________________________
> > > > > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > > > To: koen.segers at VRT.BE
> > > > > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > > > > general-bounces at lists.openfabrics.org; Scott Weitzenkamp (sweitzen)
> > > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Koen,
> > > > > > > > > >
> > > > > > > > > > > So it is most likely that you hit the same bug as 229 (Scott
> > > > > > > > > > > pointed out earlier). The same workaround might work for you
> > > > > > > > > > > by renicing ib_mad as Scott suggested.
> > > > > > > > > > >
> > > > > > > > > > > I think this should be a tunable SM query timeout value in
> > > > > > > > > > > Cisco SM. Am I right, Scott?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Koen Segers <koen.segers at VRT.BE>
> > > > > > > > > > > Sent: 05/22/07 11:14 AM
> > > > > > > > > > > Please respond to: koen.segers at VRT.BE
> > > > > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > > > > Cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > > > > general at lists.openfabrics.org, general-bounces at lists.openfabrics.org
> > > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > It is the Cisco SM.
> > > > > > > > > >
> > > > > > > > > > SFS-7000P> show version
> > > > > > > > > > >
> > > > > > > > > > > ================================================================================
> > > > > > > > > > > System Version Information
> > > > > > > > > > > ================================================================================
> > > > > > > > > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > > > contact : tac at cisco.com
> > > > > > > > > > > name : SFS-7000P
> > > > > > > > > > > location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > > > up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > > > > last-change : none
> > > > > > > > > > > last-config-save : none
> > > > > > > > > > > action : none
> > > > > > > > > > > result : none
> > > > > > > > > > > oper-mode : normal
> > > > > > > > > > >
> > > > > > > > > > > There is also a command that gives the SM version, but I can't
> > > > > > > > > > > find it right now.
> > > > > > > > > >
> > > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > > > > Hello Koen,
> > > > > > > > > > > >
> > > > > > > > > > > > From the switch log, it looks like an SM issue to me. The node was
> > > > > > > > > > > > kicked out of the membership. Which SM are you using in your
> > > > > > > > > > > > fabric?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Shirley Ma
> > > > > > > > > > >
> > > > > > > > > > *** Disclaimer ***
> > > > > > > > > >
> > > > > > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >
> > > > > > > > > > nv van publiek recht
> > > > > > > > > > BTW BE 0244.142.664
> > > > > > > > > > RPR Brussel
> > > > > > > > > > http://www.vrt.be/disclaimer
> > > > > > > > > >
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org
> > > > > > >
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > >
> > > > > > > To unsubscribe, please visit
> > > > > > http://openib.org/mailman/listinfo/openib-general
> > > > > >