[ofa-general] GPFS node loses IB-connection
SEGERS Koen
Koen.SEGERS at VRT.BE
Tue May 29 05:28:57 PDT 2007
One of the machines has 2 dropped packets:
gpfswhbe2n1:~ # ifconfig ib0
ib0 Link encap:UNSPEC HWaddr
80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:192.168.2.1 Bcast:192.168.4.255 Mask:255.255.255.0
inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:148363444 (141.4 Mb) TX bytes:6715076 (6.4 Mb)
Can this be related?
Does anyone know how this is possible with SDP? I thought SDP was an RC.
I'm also curious how GPFS reacts to this. Do you know where I can find
the timestamp of these dropped packets?
Koen
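
As far as I know, the interface counters shown by ifconfig are plain totals and carry no per-packet timestamps, so one way to get approximate times is to poll the counter and note when it changes. The following is a minimal sketch, assuming a Linux host that exposes the counter at /sys/class/net/ib0/statistics/tx_dropped. The function names, the one-second interval, and the `root` and `sleep` parameters are illustrative choices, not part of OFED or GPFS.

```python
import time
from datetime import datetime
from pathlib import Path


def read_counter(iface, counter, root="/sys/class/net"):
    """Return the current value of one interface statistics counter."""
    return int(Path(root, iface, "statistics", counter).read_text().strip())


def watch_counter(iface="ib0", counter="tx_dropped", interval=1.0,
                  iterations=None, root="/sys/class/net", sleep=time.sleep):
    """Poll a counter and record a wall-clock timestamp whenever it changes.

    The timestamp is only as precise as the polling interval: it marks when
    the change was noticed, not when the packet was actually dropped.
    """
    events = []
    last = read_counter(iface, counter, root)
    polls = 0
    while iterations is None or polls < iterations:
        sleep(interval)
        current = read_counter(iface, counter, root)
        if current != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {iface} {counter}: {last} -> {current}")
            events.append((stamp, last, current))
            last = current
        polls += 1
    return events
```

Calling `watch_counter("ib0", "tx_dropped")` runs until interrupted and prints one line per change, which can then be compared with the timestamps in /var/log/messages.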
-----Original Message-----
From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
Sent: Tuesday, May 29, 2007 14:03
To: SEGERS Koen
CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
Subject: RE: [ofa-general] GPFS node loses IB-connection
If this is an actual resize request, then there is no problem when it is
dropped.
Since you are running RC1, no resize requests should be sent, so this
means there is a problem, since data could be dropped. Do you notice lost
data?
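
To see how often and when the unrecognized SDP message shows up, the syslog can be scanned for lines shaped like the `SDP: FIXME MID 11` entry quoted further down in this thread. This is a minimal sketch that assumes the classic syslog line layout seen in that entry. The function name and the `year` parameter are hypothetical, and the year has to be supplied because this syslog format does not record it.

```python
import re
from datetime import datetime

# Matches lines such as:
#   May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
SDP_FIXME = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: SDP: FIXME MID (?P<mid>\d+)"
)


def find_sdp_fixme(lines, year=2007):
    """Return (timestamp, host, message id) for each SDP FIXME line.

    Month names are parsed with %b, so this assumes an English locale.
    """
    hits = []
    for line in lines:
        m = SDP_FIXME.match(line)
        if m:
            stamp = datetime.strptime(f"{year} {m['stamp']}",
                                      "%Y %b %d %H:%M:%S")
            hits.append((stamp, m["host"], int(m["mid"])))
    return hits
```

For example, `find_sdp_fixme(open("/var/log/messages"))` would list every occurrence with its time, which makes it easier to line the messages up with the moments GPFS reported errors.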
On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> this message was added only a few days ago.
>
> How can you be so sure that this doesn't pose any problems?
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:amip at dev.mellanox.co.il]
> Sent: Tuesday, May 29, 2007 13:35
> To: SEGERS Koen
> CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> general-bounces at lists.openfabrics.org; general at lists.openfabrics.org
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> This means you are getting a message your SDP does not recognize.
> Message 11 is the resize request, which was added to SDP a few days ago.
> Can it be that you are running 2 different versions of OFED?
> Anyway, this doesn't pose any problem, so you can ignore it.
>
> On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > Hi,
> >
> > Saturday we did a different stress test.
> > This is what we see in /var/log/messages:
> >
> > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> >
> > There were errors from that time on. Can someone explain to me what
> > this message means?
> >
> > Koen
> >
> > -----Original Message-----
> > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > Sent: Wednesday, May 23, 2007 17:41
> > To: SEGERS Koen; Hal Rosenstock
> > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > general at lists.openfabrics.org
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > Try 20 seconds. I'm curious whether you are barely crossing the
> > 10-second threshold.
> >
> > Scott
> >
> > > -----Original Message-----
> > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > Cc: Clive Hall (clivhall);
> > > general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > What value would you recommend then?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > Sent: Wednesday, May 23, 2007 17:38
> > > To: SEGERS Koen; Hal Rosenstock
> > > CC: Clive Hall (clivhall); general-bounces at lists.openfabrics.org;
> > > general at lists.openfabrics.org
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > The boot time of the host doesn't matter for this timeout. While the
> > > host is booting, the IB link is down anyway.
> > >
> > > Scott
> > >
> > > > -----Original Message-----
> > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > Cc: Clive Hall (clivhall);
> > > > general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > After a whole day of stress testing with the MAD renicing turned on,
> > > > we got the error once. So I think I should also raise the timeout on
> > > > the switch.
> > > >
> > > > It takes about 2 minutes to boot the system. Do you agree that this
> > > > is a good value for the timeout?
> > > >
> > > > Scott,
> > > > Can you explain the memlock problem to me?
> > > >
> > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > install this, the bug does not affect us. This is correct, isn't it?
> > > >
> > > > Greetz
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Hal Rosenstock [mailto:halr at voltaire.com]
> > > > Sent: Wednesday, May 23, 2007 16:12
> > > > To: Scott Weitzenkamp (sweitzen)
> > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > general-bounces at lists.openfabrics.org;
> > > > general at lists.openfabrics.org
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes
> > > > > in openib.conf,
> > > >
> > > > Does the host really not respond to MAD requests for over 10 seconds
> > > > in some cases?
> > > >
> > > > -- Hal
> > > >
> > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > SLES10 for bug 267, etc.).
> > > > >
> > > > > Scott Weitzenkamp
> > > > > SQA and Release Manager
> > > > > Server Virtualization Business Unit
> > > > > Cisco Systems
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:Koen.SEGERS at VRT.BE]
> > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > So far, all tests seem to work.
> > > > > >
> > > > > > Thanks for the help!
> > > > > >
> > > > > > Scott,
> > > > > > Are there more bug fixes that Cisco includes in its RPMs?
> > > > > >
> > > > > > Greetz
> > > > > >
> > > > > > Koen
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com]
> > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > CC: Shirley Ma; Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > general-bounces at lists.openfabrics.org
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > It's not so much pinging every 10 seconds as expecting a response
> > > > > > within 10 seconds (Clive, correct me if I'm wrong).
> > > > > >
> > > > > > You only need to do 1) or 2), not both. Cisco configures 1) in the
> > > > > > OFED binary RPMs we release at
> > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to
> > > > > > have the host be more responsive.
> > > > > >
> > > > > >
> > > > > > Scott Weitzenkamp
> > > > > > SQA and Release Manager
> > > > > > Server Virtualization Business Unit
> > > > > > Cisco Systems
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Koen Segers [mailto:koen.segers at VRT.BE]
> > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > > general at lists.openfabrics.org;
> > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > >
> > > > > > > If I understand it right, the switch is actually polling
> > > > > > > (=pinging) the interfaces every 10s. This means that when the
> > > > > > > interface is handling other traffic, the poll can fail and the
> > > > > > > port could be considered out of service. My question is then:
> > > > > > > "How can the timeout be reached while packets are being
> > > > > > > sent/received to/from the interface?"
> > > > > > >
> > > > > > > Anyway, what timeout value would you recommend for us? And why?
> > > > > > >
> > > > > > > To recapitulate, these are the actions I'll take tomorrow:
> > > > > > > 1) change the MAD niceness of the servers
> > > > > > > 2) change the timeout on the switches
> > > > > > >
> > > > > > > Are these changes sufficient for the HCAs to keep their ports in
> > > > > > > PORT_ACTIVE state?
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Koen
> > > > > > >
> > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > > > >
> > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00 node-timeout <value>
> > > > > > > >
> > > > > > > > The default is 10 seconds; it can be configured up to 2000 seconds.
> > > > > > > > If an HCA is completely unresponsive for longer than the
> > > > > > > > node-timeout value, then we consider that HCA out of service.
> > > > > > > >
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ______________________________________________________________
> > > > > > > > From: Shirley Ma [mailto:xma at us.ibm.com]
> > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > To: koen.segers at VRT.BE
> > > > > > > > Cc: Ami Perlmutter; general at lists.openfabrics.org;
> > > > > > > > general-bounces at lists.openfabrics.org;
> > > > > > > > Scott Weitzenkamp (sweitzen)
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Koen,
> > > > > > > >
> > > > > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > pointed out earlier). The same workaround might work for you:
> > > > > > > > renicing ib_mad, as Scott suggested.
> > > > > > > >
> > > > > > > > I think this should be a tunable SM query timeout value in
> > > > > > > > Cisco SM. Am I right, Scott?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Shirley Ma
> > > > > > > >
> > > > > > > >
> > > > > > > > Koen Segers <koen.segers at VRT.BE>
> > > > > > > > 05/22/07 11:14 AM
> > > > > > > > Please respond to koen.segers at VRT.BE
> > > > > > > >
> > > > > > > > To: Shirley Ma/Beaverton/IBM at IBMUS
> > > > > > > > cc: Ami Perlmutter <amip at dev.mellanox.co.il>,
> > > > > > > > general at lists.openfabrics.org,
> > > > > > > > general-bounces at lists.openfabrics.org
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > It is the Cisco SM.
> > > > > > > >
> > > > > > > > SFS-7000P> show version
> > > > > > > >
> > > > > > > > ================================================================================
> > > > > > > > System Version Information
> > > > > > > > ================================================================================
> > > > > > > > system-version : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > contact : tac at cisco.com
> > > > > > > > name : SFS-7000P
> > > > > > > > location : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > up-time : 11(d):7(h):49(m):3(s)
> > > > > > > > last-change : none
> > > > > > > > last-config-save : none
> > > > > > > > action : none
> > > > > > > > result : none
> > > > > > > > oper-mode : normal
> > > > > > > >
> > > > > > > > There is also a command that gives the SM version, but I can't
> > > > > > > > find it right now.
> > > > > > > >
> > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > Hello Koen,
> > > > > > > > >
> > > > > > > > > From the switch log, it looks like an SM issue to me. The node
> > > > > > > > > was kicked out of the membership. Which SM are you using in
> > > > > > > > > your fabric?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Shirley Ma
> > > > > > > > >
> > > > > > > > *** Disclaimer ***
> > > > > > > >
> > > > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > >
> > > > > > > > nv van publiek recht
> > > > > > > > BTW BE 0244.142.664
> > > > > > > > RPR Brussel
> > > > > > > > http://www.vrt.be/disclaimer
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > _______________________________________________
> > > > > general mailing list
> > > > > general at lists.openfabrics.org
> > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > >
> > > > > To unsubscribe, please visit
> > > > http://openib.org/mailman/listinfo/openib-general
> > > >