[openib-general] [PATCH] opensm: truncate log file when fs is overflowed

Sasha Khapyorsky sashak at voltaire.com
Tue Aug 29 11:15:35 PDT 2006


On 18:28 Sun 27 Aug     , Doug Ledford wrote:
> On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote:
> > On 13:01 Sun 20 Aug     , Hal Rosenstock wrote:
> > > Hi Sasha,
> > > 
> > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote:
> > > > If the OpenSM log file fills the filesystem and write() fails with
> > > > 'No space left on device', try to truncate the log file and wrap
> > > > around the logging.
> > > 
> > > Should it be an (admin) option as to whether to truncate the file or not
> > > or is there no way to continue without logging (other than this) once
> > > the log file fills the disk ?
> > 
> > In theory OpenSM may continue, but I don't think it is a good idea to
> > leave a full disk on the SM machine (by default the log is under
> > '/var/log'). To me truncating looks like a reasonable default
> > behavior; I don't think we need the option.
> 
> I would definitely put the option in, and in fact would default it to
> *NOT* truncate.  If the disk is full, you have no idea why.  It *might*
> be your logs, or it might be a mail bomb filling /var/spool/mail.  I'm
> sure as an admin the last thing I would want is my apps deciding, based
> upon incomplete information, that wiping out their log files is the
> right thing to do.  To me that sounds more like an intruder covering his
> tracks than a reasonable thing to do when confronted with ENOSPC.
> 
> Truncating logs is something best left up to the admin that's dealing
> with the disk full problem in the first place.  After all, if it is
> something like an errant app filling the mail spool, truncating the logs
> just loses valuable logs while at the same time making room for the app
> to keep on adding more to /var/spool/mail.  That's just wrong.  If you
> run out of space, just quit logging things until the admin clears the
> problem up.  If you put this code in, make the admin turn it on.  That
> will keep opensm friendly to appliance-like devices that are single task
> subnet managers.  But I don't think having this patch always on makes
> any sense on a multi task server.

My expectation is that on a machine running OpenSM, the OpenSM log is a
more likely cause of ENOSPC than mail bombs or other activity.

But I see your point - don't take this control away from the admin... I
will make this ENOSPC handling optional. Actually, another patch has
already been submitted that adds an option limiting the OpenSM log file
size. I will put the ENOSPC processing under the same option.
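
To make the opt-in behavior concrete, here is a rough sketch of how it
might look. This is not the resent patch: the struct, the
truncate_on_enospc flag and all sketch_* names are hypothetical, and it
writes through a raw file descriptor rather than the FILE stream used in
osm_log.c. The retry bound of 3 is taken from the patch quoted below.
With the flag off (the default here) a failed write just drops the
message and leaves the existing log alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAX_RETRIES 3 /* same bound as "log_exit_count < 3" in the patch */

/* Hypothetical logger state; the names are illustrative, not OpenSM's. */
struct sketch_log {
    int fd;                 /* log file descriptor */
    int truncate_on_enospc; /* admin opt-in, 0 (off) by default */
    int retries;            /* consecutive truncate-and-retry attempts */
};

/* Truncate only on ENOSPC, only if the admin opted in, and only a few times. */
static int sketch_should_truncate(const struct sketch_log *log, int err)
{
    return err == ENOSPC && log->truncate_on_enospc &&
           log->retries < SKETCH_MAX_RETRIES;
}

/* Write the whole buffer; returns 0 on success, -1 with errno set. */
static int sketch_write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted, try again */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Log one message; returns 0 if written, -1 if it was dropped. */
static int sketch_log_write(struct sketch_log *log, const char *msg)
{
    assert(log != NULL && msg != NULL);

    for (;;) {
        int err;

        if (sketch_write_all(log->fd, msg, strlen(msg)) == 0) {
            log->retries = 0;
            return 0;
        }
        err = errno;
        if (!sketch_should_truncate(log, err))
            return -1; /* keep the old log, drop this message */

        fprintf(stderr, "log write failed: %s. Will truncate the log file.\n",
                strerror(err));
        if (ftruncate(log->fd, 0) != 0)
            fprintf(stderr, "ftruncate failed: %s\n", strerror(errno));
        /* Wrap around to the start even if the truncation itself failed. */
        lseek(log->fd, 0, SEEK_SET);
        log->retries++;
    }
}
```

A caller would fill in struct sketch_log once (fd from open(), flag from
whatever option the admin sets) and call sketch_log_write() per message.
Keeping the decision in sketch_should_truncate() means the default path
never touches the existing log contents.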

Hal, I will resend the patch soon.

Sasha

> 
> > > 
> > > See comment below as well.
> > > 
> > > -- Hal
> > > 
> > > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > > > ---
> > > > 
> > > >  osm/opensm/osm_log.c |   23 +++++++++++++++--------
> > > >  1 files changed, 15 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c
> > > > index 668e9a6..b4700c8 100644
> > > > --- a/osm/opensm/osm_log.c
> > > > +++ b/osm/opensm/osm_log.c
> > > > @@ -58,6 +58,7 @@ #include <stdarg.h>
> > > >  #include <fcntl.h>
> > > >  #include <sys/types.h>
> > > >  #include <sys/stat.h>
> > > > +#include <errno.h>
> > > >  
> > > >  #ifndef WIN32
> > > >  #include <sys/time.h>
> > > > @@ -152,6 +153,7 @@ #endif    
> > > >      cl_spinlock_acquire( &p_log->lock );
> > > >  #ifdef WIN32
> > > >      GetLocalTime(&st);
> > > > + _retry:
> > > >      ret = fprintf(   p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s",
> > > >                       st.wHour, st.wMinute, st.wSecond, st.wMilliseconds,
> > > >                       pid, buffer);
> > > > @@ -159,6 +161,7 @@ #ifdef WIN32
> > > >  #else
> > > >      pid = pthread_self();
> > > >      tim = time(NULL);
> > > > + _retry:
> > > >      ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s",
> > > >                     ((result.tm_mon < 12) && (result.tm_mon >= 0) ? 
> > > >                      month_str[result.tm_mon] : "???"),
> > > > @@ -166,6 +169,18 @@ #else
> > > >                     result.tm_min, result.tm_sec,
> > > >                     usecs, pid, buffer);
> > > >  #endif /*  WIN32 */
> > > > +
> > > > +    if (ret >= 0)
> > > > +      log_exit_count = 0;
> > > > +    else if (errno == ENOSPC && log_exit_count < 3) {
> > > > +      int fd = fileno(p_log->out_port);
> > > > +      fprintf(stderr, "log write failed: %s. Will truncate the log file.\n",
> > > > +              strerror(errno));
> > > > +      ftruncate(fd, 0);
> > > 
> > > Should return from ftruncate be checked here ?
> > 
> > It could be checked, but I don't think a potential ftruncate() failure
> > should change the flow - in case of failure we will try to continue
> > with lseek() anyway (in order to wrap around the file at least).
> > 
> > Sasha
> > 
> > > 
> > > > +      lseek(fd, 0, SEEK_SET);
> > > > +      log_exit_count++;
> > > > +      goto _retry;
> > > > +    }
> > > >      
> > > >      /*
> > > >        Flush log on errors too.
> > > > @@ -174,14 +189,6 @@ #endif /*  WIN32 */
> > > >        fflush( p_log->out_port );
> > > >      
> > > >      cl_spinlock_release( &p_log->lock );
> > > > -    
> > > > -    if (ret < 0)
> > > > -    {
> > > > -      if (log_exit_count++ < 10)
> > > > -      {
> > > > -        fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n");
> > > > -      }
> > > > -    }
> > > >    }
> > > >  }
> > > >  
> > > 
> > 
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> -- 
> Doug Ledford <dledford at redhat.com>
>               GPG KeyID: CFBFF194
>               http://people.redhat.com/dledford
> 
> Infiniband specific RPMs available at
>               http://people.redhat.com/dledford/Infiniband





