[openib-general] [PATCH] opensm: truncate log file when fs is overflowed

Doug Ledford dledford at redhat.com
Sun Aug 27 15:28:06 PDT 2006


On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote:
> On 13:01 Sun 20 Aug     , Hal Rosenstock wrote:
> > Hi Sasha,
> > 
> > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote:
> > > In case when OpenSM log file overflows filesystem and write() fails with
> > > 'No space left on device' try to truncate the log file and wrap-around
> > > logging.
> > 
> > Should it be an (admin) option as to whether to truncate the file or not
> > or is there no way to continue without logging (other than this) once
> > the log file fills the disk ?
> 
> In theory OpenSM may continue, but don't think it is good idea to leave
> overflowed disk on the SM machine (by default it is '/var/log'). For me
> truncating there looks as reasonable default behavior, don't think we
> need the option.

I would definitely put the option in, and in fact would default it to
*NOT* truncate.  If the disk is full, you have no idea why.  It *might*
be your logs, or it might be a mail bomb filling /var/spool/mail.  I'm
sure as an admin the last thing I would want is my apps deciding, based
upon incomplete information, that wiping out their log files is the
right thing to do.  To me that sounds more like an intruder covering his
tracks than a reasonable thing to do when confronted with ENOSPC.

Truncating logs is something best left up to the admin that's dealing
with the disk full problem in the first place.  After all, if it is
something like an errant app filling the mail spool, truncating the logs
just loses valuable logs while at the same time making room for the app
to keep on adding more to /var/spool/mail.  That's just wrong.  If you
run out of space, just quit logging things until the admin clears the
problem up.  If you put this code in, make the admin turn it on.  That
will keep opensm friendly to appliance-like devices that are single-task
subnet managers.  But I don't think having this patch always on makes
any sense on a multi-task server.
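
To make the suggestion concrete, here is a rough sketch of an opt-in
policy.  It is not the posted patch and not existing OpenSM code: the
names log_full_policy, sketch_log, sketch_should_truncate and
sketch_log_write are all made up for illustration, and it writes through
a raw file descriptor rather than the FILE pointer osm_log.c uses.  The
default is to stop logging on any write error; truncation happens only
when the admin has asked for it and the error really is ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy knob; nothing like this exists in the posted patch. */
enum log_full_policy {
    LOG_FULL_STOP = 0,      /* default: keep the file, stop logging */
    LOG_FULL_TRUNCATE = 1   /* admin opt-in: truncate and wrap around */
};

/* Hypothetical log handle for this sketch only. */
struct sketch_log {
    int fd;                         /* descriptor of the open log file */
    enum log_full_policy policy;    /* chosen by the admin */
    int suspended;                  /* set once logging has been stopped */
};

/* Truncate only if the admin opted in AND the error is "disk full". */
static int sketch_should_truncate(enum log_full_policy policy, int err)
{
    return policy == LOG_FULL_TRUNCATE && err == ENOSPC;
}

/* Returns 0 if the whole message was written, -1 otherwise. */
static int sketch_log_write(struct sketch_log *log, const char *msg)
{
    size_t len, done = 0;
    int truncated = 0;

    assert(log != NULL && msg != NULL);
    len = strlen(msg);

    if (log->suspended)
        return -1;      /* stay quiet until the admin re-enables logging */

    while (done < len) {
        ssize_t n = write(log->fd, msg + done, len - done);

        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;   /* interrupted by a signal, try again */

        /* At most one truncate-and-retry per message. */
        if (n < 0 && !truncated &&
            sketch_should_truncate(log->policy, errno) &&
            ftruncate(log->fd, 0) == 0 &&
            lseek(log->fd, 0, SEEK_SET) == 0) {
            truncated = 1;
            done = 0;   /* restart the message in the emptied file */
            continue;
        }

        log->suspended = 1;     /* default path: just quit logging */
        return -1;
    }
    return 0;
}
```

With policy left at LOG_FULL_STOP, a full disk only sets the suspended
flag and the existing log contents are untouched.  How the admin would
set the policy (command line switch, config file) is left open here.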

> > 
> > See comment below as well.
> > 
> > -- Hal
> > 
> > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > > ---
> > > 
> > >  osm/opensm/osm_log.c |   23 +++++++++++++++--------
> > >  1 files changed, 15 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c
> > > index 668e9a6..b4700c8 100644
> > > --- a/osm/opensm/osm_log.c
> > > +++ b/osm/opensm/osm_log.c
> > > @@ -58,6 +58,7 @@ #include <stdarg.h>
> > >  #include <fcntl.h>
> > >  #include <sys/types.h>
> > >  #include <sys/stat.h>
> > > +#include <errno.h>
> > >  
> > >  #ifndef WIN32
> > >  #include <sys/time.h>
> > > @@ -152,6 +153,7 @@ #endif    
> > >      cl_spinlock_acquire( &p_log->lock );
> > >  #ifdef WIN32
> > >      GetLocalTime(&st);
> > > + _retry:
> > >      ret = fprintf(   p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s",
> > >                       st.wHour, st.wMinute, st.wSecond, st.wMilliseconds,
> > >                       pid, buffer);
> > > @@ -159,6 +161,7 @@ #ifdef WIN32
> > >  #else
> > >      pid = pthread_self();
> > >      tim = time(NULL);
> > > + _retry:
> > >      ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s",
> > >                     ((result.tm_mon < 12) && (result.tm_mon >= 0) ? 
> > >                      month_str[result.tm_mon] : "???"),
> > > @@ -166,6 +169,18 @@ #else
> > >                     result.tm_min, result.tm_sec,
> > >                     usecs, pid, buffer);
> > >  #endif /*  WIN32 */
> > > +
> > > +    if (ret >= 0)
> > > +      log_exit_count = 0;
> > > +    else if (errno == ENOSPC && log_exit_count < 3) {
> > > +      int fd = fileno(p_log->out_port);
> > > +      fprintf(stderr, "log write failed: %s. Will truncate the log file.\n",
> > > +              strerror(errno));
> > > +      ftruncate(fd, 0);
> > 
> > Should return from ftruncate be checked here ?
> 
> May be checked, but I don't think that potential ftruncate() failure
> should change the flow - in case of failure we will try to continue
> with lseek() anyway (in order to wrap around the file at least).
> 
> Sasha
> 
> > 
> > > +      lseek(fd, 0, SEEK_SET);
> > > +      log_exit_count++;
> > > +      goto _retry;
> > > +    }
> > >      
> > >      /*
> > >        Flush log on errors too.
> > > @@ -174,14 +189,6 @@ #endif /*  WIN32 */
> > >        fflush( p_log->out_port );
> > >      
> > >      cl_spinlock_release( &p_log->lock );
> > > -    
> > > -    if (ret < 0)
> > > -    {
> > > -      if (log_exit_count++ < 10)
> > > -      {
> > > -        fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n");
> > > -      }
> > > -    }
> > >    }
> > >  }
> > >  
> > 
> 
-- 
Doug Ledford <dledford at redhat.com>
              GPG KeyID: CFBFF194
              http://people.redhat.com/dledford

Infiniband specific RPMs available at
              http://people.redhat.com/dledford/Infiniband