[openib-general] [PATCH] opensm: truncate log file when fs is overflowed

Hal Rosenstock halr at voltaire.com
Tue Aug 29 12:55:01 PDT 2006


Hi Sasha,

On Tue, 2006-08-29 at 14:15, Sasha Khapyorsky wrote:
> On 18:28 Sun 27 Aug     , Doug Ledford wrote:
> > On Sun, 2006-08-20 at 20:18 +0300, Sasha Khapyorsky wrote:
> > > On 13:01 Sun 20 Aug     , Hal Rosenstock wrote:
> > > > Hi Sasha,
> > > > 
> > > > On Sun, 2006-08-20 at 12:05, Sasha Khapyorsky wrote:
> > > > > When the OpenSM log file fills the filesystem and write() fails with
> > > > > 'No space left on device', try to truncate the log file and wrap
> > > > > around the logging.
> > > > 
> > > > Should it be an (admin) option as to whether to truncate the file or not
> > > > or is there no way to continue without logging (other than this) once
> > > > the log file fills the disk ?
> > > 
> > > In theory OpenSM may continue, but I don't think it is a good idea to
> > > leave a full disk on the SM machine (by default it is '/var/log'). To me,
> > > truncating there looks like a reasonable default behavior; I don't think
> > > we need the option.
> > 
> > I would definitely put the option in, and in fact would default it to
> > *NOT* truncate.  If the disk is full, you have no idea why.  It *might*
> > be your logs, or it might be a mail bomb filling /var/spool/mail.  I'm
> > sure as an admin the last thing I would want is my apps deciding, based
> > upon incomplete information, that wiping out their log files is the
> > right thing to do.  To me that sounds more like an intruder covering his
> > tracks than a reasonable thing to do when confronted with ENOSPC.
> > 
> > Truncating logs is something best left up to the admin that's dealing
> > with the disk full problem in the first place.  After all, if it is
> > something like an errant app filling the mail spool, truncating the logs
> > just loses valuable logs while at the same time making room for the app
> > to keep on adding more to /var/spool/mail.  That's just wrong.  If you
> > run out of space, just quit logging things until the admin clears the
> > problem up.  If you put this code in, make the admin turn it on.  That
> > will keep opensm friendly to appliance-like devices that are single-task
> > subnet managers.  But I don't think having this patch always on makes
> > any sense on a multi task server.
> 
> My expectation is that when OpenSM is running it will generate ENOSPC
> more frequently than mail bombs, or other activities.
> 
> But I see your point - don't take this control away from the admin... I will
> make this ENOSPC handling optional - actually, another patch was already
> submitted that adds an option limiting the OpenSM log file size. I will
> add the ENOSPC processing under the same option.
> 
> Hal, I will resend the patch soon.

I'd prefer an incremental one off the last patch related to this if that
isn't too much work as I'm close to committing the previous one now (and
it'd be more work to start over on this).
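
For reference, here is a rough sketch of how opt-in ENOSPC handling could be
shaped, with the ftruncate() return reported but not treated as fatal. This
is an illustration only, not the actual follow-up patch: struct log_sketch,
the truncate_on_enospc flag and log_handle_enospc() are hypothetical names,
and the retry limit of 3 is simply copied from the diff quoted below.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the relevant parts of the OpenSM log object. */
struct log_sketch {
    FILE *out;               /* log stream, like p_log->out_port in the patch */
    int truncate_on_enospc;  /* hypothetical admin option, off by default */
    int failures;            /* consecutive failures, like log_exit_count */
};

/*
 * Called after a failed log write, with the errno value of that write.
 * Returns 0 if the log was reset and the write may be retried,
 * -1 if the log file was left untouched.
 */
static int log_handle_enospc(struct log_sketch *l, int err)
{
    int fd;

    assert(l != NULL && l->out != NULL);

    /* Only act on ENOSPC, only when the admin opted in, and give up
       after three attempts in a row. */
    if (err != ENOSPC || !l->truncate_on_enospc || l->failures >= 3)
        return -1;
    l->failures++;

    fd = fileno(l->out);
    if (ftruncate(fd, 0) < 0)
        fprintf(stderr, "log truncate failed: %s\n", strerror(errno));

    /* Wrap around to the start of the file even if truncation failed. */
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    return 0;
}
```

With the flag left at 0 the function never touches the file, which gives
the "quit logging until the admin clears the problem up" behavior by default.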

-- Hal

> Sasha
> 
> > 
> > > > 
> > > > See comment below as well.
> > > > 
> > > > -- Hal
> > > > 
> > > > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > > > > ---
> > > > > 
> > > > >  osm/opensm/osm_log.c |   23 +++++++++++++++--------
> > > > >  1 files changed, 15 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/osm/opensm/osm_log.c b/osm/opensm/osm_log.c
> > > > > index 668e9a6..b4700c8 100644
> > > > > --- a/osm/opensm/osm_log.c
> > > > > +++ b/osm/opensm/osm_log.c
> > > > > @@ -58,6 +58,7 @@ #include <stdarg.h>
> > > > >  #include <fcntl.h>
> > > > >  #include <sys/types.h>
> > > > >  #include <sys/stat.h>
> > > > > +#include <errno.h>
> > > > >  
> > > > >  #ifndef WIN32
> > > > >  #include <sys/time.h>
> > > > > @@ -152,6 +153,7 @@ #endif    
> > > > >      cl_spinlock_acquire( &p_log->lock );
> > > > >  #ifdef WIN32
> > > > >      GetLocalTime(&st);
> > > > > + _retry:
> > > > >      ret = fprintf(   p_log->out_port, "[%02d:%02d:%02d:%03d][%04X] -> %s",
> > > > >                       st.wHour, st.wMinute, st.wSecond, st.wMilliseconds,
> > > > >                       pid, buffer);
> > > > > @@ -159,6 +161,7 @@ #ifdef WIN32
> > > > >  #else
> > > > >      pid = pthread_self();
> > > > >      tim = time(NULL);
> > > > > + _retry:
> > > > >      ret = fprintf( p_log->out_port, "%s %02d %02d:%02d:%02d %06d [%04X] -> %s",
> > > > >                     ((result.tm_mon < 12) && (result.tm_mon >= 0) ? 
> > > > >                      month_str[result.tm_mon] : "???"),
> > > > > @@ -166,6 +169,18 @@ #else
> > > > >                     result.tm_min, result.tm_sec,
> > > > >                     usecs, pid, buffer);
> > > > >  #endif /*  WIN32 */
> > > > > +
> > > > > +    if (ret >= 0)
> > > > > +      log_exit_count = 0;
> > > > > +    else if (errno == ENOSPC && log_exit_count < 3) {
> > > > > +      int fd = fileno(p_log->out_port);
> > > > > +      fprintf(stderr, "log write failed: %s. Will truncate the log file.\n",
> > > > > +              strerror(errno));
> > > > > +      ftruncate(fd, 0);
> > > > 
> > > > Should return from ftruncate be checked here ?
> > > 
> > > It could be checked, but I don't think that a potential ftruncate()
> > > failure should change the flow - in case of failure we will try to
> > > continue with lseek() anyway (in order to at least wrap around the file).
> > > 
> > > Sasha
> > > 
> > > > 
> > > > > +      lseek(fd, 0, SEEK_SET);
> > > > > +      log_exit_count++;
> > > > > +      goto _retry;
> > > > > +    }
> > > > >      
> > > > >      /*
> > > > >        Flush log on errors too.
> > > > > @@ -174,14 +189,6 @@ #endif /*  WIN32 */
> > > > >        fflush( p_log->out_port );
> > > > >      
> > > > >      cl_spinlock_release( &p_log->lock );
> > > > > -    
> > > > > -    if (ret < 0)
> > > > > -    {
> > > > > -      if (log_exit_count++ < 10)
> > > > > -      {
> > > > > -        fprintf(stderr, "OSM LOG FAILURE! Quota probably exceeded\n");
> > > > > -      }
> > > > > -    }
> > > > >    }
> > > > >  }
> > > > >  
> > > > 
> > > 
> > -- 
> > Doug Ledford <dledford at redhat.com>
> >               GPG KeyID: CFBFF194
> >               http://people.redhat.com/dledford
> > 
> > Infiniband specific RPMs available at
> >               http://people.redhat.com/dledford/Infiniband
> 
> 




