[openib-general] RE: [PATCH] Opensm - duplicated guids issue

Hal Rosenstock halr at voltaire.com
Mon Dec 5 10:40:08 PST 2005


Hi Eitan,

On Mon, 2005-12-05 at 10:32, Eitan Zahavi wrote:
> Hi Hal,
>
> Please see my response below
> > > Currently if OpenSM discovers duplicated guids
> >
> > What is the cause of a duplicated GUID ? Is it a misconfiguration of
> > someone's firmware (rather than some error on the part of OpenSM) ? If
> > so, I'm not sure exiting SM is the best option. IMO the policy is to
> > decide which GUID to "honor" (either the original one or the new one).
> [EZ] There is no way to know which GUID to honor if this is the first
> sweep. Moreover, the cause of a duplicated GUID is bad firmware
> burning.

IMO, leaving the assignment of a globally unique ID to firmware
configuration is a poor choice, as it is error prone.
It should be done at manufacturing time in something like an EEPROM. I
know this increases the cost, etc. but also reduces the chances of this
being an issue.

> Currently the last GUID found is honored but the fabric behind
> the first one is ignored.
> >
> > > or 12x link with lane reversal badly configured
> >
> > What does badly configured mean ? Does it mean the link does not come
> > up at all or just in some non desired mode ? How is "bad lane reversal"
> > reconfigured ?
> [EZ] Bad FW configuration. The details are provided in the IS3 PRM. But
> if one routes the board and swizzles the lanes, one has to enable
> automatic lane reversal detection in the INI file.
> >
> > Can't this also occur on a 4x link as well ?
> [EZ] No.
> >
> > >  it only issues an error to the log
> > > file. This issue, though, is much more problematic, since it will
> > > cause part of the subnet to be un-initialized.
> > > The following patch includes a fuller handling of the issue - first,
> > > issue an error message to the /var/log/messages file as well.
> >
> > I am incorporating this part of the patch.
> >
> > > Second - add an option flag to the SM that will define whether or not
> > > to exit in such a case.
> >
> > Also, there are other scenarios which mark the subnet initialization
> > as failed (but don't exit the SM). This seems inconsistent to me. These
> > cases also do not put errors out on syslog. Should they ?
> >
> > IMO, in general, exiting out of OpenSM should be avoided at all costs.
> > The admin can always cause this to occur if desired and operating part
> > of the subnet is better than none. Are these cases where the admin
> > would not want to run the SM until the issues were resolved ?
> [EZ] The case of "bad connectivity" is different than "initialization
> failure":
> "bad connectivity" is a static problem caused by bad firmware options
> used or even bad hardware. "initialization failure" can be caused by
> management packet dropping which may happen due to flaky links or even
> reasonable bit error rate.

I think there are other cases aside from the "bad connectivity" cases
you cite (as was seen at SC05).

> The proposal is to provide an option for the sake of exiting the SM on
> such "bad hardware/firmware" conditions. If one wants to keep going all
> he has to do is to set that option to 0.
>
> Needless to say, we have proposed this "exit condition" based on our
> experience where such cases have happened and the log message was
> ignored. Many man hours could have been saved if the SM had insisted on
> not running under such conditions.

I think there is a chance that there will be support calls this way too
since the OpenSM won't come up at all in this case. We can always change
the default for this (for exiting on these errors) from TRUE to FALSE if
and when this becomes an issue... Anyone else have an opinion on this ?
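
To make the trade-off concrete, here is a minimal standalone sketch of the
policy being discussed: a flag that defaults to TRUE, a -y / --stay_on_fatal
switch that clears it, and a report routine that always writes to syslog.
It is not code from the patch. The names sm_policy_t, sm_policy_set_default,
sm_policy_parse_args and sm_report_fatal are hypothetical, and unlike the
patch (which calls exit( 1 ) directly) the report routine only returns the
decision, which is an assumption made here so the caller can shut down
cleanly.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>
#include <syslog.h>

/* Hypothetical stand-in for the one field the patch adds to osm_subn_opt_t. */
typedef struct {
    bool exit_on_fatal;
} sm_policy_t;

/* Mirror the default proposed in the patch: exit on fatal issues. */
static void sm_policy_set_default(sm_policy_t *p)
{
    assert(p != NULL);
    p->exit_on_fatal = true;
}

/* Parse -y / --stay_on_fatal. Returns 0 on success, -1 on an unknown option. */
static int sm_policy_parse_args(sm_policy_t *p, int argc, char *argv[])
{
    static const struct option long_opts[] = {
        { "stay_on_fatal", no_argument, NULL, 'y' },
        { NULL,            0,           NULL,  0  }  /* required terminator */
    };
    int c;

    assert(p != NULL);
    optind = 1;  /* restart scanning so the function can be called again */
    while ((c = getopt_long(argc, argv, "y", long_opts, NULL)) != -1) {
        if (c == 'y')
            p->exit_on_fatal = false;
        else
            return -1;
    }
    return 0;
}

/*
 * Always report the condition to syslog, then tell the caller whether
 * the policy asks for the SM to stop. The caller decides how to exit.
 */
static bool sm_report_fatal(const sm_policy_t *p, const char *what)
{
    assert(p != NULL && what != NULL);
    syslog(LOG_ERR, "Errors on subnet: %s", what);
    return p->exit_on_fatal;
}
```

With the default policy, sm_report_fatal() asks the caller to stop; after
parsing -y it still logs but asks the caller to keep running. Flipping the
default later, as suggested above, would be a one-line change in
sm_policy_set_default().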

-- Hal

> >
> > -- Hal
> >
> > > Thanks,
> > > Yael
> > >
> > > Signed-off-by:  Yael Kalka <yael at mellanox.co.il>
> > >
> > > Index: include/opensm/osm_subnet.h
> > > ===================================================================
> > > --- include/opensm/osm_subnet.h       (revision 4288)
> > > +++ include/opensm/osm_subnet.h       (working copy)
> > > @@ -235,6 +235,7 @@ typedef struct _osm_subn_opt
> > >    osm_testability_modes_t  testability_mode;
> > >    boolean_t                updn_activate;
> > >    char *                   updn_guid_file;
> > > +  boolean_t                exit_on_fatal;
> > >  } osm_subn_opt_t;
> > >  /*
> > >  * FIELDS
> > > @@ -372,6 +373,13 @@ typedef struct _osm_subn_opt
> > >  *  updn_guid_file
> > >  *     Pointer to name of the UPDN guid file given by User
> > >  *
> > > +*  exit_on_fatal
> > > +*     If TRUE (default) - SM will exit on fatal subnet initialization issues.
> > > +*     If FALSE - SM will not exit.
> > > +*     Fatal initialization issues:
> > > +*     a. SM recognizes 2 different nodes with the same guid, or 12x link with
> > > +*        lane reversal badly configured.
> > > +*
> > >  * SEE ALSO
> > >  *    Subnet object
> > >  *********/
> > > Index: opensm/osm_subnet.c
> > > ===================================================================
> > > --- opensm/osm_subnet.c       (revision 4288)
> > > +++ opensm/osm_subnet.c       (working copy)
> > > @@ -440,6 +440,7 @@ osm_subn_set_default_opt(
> > >    p_opt->testability_mode = OSM_TEST_MODE_NONE;
> > >    p_opt->updn_activate = FALSE;
> > >    p_opt->updn_guid_file = NULL;
> > > +  p_opt->exit_on_fatal = TRUE;
> > >  }
> > >
> > >
> > >  /**********************************************************************
> > > @@ -765,6 +766,10 @@ osm_subn_parse_conf_file(
> > >        __osm_subn_opts_unpack_charp(
> > >          "updn_guid_file" ,
> > >          p_key, p_val, &p_opts->updn_guid_file);
> > > +
> > > +      __osm_subn_opts_unpack_boolean(
> > > +        "exit_on_fatal",
> > > +        p_key, p_val, &p_opts->exit_on_fatal);
> > >      }
> > >    }
> > >    fclose(opts_file);
> > > @@ -930,14 +935,17 @@ osm_subn_write_conf_file(
> > >      "# If TRUE if OpenSM should disable multicast support\n"
> > >      "no_multicast_option %s\n\n"
> > >      "# No multicast routing is performed if TRUE\n"
> > > -    "disable_multicast %s\n\n",
> > > +    "disable_multicast %s\n\n"
> > > +    "# If TRUE opensm will exit on fatal initialization issues\n"
> > > +    "exit_on_fatal %s\n\n",
> > >      p_opts->log_flags,
> > >      p_opts->force_log_flush ? "TRUE" : "FALSE",
> > >      p_opts->log_file,
> > >      p_opts->accum_log_file ? "TRUE" : "FALSE",
> > >      p_opts->dump_files_dir,
> > >      p_opts->no_multicast_option ? "TRUE" : "FALSE",
> > > -    p_opts->disable_multicast ? "TRUE" : "FALSE"
> > > +    p_opts->disable_multicast ? "TRUE" : "FALSE",
> > > +    p_opts->exit_on_fatal ? "TRUE" : "FALSE"
> > >      );
> > >
> > >    /* optional string attributes ... */
> > > Index: opensm/osm_node_info_rcv.c
> > > ===================================================================
> > > --- opensm/osm_node_info_rcv.c        (revision 4288)
> > > +++ opensm/osm_node_info_rcv.c        (working copy)
> > > @@ -198,6 +198,14 @@ __osm_ni_rcv_set_links(
> > >                       p_ni_context->port_num,
> > >                       dr_new_path
> > >                       );
> > > +
> > > +            osm_log( p_rcv->p_log, OSM_LOG_SYS,
> > > +                     "Errors on subnet. SM found duplicated guids or 12x "
> > > +                     "link with lane reversal badly configured. "
> > > +                     "Use osm log for more details.\n");
> > > +
> > > +            if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE )
> > > +              exit( 1 );
> > >            }
> > >
> > >            /*
> > > Index: opensm/main.c
> > > ===================================================================
> > > --- opensm/main.c     (revision 4288)
> > > +++ opensm/main.c     (working copy)
> > > @@ -178,6 +178,12 @@ show_usage(void)
> > >            "          This option will cause deletion of the log file\n"
> > >            "          (if it previously exists). By default, the log file\n"
> > >            "          is accumulative.\n\n");
> > > +  printf( "-y\n"
> > > +          "--stay_on_fatal\n"
> > > +          "          This option will cause SM not to exit on fatal initialization\n"
> > > +          "          issues: If SM discovers duplicated guids or 12x link with\n"
> > > +          "          lane reversal badly configured.\n"
> > > +          "          By default, the SM will exit.\n\n");
> > >    printf( "-v\n"
> > >            "--verbose\n"
> > >            "          This option increases the log verbosity level.\n"
> > > @@ -460,7 +466,7 @@ main(
> > >    boolean_t             cache_options = FALSE;
> > >    char                 *ignore_guids_file_name = NULL;
> > >    uint32_t              val;
> > > -  const char * const    short_option = "i:f:ed:g:l:s:t:a:uvVhorc";
> > > +  const char * const    short_option = "i:f:ed:g:l:s:t:a:uvVhorcy";
> > >
> > >    /*
> > >      In the array below, the 2nd parameter specified the number
> > > @@ -492,6 +498,7 @@ main(
> > >        {  "updn",          0, NULL, 'u'},
> > >        {  "add_guid_file", 1, NULL, 'a'},
> > >        {  "cache-options", 0, NULL, 'c'},
> > > +      {  "stay_on_fatal", 0, NULL, 'y'},
> > >        {  NULL,            0, NULL,  0 }  /* Required at the end of the array */
> > >      };
> > >
> > > @@ -665,6 +672,11 @@ main(
> > >        printf(" Creating new log file\n");
> > >        break;
> > >
> > > +    case 'y':
> > > +      opt.exit_on_fatal = FALSE;
> > > +      printf(" Staying on fatal initialization\n");
> > > +      break;
> > > +
> > >      case 'v':
> > >        log_flags = (log_flags <<1 )|1;
> > >        printf(" Verbose option -v (log flags = 0x%X)\n", log_flags );
> > >
>





