[ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback

Hal Rosenstock hal.rosenstock at gmail.com
Tue Jul 24 11:38:24 PDT 2007


On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> Hi Hal,
>
> The code to find "duplicated" GUIDs stems from real user cases where a
> flawed burning procedure caused actual GUID duplications. There is nothing
> "impossible".
>

No one said impossible, just that it violates what globally unique (the GU
in GUID) really means. We are only discussing this aspect because vendors
allowed users to program GUIDs into non-volatile RAM rather than relying on
a real manufacturing process that guarantees uniqueness.

> So it is really critical that the SM be able to recognize this case
> and abort.
>

I agree with the detect part but not the abort part. Why can't it report
these errors and continue on? That seems better to me than aborting.
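
To make "report and continue" concrete, here is a rough sketch in C. It is
not OpenSM code: struct sketch_port, sketch_check_link and the SKETCH_LINK_*
values are hypothetical names made up for illustration. It assumes the
loopback bit suggested in the original post quoted below, logs the 0D18
condition only once per port instead of on every sweep, skips the link
rather than aborting, and clears the bit when the peer is no longer the
port itself.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified port record; NOT an OpenSM structure. */
struct sketch_port {
    uint64_t node_guid;        /* GUID of the node owning this port */
    uint8_t  port_num;         /* port number on that node */
    bool     self_link;        /* "loopback bit": peer is the port itself */
    bool     self_link_logged; /* the error was already reported once */
};

enum sketch_link_action { SKETCH_LINK_ADD, SKETCH_LINK_SKIP };

/*
 * Decide what to do with a link reported between `port` and the peer
 * identified by (peer_node_guid, peer_port_num). Port 0 is exempt,
 * mirroring the "Enhanced Port 0" exception in the quoted check.
 */
static enum sketch_link_action
sketch_check_link(struct sketch_port *port, uint64_t peer_node_guid,
                  uint8_t peer_port_num, FILE *log)
{
    bool is_self = port->port_num != 0 &&
                   port->node_guid == peer_node_guid &&
                   port->port_num == peer_port_num;

    if (!is_self) {
        /* Peer is a different port: reset the loopback state. */
        port->self_link = false;
        port->self_link_logged = false;
        return SKETCH_LINK_ADD;
    }

    port->self_link = true;
    if (!port->self_link_logged) {
        /* Report once, then stay quiet until the condition clears. */
        fprintf(log,
                "ERR 0D18: Duplicate GUID found by link from a port to "
                "itself: node 0x%" PRIx64 ", port number 0x%X\n",
                port->node_guid, (unsigned)port->port_num);
        port->self_link_logged = true;
    }
    return SKETCH_LINK_SKIP; /* ignore the link, keep sweeping */
}
```

Note that this treats a loopback plug and two back-to-back devices with the
same GUID identically, since the check alone cannot tell them apart.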

-- Hal


> It might be that for testing someone wants to use a loopback plug that
> causes the same port GUID to appear on both sides of the link, but it is
> better to require the user doing the test to set some flag than to miss
> such a situation in a real-life cluster.
>
> This requirement was written after many people wasted many hours trying
> to figure out what was going on.
> PLEASE DO NOT TAKE IT AWAY
>
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>  ------------------------------
> From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> Sent: Tuesday, July 24, 2007 6:04 PM
> To: Eitan Zahavi
> Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> Subject: Re: OpenSM detection of duplicated GUIDs on loopback
>
>
>
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > Sent: Tuesday, July 24, 2007 5:53 PM
> > To: Eitan Zahavi
> > Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > Subject: Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> > Hi Eitan,
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > >
> > > Hi Hal,
> > >
> > > What is this "loopback" connector used for?
> > > Does not seem to me like a very useful thing to do.
> > >
> >
> > Perhaps not but no reason OpenSM can't handle this more gracefully.
> >
> > > Anyway, if it is not a production environment we could add a "debug
> > > mode" (-d flag option) to ignore this check.
> > >
> >
> > Why would a separate flag be needed?
> > [EZ] Since I do not see any other solution for the SM to know it is
> > really a loopback plug rather than two devices with the same GUID
> > connected back to back ...
> >
> >
> "Technically", this should only occur when looped back and not two devices
> with same GUID as GUID == globally unique and a duplication indicates a
> "manufacturing" issue.
>
> Anyhow, can't these be treated the same (and handled more gracefully)
> without an additional option/flag ?
>
> -- Hal
>
>
> > -- Hal
> >
> >
> > >
> > > Eitan Zahavi
> > > Senior Engineering Director, Software Architect
> > > Mellanox Technologies LTD
> > > Tel:+972-4-9097208
> > > Fax:+972-4-9593245
> > > P.O. Box 586 Yokneam 20692 ISRAEL
> > >
> > >
> > >  ------------------------------
> > > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > > Sent: Tuesday, July 24, 2007 5:31 PM
> > > To: OpenFabrics General
> > > Cc: Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > > Subject: OpenSM detection of duplicated GUIDs on loopback
> > >
> > >
> > >  Hi,
> > >
> > > > This is what starts off as a "minor" issue and I know it has been
> > > > discussed somewhat in the past:
> > >
> > > Putting a loopback connector on a (switch) link causes OpenSM to
> > > indicate duplicated GUID error 0D18 as follows:
> > >
> > > __osm_ni_rcv_set_links
> > > {
> > > ...
> > >           /*
> > >              When there are only two nodes with exact same guids
> > > (connected back
> > >              to back) - the previous check for duplicated guid will
> > > not catch
> > >              them. But the link will be from the port to itself...
> > >              Enhanced Port 0 is an exception to this
> > >           */
> > >           if ((osm_node_get_node_guid( p_node ) ==
> > > p_ni_context->node_guid) &&
> > >               (port_num == p_ni_context->port_num) &&
> > >               (port_num != 0))
> > >           {
> > >             osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > >                      "__osm_ni_rcv_set_links: ERR 0D18: "
> > >                      "Duplicate GUID found by link from a port to
> > > itself:"
> > >                      "node 0x%" PRIx64 ", port number 0x%X\n",
> > >                      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> > >                      port_num );
> > > ...
> > >
> > > So this occurs over and over and over and fills the log with the same
> > > spew. This should be improved IMO.
> > >
> > > Is this really a fatal condition ? Doesn't seem like it should be to
> > > me.
> > >
> > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is that
> > > safe for this condition ?
> > >
> > > > Seems like something like an extra loopback bit should be added to
> > > > some port structure which should cause these links to be ignored. This bit
> > > > would then be reset when the peer is no longer itself.
> > >
> > > Also, is there a relationship of this with the 12x/duplicated GUID
> > > code ?
> > >
> > > Thanks.
> > >
> > > -- Hal
> > >
> > >
> >
>