[ofa-general] Re: OpenSM detection of duplicated GUIDs on loopback
Hal Rosenstock
hal.rosenstock at gmail.com
Wed Jul 25 10:46:31 PDT 2007
On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
>
> On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
> >
> > *Maybe avoid the log if -y is provided?*
> >
> That avoids the spew, but the duplicated GUID is important to know about, so IMO
> something in the "middle" is needed, where duplicated GUIDs are logged but
> the same ones are not logged continually.
> *[EZ] OK, so in -y mode only, we track which ones were reported and do not
> repeat the log?*
>
>
Any good ideas on how to accomplish this ?
-- Hal
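
One possible shape for the "log each duplicated GUID only once" idea discussed above is a small table of already-reported (node GUID, port number) pairs that is consulted before the ERR 0D18 message is emitted. The sketch below is only an illustration under stated assumptions: every name in it (dup_report_set_t, dup_report_should_log, dup_report_forget, DUP_REPORT_MAX) is hypothetical and not part of OpenSM, and the fixed-size linear table is chosen for brevity rather than as a recommendation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; none of them exist in OpenSM. */
#define DUP_REPORT_MAX 64

typedef struct dup_report_entry {
	uint64_t node_guid;	/* node GUID in host byte order */
	uint8_t port_num;	/* port that is linked to itself */
} dup_report_entry_t;

typedef struct dup_report_set {
	dup_report_entry_t entries[DUP_REPORT_MAX];
	size_t count;
} dup_report_set_t;

void dup_report_init(dup_report_set_t *set)
{
	assert(set != NULL);
	set->count = 0;
}

/*
 * Returns true the first time a (GUID, port) pair is seen and false on
 * repeats. If the table is full, the pair is not remembered and true is
 * returned, so a message is repeated rather than silently dropped.
 */
bool dup_report_should_log(dup_report_set_t *set, uint64_t node_guid,
			   uint8_t port_num)
{
	size_t i;

	assert(set != NULL);
	for (i = 0; i < set->count; i++) {
		if (set->entries[i].node_guid == node_guid &&
		    set->entries[i].port_num == port_num)
			return false;	/* already reported */
	}
	if (set->count < DUP_REPORT_MAX) {
		set->entries[set->count].node_guid = node_guid;
		set->entries[set->count].port_num = port_num;
		set->count++;
	}
	return true;
}

/*
 * Forgets a pair, for example once the port's peer is no longer itself,
 * so that a later recurrence is logged again.
 */
void dup_report_forget(dup_report_set_t *set, uint64_t node_guid,
		       uint8_t port_num)
{
	size_t i;

	assert(set != NULL);
	for (i = 0; i < set->count; i++) {
		if (set->entries[i].node_guid == node_guid &&
		    set->entries[i].port_num == port_num) {
			/* Fill the hole with the last entry. */
			set->entries[i] = set->entries[set->count - 1];
			set->count--;
			return;
		}
	}
}
```

Under these assumptions, the existing osm_log() call for ERR 0D18 would be wrapped in a check such as "if (dup_report_should_log(&set, guid, port_num))", where the table lives wherever the SM keeps per-subnet state. Whether the check applies only in -y mode, as suggested above, is a policy decision that the sketch leaves open.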
> > *Eitan Zahavi*
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> > ------------------------------
> > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > *Sent:* Tuesday, July 24, 2007 9:56 PM
> > *To:* Eitan Zahavi
> > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> >
> >
> >
> >
> > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> >
> > > *Hi Hal,*
> > > *For many users such a critical failure (one the SM cannot really do
> > > anything about) is better aborted than forgotten in some log file.*
> > > *Anyway, the -y flag lets you ignore it if you like.*
> > >
> >
> > So everything else continues to work fine with -y ? In which case, I'm
> > not sure which is the better default.
> >
> > Users certainly won't like their logs filling up with continuous
> > duplicated GUID messages. The log spew should be cleaned up IMO.
> >
> > -- Hal
> >
> >
> >
> >
> >
> > > *Eitan Zahavi*
> > >
> > >
> > > ------------------------------
> > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > *Sent:* Tuesday, July 24, 2007 9:38 PM
> > > *To:* Eitan Zahavi
> > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > >
> > >
> > >
> > >
> > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > >
> > > > *Hi Hal,*
> > > > *The code to find "duplicated" GUIDs stems from real user cases where
> > > > a flawed burning procedure caused actual GUID duplications. Nothing is
> > > > "impossible".*
> > > >
> > >
> > > No one said impossible; just a violation of what globally unique (the GU
> > > in GUID) really means. It's largely because vendors allowed users to
> > > program non-volatile RAM for GUIDs, rather than using a real manufacturing
> > > process that guarantees uniqueness, that we are even discussing this
> > > aspect of it.
> > >
> > > *So it is really critical that the SM will be able to recognize this
> > > > case and abort.*
> > > >
> > >
> > > I agree with the detect part but not the abort part. Why can't it
> > > report these errors and continue on ? That seems better to me than aborting.
> > >
> > > -- Hal
> > >
> > >
> > > > *It might be that for testing someone wants to use a loopback plug
> > > > that causes the same port GUID to appear on both sides of the link,
> > > > but it is better to require the user doing the test to set some flag
> > > > than to miss such a situation in a real-life cluster.*
> > > > *This requirement was written after many people wasted many hours
> > > > trying to figure out what was going on.*
> > > > *PLEASE DO NOT TAKE IT AWAY*
> > > >
> > > > *Eitan Zahavi*
> > > >
> > > >
> > > > ------------------------------
> > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > > *Sent:* Tuesday, July 24, 2007 6:04 PM
> > > > *To:* Eitan Zahavi
> > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > > >
> > > >
> > > >
> > > >
> > > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > > >
> > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
> > > > > *Sent:* Tuesday, July 24, 2007 5:53 PM
> > > > > *To:* Eitan Zahavi
> > > > > *Cc:* OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
> > > > > *Subject:* Re: OpenSM detection of duplicated GUIDs on loopback
> > > > >
> > > > >
> > > > >
> > > > > Hi Eitan,
> > > > >
> > > > > On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il > wrote:
> > > > > >
> > > > > > *Hi Hal,*
> > > > > > **
> > > > > > *What is this "loopback" connector used for?*
> > > > > > *Does not seem to me like a very useful thing to do.*
> > > > > >
> > > > > **
> > > > > Perhaps not but no reason OpenSM can't handle this more
> > > > > gracefully.
> > > > >
> > > > > *Anyway, if it is not a production environment we could add a
> > > > > > "debug mode" (-d flag option) to ignore this check.*
> > > > > >
> > > > > **
> > > > > Why would a separate flag be needed ?
> > > > > *[EZ] Since I do not see any other solution for the SM to know it
> > > > > is really a loop back plug rather then two devices with same GUID connected
> > > > > back to back ... *
> > > > >
> > > > >
> > > > "Technically", this should only occur when looped back and not two
> > > > devices with same GUID as GUID == globally unique and a duplication
> > > > indicates a "manufacturing" issue.
> > > >
> > > > Anyhow, can't these be treated the same (and handled more
> > > > gracefully) without an additional option/flag ?
> > > >
> > > > -- Hal
> > > >
> > > >
> > > > > -- Hal
> > > > >
> > > > > **
> > > > > >
> > > > > > *Eitan Zahavi***
> > > > > > Senior Engineering Director, Software Architect
> > > > > > Mellanox Technologies LTD
> > > > > > Tel:+972-4-9097208
> > > > > > Fax:+972-4-9593245
> > > > > > P.O. Box 586 Yokneam 20692 ISRAEL
> > > > > >
> > > > > >
> > > > > > ------------------------------
> > > > > > *From:* Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
> > > > > > > *Sent:* Tuesday, July 24, 2007 5:31 PM
> > > > > > *To:* OpenFabrics General
> > > > > > *Cc:* Sasha Khapyorsky; Eitan Zahavi; Yevgeny Kliteynik
> > > > > > *Subject:* OpenSM detection of duplicated GUIDs on loopback
> > > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > This is what starts off as a "minor" issue and I know it has
> > > > > > > been discussed somewhat in the past:
> > > > > >
> > > > > > Putting a loopback connector on a (switch) link causes OpenSM to
> > > > > > indicate duplicated GUID error 0D18 as follows:
> > > > > >
> > > > > > > __osm_ni_rcv_set_links
> > > > > > > {
> > > > > > >   ...
> > > > > > >   /*
> > > > > > >     When there are only two nodes with exact same guids (connected back
> > > > > > >     to back) - the previous check for duplicated guid will not catch
> > > > > > >     them. But the link will be from the port to itself...
> > > > > > >     Enhanced Port 0 is an exception to this
> > > > > > >   */
> > > > > > >   if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
> > > > > > >       (port_num == p_ni_context->port_num) &&
> > > > > > >       (port_num != 0))
> > > > > > >   {
> > > > > > >     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> > > > > > >              "__osm_ni_rcv_set_links: ERR 0D18: "
> > > > > > >              "Duplicate GUID found by link from a port to itself:"
> > > > > > >              "node 0x%" PRIx64 ", port number 0x%X\n",
> > > > > > >              cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> > > > > > >              port_num );
> > > > > > >   ...
> > > > > >
> > > > > > So this occurs over and over and over and fills the log with the
> > > > > > same spew. This should be improved IMO.
> > > > > >
> > > > > > Is this really a fatal condition ? Doesn't seem like it should
> > > > > > be to me.
> > > > > >
> > > > > > Also, OpenSM can "ride" this out with -y (stay on fatal) but is
> > > > > > that safe for this condition ?
> > > > > >
> > > > > > > Seems like something like an extra loopback bit should be added
> > > > > > > to some port structure, which should cause these links to be ignored.
> > > > > > > This bit would then be reset when the peer is no longer itself.
> > > > > >
> > > > > > Also, is there a relationship of this with the 12x/duplicated
> > > > > > GUID code ?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070725/91ce64d6/attachment.html>
More information about the general
mailing list