[ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback
Eitan Zahavi
eitan at mellanox.co.il
Tue Jul 24 11:12:10 PDT 2007
Hi Hal,
The code to find "duplicated" GUIDs stems from real user cases where a
flawed burning procedure caused actual GUID duplications. There is nothing
"impossible" about it.
So it is really critical that the SM be able to recognize this case
and abort.
It might be that for testing someone wants to use a loopback plug that
causes the same port GUID to appear on both sides of the link - but it is
better to require the user doing the test to set some flag than to miss
such a situation in a real-life cluster.
This requirement was written after many people wasted many hours trying
to figure out what was going on.
PLEASE DO NOT TAKE IT AWAY
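A minimal sketch of the opt-in policy argued for above, assuming a hypothetical `allow_loopback` option: a port that sees itself as its own peer is fatal by default and is only tolerated when the tester has explicitly asked for it. The names `sketch_opts`, `sketch_verdict` and `sketch_check_self_link` are illustrative only and are not part of OpenSM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of the self-link check; not an OpenSM type. */
enum sketch_verdict {
    SKETCH_LINK_OK,               /* peer is a different port */
    SKETCH_LINK_IGNORED_LOOPBACK, /* self-link tolerated by explicit opt-in */
    SKETCH_DUPLICATE_GUID_ABORT   /* self-link treated as duplicated GUID */
};

/* Hypothetical options record; allow_loopback is an assumed flag. */
struct sketch_opts {
    bool allow_loopback;
};

/*
 * Classify a link given the local port and the peer it reports.
 * Port 0 is exempt, mirroring the "port_num != 0" test in the
 * __osm_ni_rcv_set_links excerpt quoted later in this thread.
 */
static enum sketch_verdict
sketch_check_self_link(const struct sketch_opts *opts,
                       uint64_t node_guid, uint8_t port_num,
                       uint64_t peer_guid, uint8_t peer_port_num)
{
    assert(opts != NULL);

    bool self_link = (port_num != 0) &&
                     (node_guid == peer_guid) &&
                     (port_num == peer_port_num);

    if (!self_link)
        return SKETCH_LINK_OK;

    /* Default stays strict: without the opt-in, the SM should abort. */
    return opts->allow_loopback ? SKETCH_LINK_IGNORED_LOOPBACK
                                : SKETCH_DUPLICATE_GUID_ABORT;
}
```

The point of this design is that a real duplicated GUID is never silently accepted; the relaxed behaviour has to be requested by whoever plugs in the loopback connector.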
Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL
________________________________
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com]
Sent: Tuesday, July 24, 2007 6:04 PM
To: Eitan Zahavi
Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny Kliteynik
Subject: Re: OpenSM detection of duplicated GUIDs on loopback
On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote:
From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com ]
Sent: Tuesday, July 24, 2007 5:53 PM
To: Eitan Zahavi
Cc: OpenFabrics General; Sasha Khapyorsky; Yevgeny
Kliteynik
Subject: Re: OpenSM detection of duplicated GUIDs on
loopback
Hi Eitan,
On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il >
wrote:
Hi Hal,
What is this "loopback" connector used
for?
Does not seem to me like a very useful
thing to do.
Perhaps not but no reason OpenSM can't handle
this more gracefully.
Anyway, if it is not a production
environment we could add a "debug mode" (-d flag option) to ignore this
check.
Why would a separate flag be needed ?
[EZ] Since I do not see any other solution for
the SM to know it is really a loopback plug rather than two devices
with the same GUID connected back to back ...
"Technically", this should only occur when looped back and not
two devices with same GUID as GUID == globally unique and a duplication
indicates a "manufacturing" issue.
Anyhow, can't these be treated the same (and handled more
gracefully) without an additional option/flag ?
-- Hal
-- Hal
Eitan Zahavi
Senior Engineering Director, Software
Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL
________________________________
From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com]
Sent: Tuesday, July 24, 2007 5:31 PM
To: OpenFabrics General
Cc: Sasha Khapyorsky; Eitan Zahavi;
Yevgeny Kliteynik
Subject: OpenSM detection of duplicated
GUIDs on loopback
Hi,
This is what starts off as a "minor"
issue and I know it has been discussed somewhat in the past:
Putting a loopback connector on a
(switch) link causes OpenSM to indicate duplicated GUID error 0D18 as
follows:
__osm_ni_rcv_set_links
{
...
  /*
    When there are only two nodes with exact same guids (connected back
    to back) - the previous check for duplicated guid will not catch
    them. But the link will be from the port to itself...
    Enhanced Port 0 is an exception to this
  */
  if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
      (port_num == p_ni_context->port_num) &&
      (port_num != 0))
  {
    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
             "__osm_ni_rcv_set_links: ERR 0D18: "
             "Duplicate GUID found by link from a port to itself:"
             "node 0x%" PRIx64 ", port number 0x%X\n",
             cl_ntoh64( osm_node_get_node_guid( p_node ) ),
             port_num );
...
So this occurs over and over and over
and fills the log with the same spew. This should be improved IMO.
Is this really a fatal condition ?
Doesn't seem like it should be to me.
Also, OpenSM can "ride" this out with -y
(stay on fatal) but is that safe for this condition ?
Seems like something like an extra
loopback bit should be added to some port structure which should cause
these links to be ignored. This bit would then be reset when the peer is
no longer itself.
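A rough sketch of the loopback-bit idea, under stated assumptions: `sketch_port` is a hypothetical, simplified port record (not the real OpenSM port structure), and `sketch_update_loopback` is an illustrative helper. The bit is set when the reported peer is the port itself and cleared again when the peer becomes anything else. As the replies in this thread point out, this test alone cannot tell a loopback plug from two back-to-back devices carrying the same GUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified port record; not the real OpenSM structure. */
struct sketch_port {
    uint64_t node_guid;   /* GUID of the node that owns this port */
    uint8_t  port_num;    /* local port number */
    bool     is_loopback; /* the proposed "extra loopback bit" */
};

/*
 * Refresh the loopback bit from the peer reported on the link.
 * Returns true when the link should be ignored because the port is
 * its own peer. Port 0 is exempt, mirroring the "port_num != 0"
 * condition in the quoted __osm_ni_rcv_set_links excerpt.
 */
static bool sketch_update_loopback(struct sketch_port *p,
                                   uint64_t peer_node_guid,
                                   uint8_t peer_port_num)
{
    assert(p != NULL);

    /* Assigning on every call also resets the bit once the peer changes. */
    p->is_loopback = (p->port_num != 0) &&
                     (p->node_guid == peer_node_guid) &&
                     (p->port_num == peer_port_num);
    return p->is_loopback;
}
```

Because the bit is recomputed each time the peer is reported, removing the loopback connector and attaching a real peer clears it without any extra bookkeeping.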
Also, is there a relationship of this
with the 12x/duplicated GUID code ?
Thanks.
-- Hal