[ofa-general] RE: OpenSM detection of duplicated GUIDs on loopback

Eitan Zahavi eitan at mellanox.co.il
Tue Jul 24 13:25:32 PDT 2007


 

	On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il> wrote: 

		Maybe  avoid the log if -y is provided?

	 
	That avoids the spew, but the duplicated GUID is important to
know about, so IMO something in the "middle" is needed, where duplicated
GUIDs are logged but the same ones are not logged continually.
	[EZ]  
	OK so in -y mode only we track which ones were reported and do
not repeat the log?
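
A minimal sketch of the "track which ones were reported" idea, for illustration only. Every name here (dup_report_table_t, dup_should_report, report_duplicate, DUP_TABLE_MAX) is hypothetical and not part of OpenSM; the fixed-size table and the use of fprintf in place of osm_log are simplifying assumptions.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names throughout; none of this is OpenSM API. */
#define DUP_TABLE_MAX 64

typedef struct dup_report_table {
    uint64_t node_guid[DUP_TABLE_MAX];
    uint8_t  port_num[DUP_TABLE_MAX];
    size_t   count;
} dup_report_table_t;

/* Returns true only the first time a (guid, port) pair is seen. */
bool dup_should_report(dup_report_table_t *t, uint64_t guid, uint8_t port)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->node_guid[i] == guid && t->port_num[i] == port)
            return false;   /* already reported once: stay quiet */
    }
    if (t->count < DUP_TABLE_MAX) {
        t->node_guid[t->count] = guid;
        t->port_num[t->count] = port;
        t->count++;
    }
    /* If the table is full the pair is not remembered, so it keeps
       being reported rather than being silently dropped. */
    return true;
}

/* Logs the duplicate once per (guid, port); returns true if it logged. */
bool report_duplicate(dup_report_table_t *t, uint64_t guid, uint8_t port)
{
    if (!dup_should_report(t, guid, port))
        return false;
    fprintf(stderr,
            "ERR 0D18: Duplicate GUID found by link from a port to itself: "
            "node 0x%" PRIx64 ", port number 0x%X\n",
            guid, (unsigned)port);
    return true;
}
```

Under this sketch the table would presumably be consulted only when -y is in effect, and would need clearing whenever the duplicate goes away so that a recurrence is logged again.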
	 


		Eitan Zahavi 
		Senior Engineering Director, Software Architect 
		Mellanox Technologies LTD 
		Tel:+972-4-9097208
		Fax:+972-4-9593245 
		P.O. Box 586 Yokneam 20692 ISRAEL 

		 
		
		

________________________________

			From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
			Sent: Tuesday, July 24, 2007 9:56 PM 
			
			To: Eitan Zahavi
			Cc: OpenFabrics General; Sasha Khapyorsky;
Yevgeny Kliteynik
			Subject: Re: OpenSM detection of duplicated
GUIDs on loopback
			

			 
			


			On 7/24/07, Eitan Zahavi <eitan at mellanox.co.il >
wrote:

				Hi Hal,
				 
				For many users such a critical failure
(one the SM cannot really do anything about) is better aborted than
forgotten in some log file.
				Anyway, the -y flag lets you ignore it
if you like.

			 
			So everything else continues to work fine with
-y ? In which case, I'm not sure which is the better default.
			 
			Users certainly won't like their logs filling up
with continuous duplicated GUID messages. The log spew should be cleaned
up IMO.
			 
			-- Hal

			 


			 


				 


________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 9:38 PM 
				
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback
				

				 
				


				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				Hi Hal,
				 
				The code to find "duplicated" GUIDs stems
from real user cases where a flawed
				burning procedure caused actual GUID
duplications. There is nothing "impossible".

				 
				No one said impossible; just a violation
of what globally unique (GU from GUID) really means. We are even
discussing this aspect of it largely because vendors allowed users to
program non-volatile RAM for GUIDs, rather than using a real
manufacturing process that guarantees uniqueness.


				So it is really critical that the SM
be able to recognize this case and abort.

				 
				I agree with the detect part but not the
abort part. Why can't it report these errors and continue on ? That
seems better to me than aborting.
				 
				-- Hal


				 
				It might be that for testing someone
wants to use a loopback plug that causes the same
				port GUID to appear on both sides of the
link - but it is better to require the user doing the test
				to set some flag than to miss such a
situation in a real-life cluster.
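
The policy argued for here (always detect, abort by default, continue only when the user has explicitly opted in) could be sketched as below. This is an illustration under stated assumptions, not OpenSM code: link_ends_t, dup_action_t, link_is_port_to_itself, dup_guid_action and the user_allows_loopback parameter are all hypothetical names. Only the comparison itself mirrors the check in __osm_ni_rcv_set_links quoted later in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and names; not OpenSM API. */
typedef struct link_ends {
    uint64_t local_guid;
    uint8_t  local_port;
    uint64_t peer_guid;
    uint8_t  peer_port;
} link_ends_t;

typedef enum {
    DUP_ACTION_NONE,              /* ordinary link, nothing to do      */
    DUP_ACTION_LOG_AND_CONTINUE,  /* user opted in: report and move on */
    DUP_ACTION_ABORT              /* default: treat as fatal           */
} dup_action_t;

/* Same comparison as the quoted check: same node GUID, same port
   number, and not port 0 (Enhanced Port 0 is the exception). */
bool link_is_port_to_itself(const link_ends_t *l)
{
    assert(l != NULL);
    return l->local_guid == l->peer_guid &&
           l->local_port == l->peer_port &&
           l->local_port != 0;
}

/* Detection is unconditional; only the reaction depends on the flag. */
dup_action_t dup_guid_action(const link_ends_t *l, bool user_allows_loopback)
{
    if (!link_is_port_to_itself(l))
        return DUP_ACTION_NONE;
    return user_allows_loopback ? DUP_ACTION_LOG_AND_CONTINUE
                                : DUP_ACTION_ABORT;
}
```

Keeping detection separate from the reaction means a real duplicate in a production cluster is still caught even when the test flag exists.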
				 
				This requirement was written after many
people wasted many hours trying to figure out what was going on.
				PLEASE DO NOT TAKE IT AWAY
				
				 


				 


________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 6:04 PM 
				
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback
				

				 
				


				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com ] 
				Sent: Tuesday, July 24, 2007 5:53 PM
				To: Eitan Zahavi
				Cc: OpenFabrics General; Sasha
Khapyorsky; Yevgeny Kliteynik
				Subject: Re: OpenSM detection of
duplicated GUIDs on loopback 
				
				 

				Hi Eitan,
				
				
				On 7/24/07, Eitan Zahavi
<eitan at mellanox.co.il > wrote: 

				Hi Hal,
				 
				What is this "loopback" connector used
for?
				Does not seem to me like a very useful
thing to do.

				 
				Perhaps not but no reason OpenSM can't
handle this more gracefully.


				Anyway, if it is not a production
environment we could add a "debug mode" (-d flag option) to ignore this
check.

				 
				Why would a separate flag be needed ?
				[EZ] Since I do not see any other
way for the SM to know it is really a loopback plug rather than
two devices with the same GUID connected back to back ...

				 
				"Technically", this should only occur
when looped back and not two devices with same GUID as GUID == globally
unique and a duplication indicates a "manufacturing" issue.
				 
				Anyhow, can't these be treated the same
(and handled more gracefully) without an additional option/flag ?
				 
				-- Hal


				
				 
				-- Hal


				 


				 


________________________________

				From: Hal Rosenstock
[mailto:hal.rosenstock at gmail.com] 
				Sent: Tuesday, July 24, 2007 5:31 PM
				To: OpenFabrics General
				Cc: Sasha Khapyorsky; Eitan Zahavi;
Yevgeny Kliteynik
				Subject: OpenSM detection of duplicated
GUIDs on loopback
				
				 
				
				Hi,
				 
				This is what starts off as a "minor"
issue and I know it has been discussed somewhat in the past:
				 
				Putting a loopback connector on a
(switch) link causes OpenSM to indicate duplicated GUID error 0D18 as
follows:
				
				__osm_ni_rcv_set_links
				{
				...
				          /*
				             When there are only two nodes with exact same guids
				             (connected back to back) - the previous check for
				             duplicated guid will not catch them. But the link
				             will be from the port to itself...
				             Enhanced Port 0 is an exception to this
				          */
				          if ((osm_node_get_node_guid( p_node ) == p_ni_context->node_guid) &&
				              (port_num == p_ni_context->port_num) &&
				              (port_num != 0))
				          {
				            osm_log( p_rcv->p_log, OSM_LOG_ERROR,
				                     "__osm_ni_rcv_set_links: ERR 0D18: "
				                     "Duplicate GUID found by link from a port to itself:"
				                     "node 0x%" PRIx64 ", port number 0x%X\n",
				                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
				                     port_num );
				...
				
				So this occurs over and over and over
and fills the log with the same spew. This should be improved IMO. 
				
				Is this really a fatal condition ?
Doesn't seem like it should be to me. 
				 
				Also, OpenSM can "ride" this out with -y
(stay on fatal) but is that safe for this condition ?
				 
				Seems like something like an extra
loopback bit should be added to some port structure which should cause
these links to be ignored. This bit would then be reset when the peer is
no longer itself.
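
A rough sketch of the loopback-bit idea, assuming a simplified port record. sketch_port_t, its is_loopback field, and port_update_loopback are hypothetical names for illustration and do not correspond to actual OpenSM structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified port record; not an OpenSM structure. */
typedef struct sketch_port {
    uint64_t node_guid;
    uint8_t  port_num;
    bool     is_loopback;   /* the "extra loopback bit" */
} sketch_port_t;

/* Called whenever the peer of this port is (re)discovered.
   Sets the bit when the peer is the port itself, and clears it as
   soon as the peer is anything else. Returns true when the link
   should be ignored. Port 0 is excluded, as in the existing check. */
bool port_update_loopback(sketch_port_t *p, uint64_t peer_guid, uint8_t peer_port)
{
    assert(p != NULL);
    p->is_loopback = (peer_guid == p->node_guid) &&
                     (peer_port == p->port_num) &&
                     (p->port_num != 0);
    return p->is_loopback;
}
```

Because the bit is recomputed on every discovery, removing the loopback plug and attaching a real peer clears it without any extra bookkeeping.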
				
				Also, is there a relationship of this
with the 12x/duplicated GUID code ? 
				 
				Thanks.
				 
				-- Hal





