[openib-general] RE: [PATCH] Opensm - duplicated guids handling

Yael Kalka yael at mellanox.co.il
Sun Jan 22 00:29:51 PST 2006


Hi Hal,
The configuration is 2 HCAs with duplicated GUIDs on 2 different machines,
connected back-2-back.

Regarding the duplicated guids issue itself - as you said, it is a fundamental
violation. The problem is that we've had cases where such a violation occurred,
and since OpenSM didn't give a clear enough error message, time was wasted
trying to debug why OpenSM didn't configure the subnet correctly.
I have tested that the different cases of guid duplication are handled,
both on subnets with switches and on back-2-back machines.
The back-2-back case was the one problem left.
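
To illustrate the back-2-back case: when both HCAs carry the same GUID, the
link looks to the SM like a link from a port to itself. Below is a rough
sketch of such a check, not the actual code in osm_node_info_rcv.c. The
port_id_t type and the is_self_link / report_self_link names are hypothetical
and used only for illustration; the exit_on_fatal parameter mirrors the option
tested in the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified identity of one end of a link. */
typedef struct {
	uint64_t node_guid;
	uint8_t port_num;
} port_id_t;

/* A link whose two ends report the same GUID and port number looks like
 * a port connected to itself, which suggests a duplicated GUID. */
static bool is_self_link(port_id_t local, port_id_t remote)
{
	return local.node_guid == remote.node_guid &&
	       local.port_num == remote.port_num;
}

/* Print a clear error for a self link. Returns true when the caller
 * should terminate, i.e. a self link was found and exit_on_fatal is set. */
static bool report_self_link(port_id_t local, port_id_t remote,
			     bool exit_on_fatal)
{
	if (!is_self_link(local, remote))
		return false;

	fprintf(stderr,
		"Errors on subnet. Duplicate GUID 0x%016llx found "
		"by link from a port to itself.\n",
		(unsigned long long)local.node_guid);
	return exit_on_fatal;
}
```

A caller would call exit(1) when report_self_link() returns true, and
otherwise continue with the discovery.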

Using O_NONBLOCK works fine for me. I will send a separate patch with this
fix instead of the original one.

Thanks,
Yael

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com]
Sent: Thursday, January 19, 2006 4:51 PM
To: Yael Kalka
Cc: openib-general at openib.org; Eitan Zahavi
Subject: Re: [PATCH] Opensm - duplicated guids handling


Hi Yael,

On Thu, 2006-01-19 at 07:08, Yael Kalka wrote:
> Hi Hal,
> 
> We've noticed that currently if we have 2 hcas with duplicated guids

I renew my comment about duplicated GUIDs. This is a pretty fundamental
thing that MUST not be violated per the IBA spec. I understand there are
processes in place that make the duplication more error prone than it
should be.

If we go down this path, there are other things that fall into this
category and I believe this to be a slippery slope.

I am still willing to go ahead with this patch or some variant of it.
Some questions embedded in the patch.

> connected back-2-back, opensm gets stuck.

Not sure I quite understand the configuration. Are the two HCAs with the
duplicated guids in the same machine and connected back to back ? Is
that the case you are referring to ?

>  The reason for that is that
> in osm_vendor_set_sm() function - the second call trying to open the
> /dev/infiniband/issm%d is stuck, since this file is already open.
> The following patch fixes 2 things -
> 1. In osm_node_info_rcv.c - when duplicated guids are found, exit
> (unless a flag is set otherwise). Add this exiting code also to the
> case where the nodes are connected back-2-back.
> 2. In osm_vendor_ibumad.c - add a static variable to avoid trying to
> open the /dev/infiniband/issm%d file twice during the run of opensm.

The problem is that the second open hangs, right ? So rather than making
the changes to osm_vendor_ibumad.c below, how about changing the flags on
the open from 0 to O_NONBLOCK ? Does that work for you ?

If so, I will commit that approach with the change below to
osm_node_info_rcv.c. Please let me know. Thanks.
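
A minimal sketch of the O_NONBLOCK approach, not the committed code. The
helper names build_issm_path and open_issm_nonblocking and the dev_dir
parameter are hypothetical, added only so the example stands on its own. It
assumes that with O_NONBLOCK set the second open fails with an error instead
of hanging; the exact errno is not stated in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format "<dev_dir>/issm<port_id>" into buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes. */
static int build_issm_path(char *buf, size_t len, const char *dev_dir,
			   int port_id)
{
	int n = snprintf(buf, len, "%s/issm%d", dev_dir, port_id);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: open the issm device without blocking.
 * Flags 0 in the original call equal O_RDONLY; only O_NONBLOCK is added.
 * Returns the file descriptor, or -1 with errno set. */
static int open_issm_nonblocking(const char *dev_dir, int port_id)
{
	char path[64];

	if (build_issm_path(path, sizeof(path), dev_dir, port_id) != 0) {
		errno = ENAMETOOLONG;
		return -1;
	}
	return open(path, O_RDONLY | O_NONBLOCK);
}
```

The caller would pass "/dev/infiniband" and p_vend->umad_port_id, and log
errno when -1 comes back rather than waiting on the open.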

-- Hal

> Signed-off-by:  Yael Kalka <yael at mellanox.co.il>
> 
> Index: libvendor/osm_vendor_ibumad.c
> ===================================================================
> --- libvendor/osm_vendor_ibumad.c	(revision 4951)
> +++ libvendor/osm_vendor_ibumad.c	(working copy)
> @@ -1142,8 +1142,11 @@ osm_vendor_set_sm(
>  	osm_umad_bind_info_t *p_bind = (osm_umad_bind_info_t *)h_bind;
>  	osm_vendor_t *p_vend = p_bind->p_vend;
>  	char issmstring[24];
> +   static boolean_t osm_vendor_set_sm_indicator = FALSE;
>  
>  	OSM_LOG_ENTER( p_vend->p_log, osm_vendor_set_sm );
> +   if (is_sm_val == FALSE ||  osm_vendor_set_sm_indicator == FALSE)

I may have a comment on this based on the answer to the below.

> +   {
>  	sprintf(issmstring, "/dev/infiniband/issm%d", p_vend->umad_port_id);
>  	if (TRUE == is_sm_val) {
>  		p_vend->issmfd = open(issmstring, 0);
> @@ -1162,6 +1165,15 @@ osm_vendor_set_sm(
>  				" mask failed: errno %d\n", errno);
>  		p_vend->issmfd = -1;
>  	}
> +     if ( osm_vendor_set_sm_indicator == FALSE )
> +       osm_vendor_set_sm_indicator = TRUE;
> +   }
> +   else
> +   {
> +     osm_log(p_vend->p_log, OSM_LOG_ERROR,
> +             "osm_vendor_set_sm: ERR 5436: "
> +             "Trying to set IS_SM capability mask again\n");
> +   }
>  	OSM_LOG_EXIT( p_vend->p_log );
>  }

Does osm_vendor_set_sm_indicator ever need to be reset to FALSE ?

> Index: opensm/osm_node_info_rcv.c
> ===================================================================
> --- opensm/osm_node_info_rcv.c	(revision 4951)
> +++ opensm/osm_node_info_rcv.c	(working copy)
> @@ -229,6 +229,14 @@ __osm_ni_rcv_set_links(
>                osm_dump_dr_path(p_rcv->p_log,
>                                 osm_physp_get_dr_path_ptr(p_physp),
>                                 OSM_LOG_ERROR);
> +
> +            osm_log( p_rcv->p_log, OSM_LOG_SYS,
> +                     "Errors on subnet. Duplicate GUID found "
> +                     "by link from a port to itself. "
> +                     "See osm log for more details\n");
> +
> +            if ( p_rcv->p_subn->opt.exit_on_fatal == TRUE )
> +              exit( 1 );
>            }
>            else
>            {
> 




