Re: [ofa-general] OpenSM Problems/Questions

Ira Weiny weiny2 at llnl.gov
Thu Sep 11 14:13:01 PDT 2008


On Thu, 11 Sep 2008 23:36:27 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:

> On 12:11 Tue 09 Sep     , Ira Weiny wrote:
> > > 
> > > > The following problem that is being encountered may also be SA/SM related. A
> > > > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > > > A ping from those nodes (NodesA-G) to NodeX returns "Destination Host
> > > > Unreachable". A ping from NodeX to NodesA-G works.
> > > 
> > > Sounds like those nodes were unable to join the broadcast
> > > group, perhaps due to a rate issue.
> > 
> > Hal is correct, and saquery is your friend here. If you use "genders" and
> > "whatsup" (https://computing.llnl.gov/linux/downloads.html), I have a set of
> > tools, "Pragmatic InfiniBand Utilities (PIU)"
> > (https://computing.llnl.gov/linux/piu.html), which includes a tool called
> > "ibnodeinmcast" that can help debug this.  It uses saquery [-g|-m]
> > to find nodes in the multicast groups.  With the addition of other LLNL tools,
> > this can be boiled down to which nodes "should" be in the group but are not.
> > You are welcome to download that package and adapt it to your environment.
> 
> Also there was your fix (after OFED 1.3), which is closely related to
> unstable links.

True, but as I understood it, this is happening right after boot.  Is that correct?
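
The "should be in the group but are not" check mentioned in the quoted text
above comes down to a set difference between the nodes you expect and the
nodes reported as group members.  Here is a minimal sketch of that idea in C.
It is not taken from PIU or ibnodeinmcast: the function names and the node
names used with it are hypothetical, and it assumes you have already
collected both lists as plain strings (for example from your node database
and from saquery output).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if name appears in list[0..n), else 0. */
static int in_list(const char *name, const char **list, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(name, list[i]) == 0)
			return 1;
	return 0;
}

/*
 * Hypothetical helper: store in missing[] every entry of expected[] that
 * is absent from members[].  missing[] must have room for n_expected
 * pointers.  Returns the number of entries stored.
 */
static size_t find_missing(const char **expected, size_t n_expected,
			   const char **members, size_t n_members,
			   const char **missing)
{
	size_t i, count = 0;

	assert(expected != NULL && missing != NULL);
	assert(members != NULL || n_members == 0);

	for (i = 0; i < n_expected; i++)
		if (!in_list(expected[i], members, n_members))
			missing[count++] = expected[i];
	return count;
}
```

Calling find_missing() with the expected list { "nodeA", "nodeB", "nodeX" }
and the member list { "nodeB", "nodeA" } stores "nodeX" in missing[] and
returns 1.  The linear scan is quadratic overall, which should be fine for a
one-off debugging pass; sort the lists first if the fabric is large.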

Ira

> 
> Sasha
> 
> 
> commit e40NB597af556fce55e3b205b0cc4ffa6805aeaa
> Author: Ira Weiny <weiny2 at llnl.gov>
> Date:   Thu Apr 24 18:16:57 2008 -0700
> 
>     opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
>     
>     (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
>     
>     I did not get any output with multicast_debug_level!  But I added some more
>     debugging and finally realized that the set was not being sent.  :-(  I put a
>     debug statement in OpenSM where the flag was set and therefore thought that
>     OpenSM had set the rereg bit.  However, since no other data had changed the
>     "set" MAD was not sent.  (I am getting a bit tongue tied reading this back.  I
>     hope that all makes sense.)
>     
>     Here is a patch which fixes the problem.  (At least with the partial sub-nets
>     configuration I explained before.)  I will have to verify this fixes the problem
>     I originally reported.
>     
>     Ira
>     
>     From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
>     From: Ira K. Weiny <weiny2 at llnl.gov>
>     Date: Thu, 24 Apr 2008 18:05:01 -0700
>     Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
>     
>     Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
>     Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> index ab23929..4d628d2 100644
> --- a/opensm/opensm/osm_lid_mgr.c
> +++ b/opensm/opensm/osm_lid_mgr.c
> @@ -1099,9 +1099,14 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr,
>  	if ((p_mgr->p_subn->first_time_master_sweep == TRUE || p_port->is_new)
>  	    && !p_mgr->p_subn->opt.no_clients_rereg
>  	    && ((p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) !=
> -		0))
> +		0)) {
> +		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> +			"Seting client rereg on %s, port %d\n",
> +			p_port->p_node->print_desc,
> +			p_port->p_physp->port_num);
>  		ib_port_info_set_client_rereg(p_pi, 1);
> -	else
> +		send_set = TRUE;
> +	} else
>  		ib_port_info_set_client_rereg(p_pi, 0);
>  
>  	/* We need to send the PortInfo Set request with the new sm_lid
> 
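
The effect of the quoted patch can be modelled in a few lines.  This is a
simplified sketch and not OpenSM code: the struct, its field names, and the
apply_fix switch are all hypothetical stand-ins for the conditions visible in
the diff.  It only illustrates why setting the rereg bit had no effect when
nothing else in PortInfo had changed, and how forcing send_set changes that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the inputs used in the diff. */
struct port_state {
	bool first_time_master_sweep;	/* first sweep as master SM */
	bool is_new;			/* port newly discovered */
	bool no_clients_rereg;		/* rereg disabled by option */
	bool has_client_rereg_cap;	/* port advertises the capability */
	bool other_fields_changed;	/* any other PortInfo change pending */
};

struct set_decision {
	bool client_rereg;	/* rereg bit placed in the PortInfo payload */
	bool send_set;		/* whether the Set MAD actually goes out */
};

/* apply_fix == false models the old behaviour, true the patched one. */
static struct set_decision decide_set(const struct port_state *p,
				      bool apply_fix)
{
	struct set_decision d;

	assert(p != NULL);
	d.client_rereg = false;
	d.send_set = p->other_fields_changed;

	if ((p->first_time_master_sweep || p->is_new)
	    && !p->no_clients_rereg && p->has_client_rereg_cap) {
		d.client_rereg = true;
		if (apply_fix)
			d.send_set = true;	/* the one-line fix */
	}
	return d;
}
```

For a new, rereg-capable port with no other pending changes, the old path
yields client_rereg set but send_set clear, so the bit never reaches the
port; the patched path sets both.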


