Re: [ofa-general] OpenSM Problems/Questions

Sasha Khapyorsky sashak at voltaire.com
Thu Sep 11 13:36:27 PDT 2008


On 12:11 Tue 09 Sep     , Ira Weiny wrote:
> > 
> > > The following problem that is being encountered may also be SA/SM related. A
> > > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > > A ping from those nodes (NodesA-G) to NodeX returns "Destination Host
> > > Unreachable". A ping from NodeX to NodesA-G works.
> > 
> > Sounds like perhaps those nodes were unable to join the broadcast
> > group perhaps due to a rate issue.
> 
> Hal is correct, and saquery is your friend here. If you use "genders" and
> "whatsup" (https://computing.llnl.gov/linux/downloads.html) I have a series of
> tools "Pragmatic InfiniBand Utilities (PIU)"
> (https://computing.llnl.gov/linux/piu.html) which includes a tool called
> "ibnodeinmcast" which can help debug this.  What it does is use saquery [-g|-m]
> to find nodes in the multicast groups.  With the addition of other LLNL tools
> this can be boiled down to which nodes "should" be in the group but are not.
> You are welcome to download that package and adapt it to your environment.
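The "should be in the group but are not" step described above comes down to a set difference. Below is a minimal sketch in C, not the PIU implementation. It assumes the expected node names and the group members reported by saquery have already been parsed into string arrays; the function names in_group and find_missing are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if name appears in members[0..n_members), otherwise 0. */
static int in_group(const char *name, const char *const members[],
		    size_t n_members)
{
	size_t i;

	for (i = 0; i < n_members; i++)
		if (strcmp(name, members[i]) == 0)
			return 1;
	return 0;
}

/*
 * Copy every expected node that is absent from members into missing[].
 * missing[] must have room for n_expected entries.
 * Returns the number of missing nodes.
 */
static size_t find_missing(const char *const expected[], size_t n_expected,
			   const char *const members[], size_t n_members,
			   const char *missing[])
{
	size_t i, n_missing = 0;

	for (i = 0; i < n_expected; i++)
		if (!in_group(expected[i], members, n_members))
			missing[n_missing++] = expected[i];
	return n_missing;
}
```

With expected = { "nodeA", "nodeB", "nodeX" } and members = { "nodeB", "nodeA" }, find_missing stores "nodeX" in missing[0] and returns 1. The linear scan is quadratic overall, which is fine for a sketch; a sorted list or hash table would suit a large fabric better.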

Also, there was your fix (made after OFED 1.3), which is closely related to
unstable links.

Sasha


commit e40NB597af556fce55e3b205b0cc4ffa6805aeaa
Author: Ira Weiny <weiny2 at llnl.gov>
Date:   Thu Apr 24 18:16:57 2008 -0700

    opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
    
    (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
    
    I did not get any output with multicast_debug_level!  But I added some more
    debugging and finally realized that the set was not being sent.  :-(  I put a
    debug statement in OpenSM where the flag was set and therefore thought that
    OpenSM had set the rereg bit.  However, since no other data had changed the
    "set" MAD was not sent.  (I am getting a bit tongue tied reading this back.  I
    hope that all makes sense.)
    
    Here is a patch which fixes the problem.  (At least with the partial sub-nets
    configuration I explained before.)  I will have to verify this fixes the problem
    I originally reported.
    
    Ira
    
    From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
    From: Ira K. Weiny <weiny2 at llnl.gov>
    Date: Thu, 24 Apr 2008 18:05:01 -0700
    Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
    
    Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
    Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index ab23929..4d628d2 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1099,9 +1099,14 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr,
 	if ((p_mgr->p_subn->first_time_master_sweep == TRUE || p_port->is_new)
 	    && !p_mgr->p_subn->opt.no_clients_rereg
 	    && ((p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) !=
-		0))
+		0)) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"Seting client rereg on %s, port %d\n",
+			p_port->p_node->print_desc,
+			p_port->p_physp->port_num);
 		ib_port_info_set_client_rereg(p_pi, 1);
-	else
+		send_set = TRUE;
+	} else
 		ib_port_info_set_client_rereg(p_pi, 0);
 
 	/* We need to send the PortInfo Set request with the new sm_lid
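
The effect of the patch above can be modelled outside OpenSM. This is a simplified sketch, not the real osm_lid_mgr.c logic: struct port_info, prepare_set and the "patched" switch are hypothetical stand-ins. It assumes that the Set MAD is sent only when some compared field differs from the old PortInfo, and that the rereg request bit is not one of the compared fields.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for the PortInfo attribute. */
struct port_info {
	unsigned short base_lid;
	unsigned short sm_lid;
	bool client_rereg;	/* request bit carried in the Set */
};

/*
 * Fill new_pi and report whether a Set MAD needs to be sent.
 * patched == false models the behaviour before the fix: the rereg bit
 * is set in new_pi, but nothing forces the Set to go out.
 * patched == true models the fix: requesting rereg forces the Set.
 */
static bool prepare_set(const struct port_info *old_pi,
			struct port_info *new_pi,
			unsigned short lid, unsigned short sm_lid,
			bool want_rereg, bool patched)
{
	bool send_set = false;

	new_pi->base_lid = lid;
	new_pi->sm_lid = sm_lid;
	/* Only a change in the compared fields triggers a Set by itself. */
	if (old_pi->base_lid != lid || old_pi->sm_lid != sm_lid)
		send_set = true;

	new_pi->client_rereg = want_rereg;
	if (want_rereg && patched)
		send_set = true;

	return send_set;
}
```

With an unchanged LID and SM LID and want_rereg set, the unpatched path returns false even though new_pi.client_rereg is true, so the request never leaves the SM. The patched path returns true for the same inputs.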


