Re: [ofa-general] OpenSM Problems/Questions
Ira Weiny
weiny2 at llnl.gov
Thu Sep 11 14:13:01 PDT 2008
On Thu, 11 Sep 2008 23:36:27 +0300
Sasha Khapyorsky <sashak at voltaire.com> wrote:
> On 12:11 Tue 09 Sep , Ira Weiny wrote:
> > >
> > > > The following problem that is being encountered may also be SA/SM related. A
> > > > node (NodeX) may be seen (through IPoIB) by all but a few nodes (NodesA-G).
> > > > A ping from those nodes (NodesA-G) to NodeX returns "Destination Host
> > > > Unreachable". A ping from NodeX to NodesA-G works.
> > >
> > > Sounds like those nodes were unable to join the broadcast
> > > group, perhaps due to a rate issue.
> >
> > Hal is correct, and saquery is your friend here. If you use "genders" and
> > "whatsup" (https://computing.llnl.gov/linux/downloads.html) I have a series of
> > tools "Pragmatic InfiniBand Utilities (PIU)"
> > (https://computing.llnl.gov/linux/piu.html) which includes a tool called
> > "ibnodeinmcast" which can help debug this. What it does is use saquery [-g|-m]
> > to find nodes in the multicast groups. With the addition of other LLNL tools
> > this can be boiled down to which nodes "should" be in the group but are not.
> > You are welcome to download that package and adapt it to your environment.
>
> Also there was your fix (after OFED 1.3) which is closely related to
> unstable links.
True, but as I understood it, this is happening right after boot. Is that correct?
Ira
>
> Sasha
>
>
> commit e40NB597af556fce55e3b205b0cc4ffa6805aeaa
> Author: Ira Weiny <weiny2 at llnl.gov>
> Date: Thu Apr 24 18:16:57 2008 -0700
>
> opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
>
> (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)
>
> I did not get any output with multicast_debug_level! But I added some more
> debugging and finally realized that the set was not being sent. :-( I put a
> debug statement in OpenSM where the flag was set and therefore thought that
> OpenSM had set the rereg bit. However, since no other data had changed the
> "set" MAD was not sent. (I am getting a bit tongue tied reading this back. I
> hope that all makes sense.)
>
> Here is a patch which fixes the problem. (At least with the partial sub-nets
> configuration I explained before.) I will have to verify this fixes the problem
> I originally reported.
>
> Ira
>
> From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny <weiny2 at llnl.gov>
> Date: Thu, 24 Apr 2008 18:05:01 -0700
> Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit
>
> Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>
> diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
> index ab23929..4d628d2 100644
> --- a/opensm/opensm/osm_lid_mgr.c
> +++ b/opensm/opensm/osm_lid_mgr.c
> @@ -1099,9 +1099,14 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr,
>          if ((p_mgr->p_subn->first_time_master_sweep == TRUE || p_port->is_new)
>              && !p_mgr->p_subn->opt.no_clients_rereg
>              && ((p_old_pi->capability_mask & IB_PORT_CAP_HAS_CLIENT_REREG) !=
> -                0))
> +                0)) {
> +                OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> +                        "Seting client rereg on %s, port %d\n",
> +                        p_port->p_node->print_desc,
> +                        p_port->p_physp->port_num);
>                  ib_port_info_set_client_rereg(p_pi, 1);
> -        else
> +                send_set = TRUE;
> +        } else
>                  ib_port_info_set_client_rereg(p_pi, 0);
>
>          /* We need to send the PortInfo Set request with the new sm_lid
>
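Editorial note: the behaviour the patch above changes can be illustrated with a small stand-alone model. This is only a sketch and is not OpenSM code. The struct `port_info_model`, the function `prepare_set`, and the `patched` switch are hypothetical names invented for illustration, and the single LID comparison stands in for whatever field comparisons the real function performs. It shows why setting the rereg bit had no effect when no other PortInfo data had changed: the bit was written into the outgoing attribute, but the flag that triggers the Set was never raised.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a PortInfo attribute. */
struct port_info_model {
    unsigned lid;
    bool client_rereg;
};

/*
 * Hypothetical model of the decision made while preparing a PortInfo Set.
 * Returns true if a Set MAD would be sent.
 *   want_rereg: the sweep wants the client to re-register
 *   patched:    true models the behaviour after the patch above
 */
static bool prepare_set(const struct port_info_model *old_pi,
                        unsigned new_lid, bool want_rereg, bool patched,
                        struct port_info_model *out)
{
    bool send_set = false;

    out->lid = new_lid;
    if (out->lid != old_pi->lid)
        send_set = true;        /* other data changed, so a Set is needed */

    if (want_rereg) {
        out->client_rereg = true;
        if (patched)
            send_set = true;    /* the fix: the rereg bit alone forces a Set */
    } else {
        out->client_rereg = false;
    }

    return send_set;
}
```

Calling `prepare_set` with an unchanged LID and `want_rereg` set returns false when `patched` is false (the bit is set in `out` but nothing would be sent) and true when `patched` is true, which matches the symptom described in the commit message.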