[ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches

Hal Rosenstock hal.rosenstock at gmail.com
Tue Aug 4 09:45:05 PDT 2009


On Tue, Aug 4, 2009 at 11:27 AM, Sasha Khapyorsky <sashak at voltaire.com> wrote:

> Hi,
>
> On 19:28 Thu 30 Jul     , Hal Rosenstock wrote:
> >
> > Currently, MADs are pipelined to a single switch at a time which
> > effectively serializes these requests due to processing at the SMA.
> > This patch pipelines (stripes) them across the switches first before
> > proceeding with successive blocks. As a result of this striping,
> > multiple switches can process the set and respond concurrently
> > which results in an improvement to the subnet initialization time.
>
> The idea is nice. However, I have some initial comments about the
> implementation.
>
> BTW, should there be an option to preserve the current behavior?
> (I don't know, just asking)


I asked this in an earlier email on this thread. It's up to you. I don't see
a need, but if we want to be conservative, it can be added.


>
>
> > This patch also introduces a new config option (max_smps_per_node)
> > which indicates how deep the per node pipeline is (current default is 4).
> > This also has the effect of limiting the number of times that the switch
> > list is traversed. Maybe this embellishment is unnecessary.
>
> Then why is it needed?


Also, as was discussed earlier in this thread, it gives a way to control
possible VL15 overflow.
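
To make that concrete, here is a rough, self-contained sketch of the striping
idea (stand-in types and names, not the actual OpenSM structs; the real loop
is osm_ucast_pipeline_tbl() in the patch below):

#include <stdbool.h>
#include <stddef.h>

struct sw {			/* stand-in for osm_switch_t */
	unsigned next_block;	/* like lft_block_id_ho */
	unsigned num_blocks;
};

/* Stand-in for sending one LFT block Set to a switch; in the patch this is
 * where osm_req_set(IB_MAD_ATTR_LIN_FWD_TBL, ...) is issued, skipping blocks
 * whose contents have not changed unless an update is forced. */
static bool send_next_block(struct sw *s)
{
	if (s->next_block >= s->num_blocks)
		return false;
	s->next_block++;
	return true;
}

/* Each pass queues at most one LFT block Set per switch, so
 * max_smps_per_node bounds the initial per-switch pipeline depth
 * (0 means unlimited).  In the patch, further blocks are then sent
 * from the LFT GetResp handler as responses arrive. */
static void pipeline(struct sw *switches, size_t n, unsigned max_smps_per_node)
{
	bool updated = true;
	unsigned pass;
	size_t i;

	for (pass = 0;
	     updated && (!max_smps_per_node || pass < max_smps_per_node);
	     pass++) {
		updated = false;
		for (i = 0; i < n; i++)	/* stripe: one block per switch */
			updated |= send_next_block(&switches[i]);
	}
}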


>
>
> > All unicast routing protocols are updated for this with the exception
> > of file.
> >
> > A similar subsequent change will do this for MFTs.
> >
> > Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> >
> > A small cluster of 17 IS4 switches and 11 HCAs was used; to artificially
> > inflate the cluster, an LMC of 7 was configured, including enhanced
> > switch port 0 LMC.
> >
> > With the new code, LFT configuration is more than twice as
> > fast as with the old code :)
> > The current ucast manager ran for ~250 msec on average; with the
> > new code, 110-120 msec.
> >
> > The routing calculation phase of the ucast manager took ~1200 usec;
> > the rest was spent sending the blocks and waiting for pending
> > transactions to complete.
> >
> > No noticeable difference between various max_smps_per_node values
> > was observed.
>
> What is the reason?


I think the reason is that max_wire_smps may have kicked in, but Yevgeny is
best placed to elaborate on this.


> And what was the value of 'max_wire_smps'?
>

> Here are some detailed results of different executions (the
> number on the left is the timer value in usec):
>
> Current ucast manager (w/o the optimization):
>
> 000000 [LFT]: osm_ucast_mgr_process() - START
> 001131 [LFT]: ucast_mgr_process_tbl() - START
> 032251 [LFT]: ucast_mgr_process_tbl() - END
> 032263 [LFT]: osm_ucast_mgr_process() - END
> 253416 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=0:
>
> 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node)
> 002690 [LFT]: ucast_mgr_process_tbl() - START
> 032946 [LFT]: ucast_mgr_process_tbl() - END
> 032948 [LFT]: osm_ucast_pipeline_tbl() - START
> 033846 [LFT]: osm_ucast_pipeline_tbl() - END
> 033858 [LFT]: osm_ucast_mgr_process() - END
> 108203 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=1:
>
> 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node)
> 008735 [LFT]: ucast_mgr_process_tbl() - START
> 040071 [LFT]: ucast_mgr_process_tbl() - END
> 040074 [LFT]: osm_ucast_pipeline_tbl() - START
> 040103 [LFT]: osm_ucast_pipeline_tbl() - END
> 040114 [LFT]: osm_ucast_mgr_process() - END
> 120097 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=4:
>
> 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node)
> 005380 [LFT]: ucast_mgr_process_tbl() - START
> 037436 [LFT]: ucast_mgr_process_tbl() - END
> 037439 [LFT]: osm_ucast_pipeline_tbl() - START
> 037495 [LFT]: osm_ucast_pipeline_tbl() - END
> 037506 [LFT]: osm_ucast_mgr_process() - END
> 114983 [LFT]: Done wait_for_pending_transactions()
>
>
> With IS3-based QLogic switches, which do not handle DR packet forwarding
> in HW, on a fabric of ~1100 HCAs and ~280 switches:
>
> The current OSM configures LFTs in ~2 seconds.
> The new algorithm does the same job in 1.4-1.6 seconds (a 20-30% speedup),
> depending on the max_smps_per_node value.
>
> As in the case of the IS4 switches, the shortest config time was obtained
> with max_smps_per_node=0, i.e. an unlimited pipeline.
>
>
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Added Yevgeny's performance data to patch description above
> No change to actual patch
>
> diff --git a/opensm/include/opensm/osm_base.h
b/opensm/include/opensm/osm_base.h
> index 0537002..617e8a9 100644
> --- a/opensm/include/opensm/osm_base.h
> +++ b/opensm/include/opensm/osm_base.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights
reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
>   *
> @@ -449,6 +449,18 @@ BEGIN_C_DECLS
>  */
>  #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
>  /***********/
> +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE
> +* NAME
> +*    OSM_DEFAULT_SMP_MAX_PER_NODE
> +*
> +* DESCRIPTION
> +*    Specifies the default number of VL15 SMP MADs allowed
> +*    per node for certain attributes.
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4
> +/***********/
>  /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
>  * NAME
>  *    OSM_SM_DEFAULT_QP0_RCV_SIZE
> diff --git a/opensm/include/opensm/osm_sm.h
b/opensm/include/opensm/osm_sm.h
> index cc8321d..1776380 100644
> --- a/opensm/include/opensm/osm_sm.h
> +++ b/opensm/include/opensm/osm_sm.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
reserved.
>   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -130,6 +130,7 @@ typedef struct osm_sm {
>       osm_sm_mad_ctrl_t mad_ctrl;
>       osm_lid_mgr_t lid_mgr;
>       osm_ucast_mgr_t ucast_mgr;
> +     boolean_t lfts_updated;

> The name is unclear - actually this means "update in progress".


OK.


>
>
> >       cl_disp_reg_handle_t sweep_fail_disp_h;
> >       cl_disp_reg_handle_t ni_disp_h;
> >       cl_disp_reg_handle_t pi_disp_h;
> > @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm,
> >  *
> >  *********/
> >
> > +/****f* OpenSM: SM/osm_sm_set_next_lft_block
> > +* NAME
> > +*    osm_sm_set_next_lft_block
> > +*
> > +* DESCRIPTION
> > +*    Set the next LFT (LinearForwardingTable) block in the indicated
> switch.
> > +*
> > +* SYNOPSIS
> > +*/
> > +void
> > +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> > +                       IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> > +                       IN osm_madw_context_t *p_context);
>
> Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seems
> like a much more appropriate place for this.


OK.


>
>
> > +/*
> > +* PARAMETERS
> > +*    p_sm
> > +*            [in] Pointer to an osm_sm_t object.
> > +*
> > +*    p_switch
> > +*            [in] Pointer to the switch object.
> > +*
> > +*    p_block
> > +*            [in] Pointer to the forwarding table block.
> > +*
> > +*    p_path
> > +*            [in] Pointer to a directed route path object.
> > +*
> > +*    p_context
> > +*            [in] Mad wrapper context structure to be copied into the
> wrapper
> > +*            context, and thus visible to the recipient of the response.
> > +*
> > +* RETURN VALUES
> > +*    None
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
> > +
> >  /****f* OpenSM: SM/osm_sm_mcgrp_join
> >  * NAME
> >  *    osm_sm_mcgrp_join
> > diff --git a/opensm/include/opensm/osm_subnet.h
> b/opensm/include/opensm/osm_subnet.h
> > index 59a32ad..f12afae 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> >   *
> > @@ -147,6 +147,7 @@ typedef struct osm_subn_opt {
> >       uint32_t sweep_interval;
> >       uint32_t max_wire_smps;
> >       uint32_t transaction_timeout;
> > +     uint32_t max_smps_per_node;
> >       uint8_t sm_priority;
> >       uint8_t lmc;
> >       boolean_t lmc_esp0;
> > diff --git a/opensm/include/opensm/osm_switch.h
> b/opensm/include/opensm/osm_switch.h
> > index 7ce28c5..e12113f 100644
> > --- a/opensm/include/opensm/osm_switch.h
> > +++ b/opensm/include/opensm/osm_switch.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -102,6 +102,7 @@ typedef struct osm_switch {
> >       osm_port_profile_t *p_prof;
> >       uint8_t *lft;
> >       uint8_t *new_lft;
> > +     uint16_t lft_block_id_ho;
> >       osm_mcast_tbl_t mcast_tbl;
> >       unsigned endport_links;
> >       unsigned need_update;
> > diff --git a/opensm/include/opensm/osm_ucast_mgr.h
> b/opensm/include/opensm/osm_ucast_mgr.h
> > index a040476..fdea49a 100644
> > --- a/opensm/include/opensm/osm_ucast_mgr.h
> > +++ b/opensm/include/opensm/osm_ucast_mgr.h
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const
> p_mgr, IN struct osm_sm * sm);
> >  *    osm_ucast_mgr_destroy
> >  *********/
> >
> > -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table
> > +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl
> >  * NAME
> > -*    osm_ucast_mgr_set_fwd_table
> > +*    osm_ucast_pipeline_tbl
> >  *
> >  * DESCRIPTION
> > -*    Setup forwarding table for the switch (from prepared new_lft).
> > +*    The osm_ucast_pipeline_tbl function pipelines the LFT
> > +*    (LinearForwardingTable) sets across the switches
> > +*    (from prepared new_lft).
> >  *
> >  * SYNOPSIS
> >  */
> > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
> > -                             IN osm_switch_t * const p_sw);
> > +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr);
> > +/*
> > +* PARAMETERS
> > +*    p_mgr
> > +*            [in] Pointer to an osm_ucast_mgr_t object.
> > +*
> > +* RETURN VALUES
> > +*    None.
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
> > +
> > +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top
> > +* NAME
> > +*    osm_ucast_mgr_set_fwd_tbl_top
> > +*
> > +* DESCRIPTION
> > +*    Setup LinearFDBTop for the switch.
> > +*
> > +* SYNOPSIS
> > +*/
> > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr,
> > +                               IN osm_switch_t * const p_sw);
>
> I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and
> osm_ucast_pipeline_tbl).


Why not? What's the matter with doing this?


> Why not use a single function and update all routing engines
> appropriately (you need to do that anyway), so that this will only fill
> up the new_lfts table?


I'm not following what you're describing. set_fwd_tbl_top sets LinearFDBTop
whereas pipeline_tbl starts the cascade of LFT sets based on
max_smps_per_node.
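
Schematically, a routing engine now ends up doing the following (this is just
restating the call order already in the diff, e.g. ucast_mgr_build_lfts(),
lash and ftree; not new code):

	/* computes new_lft and calls osm_ucast_mgr_set_fwd_tbl_top()
	 * (SwitchInfo / LinearFDBTop) for each switch */
	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
			   ucast_mgr_process_tbl, p_mgr);

	/* one pipelined pass that stripes the LFT block Sets across all
	 * switches, bounded by max_smps_per_node */
	osm_ucast_pipeline_tbl(p_mgr);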


>
>
> >  /*
> >  * PARAMETERS
> >  *    p_mgr
> > diff --git a/opensm/opensm/osm_lin_fwd_rcv.c
> b/opensm/opensm/osm_lin_fwd_rcv.c
> > index 2edb8d3..cb131b4 100644
> > --- a/opensm/opensm/osm_lin_fwd_rcv.c
> > +++ b/opensm/opensm/osm_lin_fwd_rcv.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -36,7 +36,7 @@
> >  /*
> >   * Abstract:
> >   *    Implementation of osm_lft_rcv_t.
> > - * This object represents the NodeDescription Receiver object.
> > + * This object represents the Linear Forwarding Table Receiver object.
> >   * This object is part of the opensm family of objects.
> >   */
> >
> > @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> >  {
> >       osm_sm_t *sm = context;
> >       osm_madw_t *p_madw = data;
> > +     osm_dr_path_t *p_path;
> >       ib_smp_t *p_smp;
> >       uint32_t block_num;
> >       osm_switch_t *p_sw;
> > @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> >       uint8_t *p_block;
> >       ib_net64_t node_guid;
> >       ib_api_status_t status;
> > +     uint8_t block[IB_SMP_DATA_SIZE];
> > +     osm_madw_context_t mad_context;
> >
> >       CL_ASSERT(sm);
> >
> > @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> >                               "\n\t\t\t\tSwitch 0x%" PRIx64 "\n",
> >                               ib_get_err_str(status),
> cl_ntoh64(node_guid));
> >               }
> > +
> > +             p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > +
> > +             mad_context.lft_context.node_guid = node_guid;
> > +             mad_context.lft_context.set_method = TRUE;
> > +
> > +             osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> > +                                       &mad_context);
> > +
> > +             p_sw->lft_block_id_ho++;
>
> Wouldn't it be simpler to encode block_id in a mad context?


Why simpler? I think it would complicate the receiver code to do that
(assuming max_smps_per_node remains).


>
>
> >       }
> >
> >       CL_PLOCK_RELEASE(sm->p_lock);
> > diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
> > index daa60ff..4e0fd2a 100644
> > --- a/opensm/opensm/osm_sm.c
> > +++ b/opensm/opensm/osm_sm.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> >   *
> > @@ -441,6 +441,45 @@ Exit:
> >
> >  /**********************************************************************
> >   **********************************************************************/
> > +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> > +                            IN uint8_t *p_block, IN osm_dr_path_t
> *p_path,
> > +                            IN osm_madw_context_t *context)
> > +{
> > +     ib_api_status_t status;
> > +
> > +     for (;
> > +          osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho,
> p_block);
> > +          p_sw->lft_block_id_ho++) {
> > +             if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> > +                 !memcmp(p_block,
> > +                         p_sw->new_lft + p_sw->lft_block_id_ho *
> IB_SMP_DATA_SIZE,
> > +                         IB_SMP_DATA_SIZE))
> > +                     continue;
> > +
> > +             p_sm->lfts_updated = 1;
> > +
> > +             OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> > +                     "Writing FT block %u to switch 0x%" PRIx64 "\n",
> > +                     p_sw->lft_block_id_ho,
> > +                     cl_ntoh64(context->lft_context.node_guid));
> > +
> > +             status = osm_req_set(p_sm, p_path,
> > +                                  p_sw->new_lft +
> > +                                  p_sw->lft_block_id_ho *
> IB_SMP_DATA_SIZE,
> > +                                  IB_SMP_DATA_SIZE,
> IB_MAD_ATTR_LIN_FWD_TBL,
> > +                                  cl_hton32(p_sw->lft_block_id_ho),
> > +                                  CL_DISP_MSGID_NONE, context);
> > +
> > +             if (status != IB_SUCCESS)
> > +                     OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: "
> > +                             "Sending linear fwd. tbl. block failed
> (%s)\n",
> > +                             ib_get_err_str(status));
> > +             break;
> > +     }
> > +}
> > +
> > +/**********************************************************************
> > + **********************************************************************/
> >  static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
> >                                      IN osm_mgrp_t * p_mgrp)
> >  {
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index ec15f8a..1964b7f 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
> >   *
> > @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = {
> >       { "m_key_lease_period", OPT_OFFSET(m_key_lease_period),
> opts_parse_net16, NULL, 1 },
> >       { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32,
> NULL, 1 },
> >       { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32,
> NULL, 1 },
> > +     { "max_smps_per_node", OPT_OFFSET(max_smps_per_node),
> opts_parse_uint32, NULL, 1 },
> >       { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 },
> >       { "console_port", OPT_OFFSET(console_port), opts_parse_uint16,
> NULL, 0 },
> >       { "transaction_timeout", OPT_OFFSET(transaction_timeout),
> opts_parse_uint32, NULL, 1 },
> > @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t *
> const p_opt)
> >       p_opt->m_key_lease_period = 0;
> >       p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS;
> >       p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> > +     p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE;
> >       p_opt->console = strdup(OSM_DEFAULT_CONSOLE);
> >       p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT;
> >       p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> > @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN
> osm_subn_opt_t *const p_opts)
> >               "max_wire_smps %u\n\n"
> >               "# The maximum time in [msec] allowed for a transaction to
> complete\n"
> >               "transaction_timeout %u\n\n"
> > +             "# Maximum number of SMPs per node sent in parallel\n"
> > +             "# (0 means unlimited)\n"
> > +             "# Only applies to certain attributes\n"
> > +             "max_smps_per_node %u\n\n"
> >               "# Maximal time in [msec] a message can stay in the
> incoming message queue.\n"
> >               "# If there is more than one message in the queue and the
> last message\n"
> >               "# stayed in the queue more than this value, any SA request
> will be\n"
> > @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN
> osm_subn_opt_t *const p_opts)
> >               "single_thread %s\n\n",
> >               p_opts->max_wire_smps,
> >               p_opts->transaction_timeout,
> > +             p_opts->max_smps_per_node,
> >               p_opts->max_msg_fifo_timeout,
> >               p_opts->single_thread ? "TRUE" : "FALSE");
> >
> > diff --git a/opensm/opensm/osm_ucast_cache.c
> b/opensm/opensm/osm_ucast_cache.c
> > index 216b496..31c930b 100644
> > --- a/opensm/opensm/osm_ucast_cache.c
> > +++ b/opensm/opensm/osm_ucast_cache.c
> > @@ -1,5 +1,5 @@
> >  /*
> > - * Copyright (c) 2008      Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights
> reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> >   * licenses.  You may choose to be licensed under the terms of the GNU
> > @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t *
> p_mgr)
> >                       memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO
> + 1);
> >               }
> >
> > -             osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > +             osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> >       }
> >
> > +     osm_ucast_pipeline_tbl(p_mgr);
> > +
> >       return 0;
> >  }
> >
> > diff --git a/opensm/opensm/osm_ucast_file.c
> b/opensm/opensm/osm_ucast_file.c
> > index 2505c46..099e8ba 100644
> > --- a/opensm/opensm/osm_ucast_file.c
> > +++ b/opensm/opensm/osm_ucast_file.c
> > @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context)
> >                               "routing algorithm\n");
> >               } else if (!strncmp(p, "Unicast lids", 12)) {
> >                       if (p_sw)
> > -                             osm_ucast_mgr_set_fwd_table(&p_osm->sm.
> > -                                                         ucast_mgr,
> p_sw);
> > +                             osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.
> > +                                                           ucast_mgr,
> p_sw);
> >                       q = strstr(p, " guid 0x");
> >                       if (!q) {
> >                               OSM_LOG(&p_osm->log, OSM_LOG_ERROR,
> > @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context)
> >       }
> >
> >       if (p_sw)
> > -             osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> > +             osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> >
> >       fclose(file);
> >       return 0;
>
> I suppose that this breaks the 'file' routing engine (did you test it?) -
> instead of setting up the switch LFTs, this will only update their tops.


At this point, I don't recall.


>
>
> > diff --git a/opensm/opensm/osm_ucast_ftree.c
> b/opensm/opensm/osm_ucast_ftree.c
> > index bde6dbd..d65c685 100644
> > --- a/opensm/opensm/osm_ucast_ftree.c
> > +++ b/opensm/opensm/osm_ucast_ftree.c
> > @@ -2,7 +2,7 @@
> >   * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
> >   * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t *
> const p_map_item,
> >       ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
> >
> >       p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid;
> > -     osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
> > -                                 p_sw->p_osm_sw);
> > +     osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr,
> > +                                   p_sw->p_osm_sw);
> >  }
> >
> >  /***************************************************
> > @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context)
> >       /* for each switch, set its fwd table */
> >       cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void
> *)p_ftree);
> >
> > +     osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr);
> > +
> >       /* write out hca ordering file */
> >       fabric_dump_hca_ordering(p_ftree);
> >
> > diff --git a/opensm/opensm/osm_ucast_lash.c
> b/opensm/opensm/osm_ucast_lash.c
> > index 12b5e34..adf5f6c 100644
> > --- a/opensm/opensm/osm_ucast_lash.c
> > +++ b/opensm/opensm/osm_ucast_lash.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   * Copyright (c) 2007      Simula Research Laboratory. All rights
> reserved.
> >   * Copyright (c) 2007      Silicon Graphics Inc. All rights reserved.
> > @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash)
> >                                       physical_egress_port);
> >                       }
> >               }               /* for */
> > -             osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> > +             osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> >       }
> > +
> > +     osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr);
> > +
> >       OSM_LOG_EXIT(p_log);
> >  }
> >
> > diff --git a/opensm/opensm/osm_ucast_mgr.c
> b/opensm/opensm/osm_ucast_mgr.c
> > index 78a7031..86d1c98 100644
> > --- a/opensm/opensm/osm_ucast_mgr.c
> > +++ b/opensm/opensm/osm_ucast_mgr.c
> > @@ -1,6 +1,6 @@
> >  /*
> >   * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> >   * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> >   *
> >   * This software is available to you under a choice of one of two
> > @@ -315,16 +315,14 @@ Exit:
> >
> >  /**********************************************************************
> >   **********************************************************************/
> > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
> > -                             IN osm_switch_t * p_sw)
> > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr,
> > +                               IN osm_switch_t * p_sw)
> >  {
> >       osm_node_t *p_node;
> >       osm_dr_path_t *p_path;
> >       osm_madw_context_t context;
> >       ib_api_status_t status;
> >       ib_switch_info_t si;
> > -     uint16_t block_id_ho = 0;
> > -     uint8_t block[IB_SMP_DATA_SIZE];
> >       boolean_t set_swinfo_require = FALSE;
> >       uint16_t lin_top;
> >       uint8_t life_state;
> > @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t *
> p_mgr,
> >                               ib_get_err_str(status));
> >       }
> >
> > -     /*
> > -        Send linear forwarding table blocks to the switch
> > -        as long as the switch indicates it has blocks needing
> > -        configuration.
> > -      */
> > -
> > -     context.lft_context.node_guid = osm_node_get_node_guid(p_node);
> > -     context.lft_context.set_method = TRUE;
> > -
> > -     if (!p_sw->new_lft) {
> > -             /* any routing should provide the new_lft */
> > -             CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> > -                       p_mgr->cache_valid && !p_sw->need_update);
> > -             goto Exit;
> > -     }
> > -
> > -     for (block_id_ho = 0;
> > -          osm_switch_get_lft_block(p_sw, block_id_ho, block);
> > -          block_id_ho++) {
> > -             if (!p_sw->need_update && !p_mgr->p_subn->need_update &&
> > -                 !memcmp(block,
> > -                         p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> > -                         IB_SMP_DATA_SIZE))
> > -                     continue;
> > -
> > -             OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > -                     "Writing FT block %u\n", block_id_ho);
> > -
> > -             status = osm_req_set(p_mgr->sm, p_path,
> > -                                  p_sw->new_lft +
> > -                                  block_id_ho * IB_SMP_DATA_SIZE,
> > -                                  sizeof(block),
> IB_MAD_ATTR_LIN_FWD_TBL,
> > -                                  cl_hton32(block_id_ho),
> CL_DISP_MSGID_NONE,
> > -                                  &context);
> > +     p_sw->lft_block_id_ho = 0;
> >
> > -             if (status != IB_SUCCESS)
> > -                     OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> > -                             "Sending linear fwd. tbl. block failed
> (%s)\n",
> > -                             ib_get_err_str(status));
> > -     }
> > -
> > -Exit:
> >       OSM_LOG_EXIT(p_mgr->p_log);
> >       return 0;
> >  }
> > @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
> >               }
> >       }
> >
> > -     osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > +     osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> >
> >       if (p_mgr->p_subn->opt.lmc)
> >               free_ports_priv(p_mgr);
> > @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
> >       OSM_LOG_EXIT(p_mgr->p_log);
> >  }
> >
> > +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw,
> > +                                IN osm_ucast_mgr_t *p_mgr)
> > +{
> > +     osm_dr_path_t *p_path;
> > +     osm_madw_context_t mad_context;
> > +     uint8_t block[IB_SMP_DATA_SIZE];
> > +
> > +     OSM_LOG_ENTER(p_mgr->p_log);
> > +
> > +     CL_ASSERT(p_sw && p_sw->p_node);
> > +
> > +     OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > +             "Processing switch 0x%" PRIx64 "\n",
> > +             cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
> > +
> > +     /*
> > +        Send linear forwarding table blocks to the switch
> > +        as long as the switch indicates it has blocks needing
> > +        configuration.
> > +      */
> > +     if (!p_sw->new_lft) {
> > +             /* any routing should provide the new_lft */
> > +             CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> > +                       p_mgr->cache_valid && !p_sw->need_update);
> > +             goto Exit;
> > +     }
> > +
> > +     p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > +
> > +     mad_context.lft_context.node_guid =
> osm_node_get_node_guid(p_sw->p_node);
> > +     mad_context.lft_context.set_method = TRUE;
> > +
> > +     osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path,
> > +                               &mad_context);
> > +
> > +     p_sw->lft_block_id_ho++;
> > +
> > +Exit:
> > +     OSM_LOG_EXIT(p_mgr->p_log);
> > +}
> > +
> >  /**********************************************************************
> >   **********************************************************************/
> >  static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item,
> > @@ -870,6 +869,28 @@ static void
> sort_ports_by_switch_load(osm_ucast_mgr_t * m)
> >               add_sw_endports_to_order_list(s[i], m);
> >  }
> >
> > +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr)
> > +{
> > +     cl_qmap_t *p_sw_tbl;
> > +     osm_switch_t *p_sw;
> > +     int i;
> > +
> > +     for (i = 0;
> > +          !p_mgr->p_subn->opt.max_smps_per_node ||
> > +          i < p_mgr->p_subn->opt.max_smps_per_node;
> > +          i++) {
> > +             p_mgr->sm->lfts_updated = 0;
> > +             p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
> > +             p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> > +             while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> > +                     ucast_mgr_pipeline_tbl(p_sw, p_mgr);
> > +                     p_sw = (osm_switch_t *)
> cl_qmap_next(&p_sw->map_item);
> > +             }
> > +             if (!p_mgr->sm->lfts_updated)
> > +                     break;
> > +     }
> > +}
>
> Is it possible (for example, in the case of send errors) that "partial" LFT
> block sending will trigger wait_for_pending_transaction() completion?


I don't know. Is this different from the original algorithm in the case of
send errors?

-- Hal


>
>
> Sasha
>
> > +
> >  static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
> >  {
> >       cl_qlist_init(&p_mgr->port_order_list);
> > @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *
> p_mgr)
> >       cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
> ucast_mgr_process_tbl,
> >                          p_mgr);
> >
> > +     osm_ucast_pipeline_tbl(p_mgr);
> > +
> >       cl_qlist_remove_all(&p_mgr->port_order_list);
> >
> >       return 0;
> >