[ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches
Hal Rosenstock
hal.rosenstock at gmail.com
Tue Aug 4 09:45:05 PDT 2009
On Tue, Aug 4, 2009 at 11:27 AM, Sasha Khapyorsky <sashak at voltaire.com>wrote:
> Hi,
>
> On 19:28 Thu 30 Jul , Hal Rosenstock wrote:
> >
> > Currently, MADs are pipelined to a single switch at a time which
> > effectively serializes these requests due to processing at the SMA.
> > This patch pipelines (stripes) them across the switches first before
> > proceeding with successive blocks. As a result of this striping,
> > multiple switches can process the set and respond concurrently
> > which results in an improvement to the subnet initialization time.
>
> The idea is nice. However I have some initial comments about an
> implementation.
>
> BTW should there be a reason for an option to preserve the current
> behavior? (I don't know, just asking)
I asked this in an email on the thread on this. It's up to you. I don't see
a need but if we want to be conservative, it can be added.
>
>
> > This patch also introduces a new config option (max_smps_per_node)
> > which indicates how deep the per node pipeline is (current default is 4).
> > This also has the effect of limiting the number of times that the switch
> > list is traversed. Maybe this embellishment is unnecessary.
>
> Then why is it needed?
Also, as was discussed in the thread on this, it gives a way to control
possible VL15 overflow.
>
>
> > All unicast routing protocols are updated for this with the exception
> > of file.
> >
> > A similar subsequent change will do this for MFTs.
> >
> > Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
> >
> > With a small cluster of 17 IS4 switches and 11 HCAs and
> > to artificially increase the cluster, LMC of 7 was used
> > including EnhancedSwitchPort 0 LMC.
> >
> > With the new code, LFT configuration is more than twice as
> > fast as with the old code :)
> > Current ucast manager ran on avarage for ~250msec, with the
> > new code - 110-120msec.
> >
> > Routing calculation phase of the ucast manager took ~1200 usec,
> > the rest was sending the blocks and waiting for no more pending
> > transactions.
> >
> > No noticeable difference between various max_smps_per_node values
> > was observed.
>
> What is the reason?
I think the reason was max_wire_smps may have kicked in but Yevgeny is best
to elaborate on this.
> And what was value of 'max_wire_smps'?
>
> Here are some detailed results of different executions (the
> number on the left is timer value in usec):
>
> Current ucast manager (w/o the optimization):
>
> 000000 [LFT]: osm_ucast_mgr_process() - START
> 001131 [LFT]: ucast_mgr_process_tbl() - START
> 032251 [LFT]: ucast_mgr_process_tbl() - END
> 032263 [LFT]: osm_ucast_mgr_process() - END
> 253416 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=0:
>
> 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node)
> 002690 [LFT]: ucast_mgr_process_tbl() - START
> 032946 [LFT]: ucast_mgr_process_tbl() - END
> 032948 [LFT]: osm_ucast_pipeline_tbl() - START
> 033846 [LFT]: osm_ucast_pipeline_tbl() - END
> 033858 [LFT]: osm_ucast_mgr_process() - END
> 108203 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=1:
>
> 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node)
> 008735 [LFT]: ucast_mgr_process_tbl() - START
> 040071 [LFT]: ucast_mgr_process_tbl() - END
> 040074 [LFT]: osm_ucast_pipeline_tbl() - START
> 040103 [LFT]: osm_ucast_pipeline_tbl() - END
> 040114 [LFT]: osm_ucast_mgr_process() - END
> 120097 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=4:
>
> 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node)
> 005380 [LFT]: ucast_mgr_process_tbl() - START
> 037436 [LFT]: ucast_mgr_process_tbl() - END
> 037439 [LFT]: osm_ucast_pipeline_tbl() - START
> 037495 [LFT]: osm_ucast_pipeline_tbl() - END
> 037506 [LFT]: osm_ucast_mgr_process() - END
> 114983 [LFT]: Done wait_for_pending_transactions()
>
>
> With IS3 based Qlogic switches, which do not handle DR packets forwarding
> in HW, with a fabric of ~1100 HCAs, ~280 switches:
>
> Current OSM configures LFTs in ~2 seconds.
> New algorithm does the same job in 1.4-1.6 seconds (30%-20% speed up),
> depending on the max_smps_per_node value.
>
> As in case of IS4 switches, the shortest config time was obtained with
> max_smps_per_node=0, which is unlimited pipeline.
>
>
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Added Yevgeny's performance data to patch description above
> No change to actual patch
>
> diff --git a/opensm/include/opensm/osm_base.h
b/opensm/include/opensm/osm_base.h
> index 0537002..617e8a9 100644
> --- a/opensm/include/opensm/osm_base.h
> +++ b/opensm/include/opensm/osm_base.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights
reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
> *
> @@ -449,6 +449,18 @@ BEGIN_C_DECLS
> */
> #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
> /***********/
> +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE
> +* NAME
> +* OSM_DEFAULT_SMP_MAX_PER_NODE
> +*
> +* DESCRIPTION
> +* Specifies the default number of VL15 SMP MADs allowed
> +* per node for certain attributes.
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4
> +/***********/
> /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
> * NAME
> * OSM_SM_DEFAULT_QP0_RCV_SIZE
> diff --git a/opensm/include/opensm/osm_sm.h
b/opensm/include/opensm/osm_sm.h
> index cc8321d..1776380 100644
> --- a/opensm/include/opensm/osm_sm.h
> +++ b/opensm/include/opensm/osm_sm.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -130,6 +130,7 @@ typedef struct osm_sm {
> osm_sm_mad_ctrl_t mad_ctrl;
> osm_lid_mgr_t lid_mgr;
> osm_ucast_mgr_t ucast_mgr;
> + boolean_t lfts_updated;
The name is unclear - actually this means "update in progress".
OK.
>
>
> > cl_disp_reg_handle_t sweep_fail_disp_h;
> > cl_disp_reg_handle_t ni_disp_h;
> > cl_disp_reg_handle_t pi_disp_h;
> > @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm,
> > *
> > *********/
> >
> > +/****f* OpenSM: SM/osm_sm_set_next_lft_block
> > +* NAME
> > +* osm_sm_set_next_lft_block
> > +*
> > +* DESCRIPTION
> > +* Set the next LFT (LinearForwardingTable) block in the indicated
> switch.
> > +*
> > +* SYNOPSIS
> > +*/
> > +void
> > +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> > + IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> > + IN osm_madw_context_t *p_context);
>
> Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seem
> much more appropriate place for this.
OK.
>
>
> > +/*
> > +* PARAMETERS
> > +* p_sm
> > +* [in] Pointer to an osm_sm_t object.
> > +*
> > +* p_switch
> > +* [in] Pointer to the switch object.
> > +*
> > +* p_block
> > +* [in] Pointer to the forwarding table block.
> > +*
> > +* p_path
> > +* [in] Pointer to a directed route path object.
> > +*
> > +* p_context
> > +* [in] Mad wrapper context structure to be copied into the
> wrapper
> > +* context, and thus visible to the recipient of the response.
> > +*
> > +* RETURN VALUES
> > +* None
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
> > +
> > /****f* OpenSM: SM/osm_sm_mcgrp_join
> > * NAME
> > * osm_sm_mcgrp_join
> > diff --git a/opensm/include/opensm/osm_subnet.h
> b/opensm/include/opensm/osm_subnet.h
> > index 59a32ad..f12afae 100644
> > --- a/opensm/include/opensm/osm_subnet.h
> > +++ b/opensm/include/opensm/osm_subnet.h
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
> > *
> > @@ -147,6 +147,7 @@ typedef struct osm_subn_opt {
> > uint32_t sweep_interval;
> > uint32_t max_wire_smps;
> > uint32_t transaction_timeout;
> > + uint32_t max_smps_per_node;
> > uint8_t sm_priority;
> > uint8_t lmc;
> > boolean_t lmc_esp0;
> > diff --git a/opensm/include/opensm/osm_switch.h
> b/opensm/include/opensm/osm_switch.h
> > index 7ce28c5..e12113f 100644
> > --- a/opensm/include/opensm/osm_switch.h
> > +++ b/opensm/include/opensm/osm_switch.h
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > *
> > * This software is available to you under a choice of one of two
> > @@ -102,6 +102,7 @@ typedef struct osm_switch {
> > osm_port_profile_t *p_prof;
> > uint8_t *lft;
> > uint8_t *new_lft;
> > + uint16_t lft_block_id_ho;
> > osm_mcast_tbl_t mcast_tbl;
> > unsigned endport_links;
> > unsigned need_update;
> > diff --git a/opensm/include/opensm/osm_ucast_mgr.h
> b/opensm/include/opensm/osm_ucast_mgr.h
> > index a040476..fdea49a 100644
> > --- a/opensm/include/opensm/osm_ucast_mgr.h
> > +++ b/opensm/include/opensm/osm_ucast_mgr.h
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > *
> > * This software is available to you under a choice of one of two
> > @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const
> p_mgr, IN struct osm_sm * sm);
> > * osm_ucast_mgr_destroy
> > *********/
> >
> > -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table
> > +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl
> > * NAME
> > -* osm_ucast_mgr_set_fwd_table
> > +* osm_ucast_pipeline_tbl
> > *
> > * DESCRIPTION
> > -* Setup forwarding table for the switch (from prepared new_lft).
> > +* The osm_ucast_pipeline_tbl function pipelines the LFT
> > +* (LinearForwardingTable) sets across the switches
> > +* (from prepared new_lft).
> > *
> > * SYNOPSIS
> > */
> > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
> > - IN osm_switch_t * const p_sw);
> > +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr);
> > +/*
> > +* PARAMETERS
> > +* p_mgr
> > +* [in] Pointer to an osm_ucast_mgr_t object.
> > +*
> > +* RETURN VALUES
> > +* None.
> > +*
> > +* NOTES
> > +*
> > +* SEE ALSO
> > +*********/
> > +
> > +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top
> > +* NAME
> > +* osm_ucast_mgr_set_fwd_tbl_top
> > +*
> > +* DESCRIPTION
> > +* Setup LinearFDBTop for the switch.
> > +*
> > +* SYNOPSIS
> > +*/
> > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr,
> > + IN osm_switch_t * const p_sw);
>
> I don't really like such separation (osm_ucast_mgr_set_fwd_tbl_top and
> osm_ucast_pipeline_tbl).
Why not ? What's the matter with doing this ?
> Why to not use a single function and update all
> routing engines appropriately (you need to do it anyway), so that this
> will only fill up new_lfts table?
I'm not following what you're describing. set_fwd_tbl_top sets LinearFDBTop
whereas pipeline_tbl starts the cascade of LFT sets based on
max_smps_per_node.
>
>
> > /*
> > * PARAMETERS
> > * p_mgr
> > diff --git a/opensm/opensm/osm_lin_fwd_rcv.c
> b/opensm/opensm/osm_lin_fwd_rcv.c
> > index 2edb8d3..cb131b4 100644
> > --- a/opensm/opensm/osm_lin_fwd_rcv.c
> > +++ b/opensm/opensm/osm_lin_fwd_rcv.c
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > *
> > * This software is available to you under a choice of one of two
> > @@ -36,7 +36,7 @@
> > /*
> > * Abstract:
> > * Implementation of osm_lft_rcv_t.
> > - * This object represents the NodeDescription Receiver object.
> > + * This object represents the Linear Forwarding Table Receiver object.
> > * This object is part of the opensm family of objects.
> > */
> >
> > @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> > {
> > osm_sm_t *sm = context;
> > osm_madw_t *p_madw = data;
> > + osm_dr_path_t *p_path;
> > ib_smp_t *p_smp;
> > uint32_t block_num;
> > osm_switch_t *p_sw;
> > @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> > uint8_t *p_block;
> > ib_net64_t node_guid;
> > ib_api_status_t status;
> > + uint8_t block[IB_SMP_DATA_SIZE];
> > + osm_madw_context_t mad_context;
> >
> > CL_ASSERT(sm);
> >
> > @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void
> *data)
> > "\n\t\t\t\tSwitch 0x%" PRIx64 "\n",
> > ib_get_err_str(status),
> cl_ntoh64(node_guid));
> > }
> > +
> > + p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > +
> > + mad_context.lft_context.node_guid = node_guid;
> > + mad_context.lft_context.set_method = TRUE;
> > +
> > + osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> > + &mad_context);
> > +
> > + p_sw->lft_block_id_ho++;
>
> Wouldn't it be simpler to encode block_id in a mad context?
Why simpler ? I think it complicates the receiver code to do that (assuming
max_smps_per_node remains).
>
>
> > }
> >
> > CL_PLOCK_RELEASE(sm->p_lock);
> > diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
> > index daa60ff..4e0fd2a 100644
> > --- a/opensm/opensm/osm_sm.c
> > +++ b/opensm/opensm/osm_sm.c
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
> > *
> > @@ -441,6 +441,45 @@ Exit:
> >
> > /**********************************************************************
> > **********************************************************************/
> > +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> > + IN uint8_t *p_block, IN osm_dr_path_t
> *p_path,
> > + IN osm_madw_context_t *context)
> > +{
> > + ib_api_status_t status;
> > +
> > + for (;
> > + osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho,
> p_block);
> > + p_sw->lft_block_id_ho++) {
> > + if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> > + !memcmp(p_block,
> > + p_sw->new_lft + p_sw->lft_block_id_ho *
> IB_SMP_DATA_SIZE,
> > + IB_SMP_DATA_SIZE))
> > + continue;
> > +
> > + p_sm->lfts_updated = 1;
> > +
> > + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> > + "Writing FT block %u to switch 0x%" PRIx64 "\n",
> > + p_sw->lft_block_id_ho,
> > + cl_ntoh64(context->lft_context.node_guid));
> > +
> > + status = osm_req_set(p_sm, p_path,
> > + p_sw->new_lft +
> > + p_sw->lft_block_id_ho *
> IB_SMP_DATA_SIZE,
> > + IB_SMP_DATA_SIZE,
> IB_MAD_ATTR_LIN_FWD_TBL,
> > + cl_hton32(p_sw->lft_block_id_ho),
> > + CL_DISP_MSGID_NONE, context);
> > +
> > + if (status != IB_SUCCESS)
> > + OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: "
> > + "Sending linear fwd. tbl. block failed
> (%s)\n",
> > + ib_get_err_str(status));
> > + break;
> > + }
> > +}
> > +
> > +/**********************************************************************
> > + **********************************************************************/
> > static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
> > IN osm_mgrp_t * p_mgrp)
> > {
> > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> > index ec15f8a..1964b7f 100644
> > --- a/opensm/opensm/osm_subnet.c
> > +++ b/opensm/opensm/osm_subnet.c
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
> > *
> > @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = {
> > { "m_key_lease_period", OPT_OFFSET(m_key_lease_period),
> opts_parse_net16, NULL, 1 },
> > { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32,
> NULL, 1 },
> > { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32,
> NULL, 1 },
> > + { "max_smps_per_node", OPT_OFFSET(max_smps_per_node),
> opts_parse_uint32, NULL, 1 },
> > { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 },
> > { "console_port", OPT_OFFSET(console_port), opts_parse_uint16,
> NULL, 0 },
> > { "transaction_timeout", OPT_OFFSET(transaction_timeout),
> opts_parse_uint32, NULL, 1 },
> > @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t *
> const p_opt)
> > p_opt->m_key_lease_period = 0;
> > p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS;
> > p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> > + p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE;
> > p_opt->console = strdup(OSM_DEFAULT_CONSOLE);
> > p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT;
> > p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> > @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN
> osm_subn_opt_t *const p_opts)
> > "max_wire_smps %u\n\n"
> > "# The maximum time in [msec] allowed for a transaction to
> complete\n"
> > "transaction_timeout %u\n\n"
> > + "# Maximum number of SMPs per node sent in parallel\n"
> > + "# (0 means unlimited)\n"
> > + "# Only applies to certain attributes\n"
> > + "max_smps_per_node %u\n\n"
> > "# Maximal time in [msec] a message can stay in the
> incoming message queue.\n"
> > "# If there is more than one message in the queue and the
> last message\n"
> > "# stayed in the queue more than this value, any SA request
> will be\n"
> > @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN
> osm_subn_opt_t *const p_opts)
> > "single_thread %s\n\n",
> > p_opts->max_wire_smps,
> > p_opts->transaction_timeout,
> > + p_opts->max_smps_per_node,
> > p_opts->max_msg_fifo_timeout,
> > p_opts->single_thread ? "TRUE" : "FALSE");
> >
> > diff --git a/opensm/opensm/osm_ucast_cache.c
> b/opensm/opensm/osm_ucast_cache.c
> > index 216b496..31c930b 100644
> > --- a/opensm/opensm/osm_ucast_cache.c
> > +++ b/opensm/opensm/osm_ucast_cache.c
> > @@ -1,5 +1,5 @@
> > /*
> > - * Copyright (c) 2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights
> reserved.
> > *
> > * This software is available to you under a choice of one of two
> > * licenses. You may choose to be licensed under the terms of the GNU
> > @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t *
> p_mgr)
> > memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO
> + 1);
> > }
> >
> > - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> > }
> >
> > + osm_ucast_pipeline_tbl(p_mgr);
> > +
> > return 0;
> > }
> >
> > diff --git a/opensm/opensm/osm_ucast_file.c
> b/opensm/opensm/osm_ucast_file.c
> > index 2505c46..099e8ba 100644
> > --- a/opensm/opensm/osm_ucast_file.c
> > +++ b/opensm/opensm/osm_ucast_file.c
> > @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context)
> > "routing algorithm\n");
> > } else if (!strncmp(p, "Unicast lids", 12)) {
> > if (p_sw)
> > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.
> > - ucast_mgr,
> p_sw);
> > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.
> > + ucast_mgr,
> p_sw);
> > q = strstr(p, " guid 0x");
> > if (!q) {
> > OSM_LOG(&p_osm->log, OSM_LOG_ERROR,
> > @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context)
> > }
> >
> > if (p_sw)
> > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> >
> > fclose(file);
> > return 0;
>
> I suppose that this breaks 'file' routing engine (did you test it?) -
> instead of switch LFTs setup this will only update its TOPs.
At this point, I don't recall.
>
>
> > diff --git a/opensm/opensm/osm_ucast_ftree.c
> b/opensm/opensm/osm_ucast_ftree.c
> > index bde6dbd..d65c685 100644
> > --- a/opensm/opensm/osm_ucast_ftree.c
> > +++ b/opensm/opensm/osm_ucast_ftree.c
> > @@ -2,7 +2,7 @@
> > * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
> > * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > *
> > * This software is available to you under a choice of one of two
> > @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t *
> const p_map_item,
> > ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
> >
> > p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid;
> > - osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
> > - p_sw->p_osm_sw);
> > + osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr,
> > + p_sw->p_osm_sw);
> > }
> >
> > /***************************************************
> > @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context)
> > /* for each switch, set its fwd table */
> > cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void
> *)p_ftree);
> >
> > + osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr);
> > +
> > /* write out hca ordering file */
> > fabric_dump_hca_ordering(p_ftree);
> >
> > diff --git a/opensm/opensm/osm_ucast_lash.c
> b/opensm/opensm/osm_ucast_lash.c
> > index 12b5e34..adf5f6c 100644
> > --- a/opensm/opensm/osm_ucast_lash.c
> > +++ b/opensm/opensm/osm_ucast_lash.c
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > * Copyright (c) 2007 Simula Research Laboratory. All rights
> reserved.
> > * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved.
> > @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash)
> > physical_egress_port);
> > }
> > } /* for */
> > - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> > + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> > }
> > +
> > + osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr);
> > +
> > OSM_LOG_EXIT(p_log);
> > }
> >
> > diff --git a/opensm/opensm/osm_ucast_mgr.c
> b/opensm/opensm/osm_ucast_mgr.c
> > index 78a7031..86d1c98 100644
> > --- a/opensm/opensm/osm_ucast_mgr.c
> > +++ b/opensm/opensm/osm_ucast_mgr.c
> > @@ -1,6 +1,6 @@
> > /*
> > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> > - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights
> reserved.
> > + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights
> reserved.
> > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> > *
> > * This software is available to you under a choice of one of two
> > @@ -315,16 +315,14 @@ Exit:
> >
> > /**********************************************************************
> > **********************************************************************/
> > -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
> > - IN osm_switch_t * p_sw)
> > +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr,
> > + IN osm_switch_t * p_sw)
> > {
> > osm_node_t *p_node;
> > osm_dr_path_t *p_path;
> > osm_madw_context_t context;
> > ib_api_status_t status;
> > ib_switch_info_t si;
> > - uint16_t block_id_ho = 0;
> > - uint8_t block[IB_SMP_DATA_SIZE];
> > boolean_t set_swinfo_require = FALSE;
> > uint16_t lin_top;
> > uint8_t life_state;
> > @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t *
> p_mgr,
> > ib_get_err_str(status));
> > }
> >
> > - /*
> > - Send linear forwarding table blocks to the switch
> > - as long as the switch indicates it has blocks needing
> > - configuration.
> > - */
> > -
> > - context.lft_context.node_guid = osm_node_get_node_guid(p_node);
> > - context.lft_context.set_method = TRUE;
> > -
> > - if (!p_sw->new_lft) {
> > - /* any routing should provide the new_lft */
> > - CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> > - p_mgr->cache_valid && !p_sw->need_update);
> > - goto Exit;
> > - }
> > -
> > - for (block_id_ho = 0;
> > - osm_switch_get_lft_block(p_sw, block_id_ho, block);
> > - block_id_ho++) {
> > - if (!p_sw->need_update && !p_mgr->p_subn->need_update &&
> > - !memcmp(block,
> > - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> > - IB_SMP_DATA_SIZE))
> > - continue;
> > -
> > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > - "Writing FT block %u\n", block_id_ho);
> > -
> > - status = osm_req_set(p_mgr->sm, p_path,
> > - p_sw->new_lft +
> > - block_id_ho * IB_SMP_DATA_SIZE,
> > - sizeof(block),
> IB_MAD_ATTR_LIN_FWD_TBL,
> > - cl_hton32(block_id_ho),
> CL_DISP_MSGID_NONE,
> > - &context);
> > + p_sw->lft_block_id_ho = 0;
> >
> > - if (status != IB_SUCCESS)
> > - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> > - "Sending linear fwd. tbl. block failed
> (%s)\n",
> > - ib_get_err_str(status));
> > - }
> > -
> > -Exit:
> > OSM_LOG_EXIT(p_mgr->p_log);
> > return 0;
> > }
> > @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
> > }
> > }
> >
> > - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> > + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> >
> > if (p_mgr->p_subn->opt.lmc)
> > free_ports_priv(p_mgr);
> > @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t *
> p_map_item,
> > OSM_LOG_EXIT(p_mgr->p_log);
> > }
> >
> > +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw,
> > + IN osm_ucast_mgr_t *p_mgr)
> > +{
> > + osm_dr_path_t *p_path;
> > + osm_madw_context_t mad_context;
> > + uint8_t block[IB_SMP_DATA_SIZE];
> > +
> > + OSM_LOG_ENTER(p_mgr->p_log);
> > +
> > + CL_ASSERT(p_sw && p_sw->p_node);
> > +
> > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> > + "Processing switch 0x%" PRIx64 "\n",
> > + cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
> > +
> > + /*
> > + Send linear forwarding table blocks to the switch
> > + as long as the switch indicates it has blocks needing
> > + configuration.
> > + */
> > + if (!p_sw->new_lft) {
> > + /* any routing should provide the new_lft */
> > + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> > + p_mgr->cache_valid && !p_sw->need_update);
> > + goto Exit;
> > + }
> > +
> > + p_path =
> osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> > +
> > + mad_context.lft_context.node_guid =
> osm_node_get_node_guid(p_sw->p_node);
> > + mad_context.lft_context.set_method = TRUE;
> > +
> > + osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path,
> > + &mad_context);
> > +
> > + p_sw->lft_block_id_ho++;
> > +
> > +Exit:
> > + OSM_LOG_EXIT(p_mgr->p_log);
> > +}
> > +
> > /**********************************************************************
> > **********************************************************************/
> > static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item,
> > @@ -870,6 +869,28 @@ static void
> sort_ports_by_switch_load(osm_ucast_mgr_t * m)
> > add_sw_endports_to_order_list(s[i], m);
> > }
> >
> > +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr)
> > +{
> > + cl_qmap_t *p_sw_tbl;
> > + osm_switch_t *p_sw;
> > + int i;
> > +
> > + for (i = 0;
> > + !p_mgr->p_subn->opt.max_smps_per_node ||
> > + i < p_mgr->p_subn->opt.max_smps_per_node;
> > + i++) {
> > + p_mgr->sm->lfts_updated = 0;
> > + p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
> > + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> > + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> > + ucast_mgr_pipeline_tbl(p_sw, p_mgr);
> > + p_sw = (osm_switch_t *)
> cl_qmap_next(&p_sw->map_item);
> > + }
> > + if (!p_mgr->sm->lfts_updated)
> > + break;
> > + }
> > +}
>
> Is it possible (for example in case of send errors) that "partial" LFT
> blocks sending will trigger wait_for_pending_transaction() completion?
I don't know. Is this different from the original algorithm in the case of
send errors ?
-- Hal
>
>
> Sasha
>
> > +
> > static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
> > {
> > cl_qlist_init(&p_mgr->port_order_list);
> > @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *
> p_mgr)
> > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
> ucast_mgr_process_tbl,
> > p_mgr);
> >
> > + osm_ucast_pipeline_tbl(p_mgr);
> > +
> > cl_qlist_remove_all(&p_mgr->port_order_list);
> >
> > return 0;
> >
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20090804/8e38491b/attachment.html>
More information about the general
mailing list