[ofa-general] Re: [PATCHv2] opensm: Parallelize (Stripe) LFT sets across switches
Sasha Khapyorsky
sashak at voltaire.com
Tue Aug 4 08:27:00 PDT 2009
Hi,
On 19:28 Thu 30 Jul , Hal Rosenstock wrote:
>
> Currently, MADs are pipelined to a single switch at a time which
> effectively serializes these requests due to processing at the SMA.
> This patch pipelines (stripes) them across the switches first before
> proceeding with successive blocks. As a result of this striping,
> multiple switches can process the set and respond concurrently
> which results in an improvement to the subnet initialization time.
The idea is nice. However, I have some initial comments about the
implementation.
BTW, is there a reason to keep an option that preserves the current
behavior? (I don't know, just asking.)
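To make the difference between the two orderings concrete, here is a minimal sketch of the order in which Set requests would be issued. It is only an illustration: `struct lft_send`, `order_serialized()` and `order_striped()` are hypothetical names, not OpenSM code, and the sketch ignores that the patch skips unchanged blocks and sends follow-up blocks from the response handler.

```c
#include <stddef.h>

/* One (switch, block) pair, in the order a Set would be issued.
 * Hypothetical type, used only for this illustration. */
struct lft_send {
	unsigned sw;
	unsigned block;
};

/* Current behavior: all blocks of switch 0, then all blocks of
 * switch 1, and so on. A single SMA handles each run of requests. */
static size_t order_serialized(unsigned n_sw, unsigned n_blocks,
			       struct lft_send *out)
{
	size_t n = 0;

	for (unsigned sw = 0; sw < n_sw; sw++)
		for (unsigned b = 0; b < n_blocks; b++)
			out[n++] = (struct lft_send){ sw, b };
	return n;
}

/* Striped behavior: block 0 to every switch, then block 1 to every
 * switch, so consecutive requests land on different SMAs. */
static size_t order_striped(unsigned n_sw, unsigned n_blocks,
			    struct lft_send *out)
{
	size_t n = 0;

	for (unsigned b = 0; b < n_blocks; b++)
		for (unsigned sw = 0; sw < n_sw; sw++)
			out[n++] = (struct lft_send){ sw, b };
	return n;
}
```

With 3 switches and 2 blocks each, the serialized order starts (0,0), (0,1), (1,0), while the striped order starts (0,0), (1,0), (2,0), writing each pair as (switch, block).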
> This patch also introduces a new config option (max_smps_per_node)
> which indicates how deep the per node pipeline is (current default is 4).
> This also has the effect of limiting the number of times that the switch
> list is traversed. Maybe this embellishment is unnecessary.
Then why is it needed?
> All unicast routing protocols are updated for this with the exception
> of file.
>
> A similar subsequent change will do this for MFTs.
>
> Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il> wrote:
>
> With a small cluster of 17 IS4 switches and 11 HCAs and
> to artificially increase the cluster, LMC of 7 was used
> including EnhancedSwitchPort 0 LMC.
>
> With the new code, LFT configuration is more than twice as
> fast as with the old code :)
> Current ucast manager ran on average for ~250msec, with the
> new code - 110-120msec.
>
> Routing calculation phase of the ucast manager took ~1200 usec,
> the rest was sending the blocks and waiting for no more pending
> transactions.
>
> No noticeable difference between various max_smps_per_node values
> was observed.
What is the reason? And what was the value of 'max_wire_smps'?
> Here are some detailed results of different executions (the
> number on the left is timer value in usec):
>
> Current ucast manager (w/o the optimization):
>
> 000000 [LFT]: osm_ucast_mgr_process() - START
> 001131 [LFT]: ucast_mgr_process_tbl() - START
> 032251 [LFT]: ucast_mgr_process_tbl() - END
> 032263 [LFT]: osm_ucast_mgr_process() - END
> 253416 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=0:
>
> 001417 [LFT]: osm_ucast_mgr_process() - START (0 max_smps_per_node)
> 002690 [LFT]: ucast_mgr_process_tbl() - START
> 032946 [LFT]: ucast_mgr_process_tbl() - END
> 032948 [LFT]: osm_ucast_pipeline_tbl() - START
> 033846 [LFT]: osm_ucast_pipeline_tbl() - END
> 033858 [LFT]: osm_ucast_mgr_process() - END
> 108203 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=1:
>
> 007474 [LFT]: osm_ucast_mgr_process() - START (1 max_smps_per_node)
> 008735 [LFT]: ucast_mgr_process_tbl() - START
> 040071 [LFT]: ucast_mgr_process_tbl() - END
> 040074 [LFT]: osm_ucast_pipeline_tbl() - START
> 040103 [LFT]: osm_ucast_pipeline_tbl() - END
> 040114 [LFT]: osm_ucast_mgr_process() - END
> 120097 [LFT]: Done wait_for_pending_transactions()
>
> New code, max_smps_per_node=4:
>
> 004137 [LFT]: osm_ucast_mgr_process() - START (4 max_smps_per_node)
> 005380 [LFT]: ucast_mgr_process_tbl() - START
> 037436 [LFT]: ucast_mgr_process_tbl() - END
> 037439 [LFT]: osm_ucast_pipeline_tbl() - START
> 037495 [LFT]: osm_ucast_pipeline_tbl() - END
> 037506 [LFT]: osm_ucast_mgr_process() - END
> 114983 [LFT]: Done wait_for_pending_transactions()
>
>
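For comparing the four runs above, note that the timer does not start at zero in the new-code runs. The following sketch subtracts the osm_ucast_mgr_process() START value from the "Done wait_for_pending_transactions()" value. Treating that START line as the reference point is an assumption, and `struct lft_run` and the helper names are hypothetical.

```c
#include <stddef.h>

/* Timer values in usec, copied from the quoted logs.
 * start_usec is the osm_ucast_mgr_process() START line (assumed
 * reference point), done_usec the "Done wait_for..." line. */
struct lft_run {
	const char *label;
	unsigned long start_usec;
	unsigned long done_usec;
};

static const struct lft_run lft_runs[] = {
	{ "current code",        0,    253416 },
	{ "max_smps_per_node=0", 1417, 108203 },
	{ "max_smps_per_node=1", 7474, 120097 },
	{ "max_smps_per_node=4", 4137, 114983 },
};

/* Time from START until no transactions are pending. */
static unsigned long elapsed_usec(const struct lft_run *r)
{
	return r->done_usec - r->start_usec;
}

/* How many times faster run r is than the baseline run. */
static double speedup_vs(const struct lft_run *base,
			 const struct lft_run *r)
{
	return (double)elapsed_usec(base) / (double)elapsed_usec(r);
}
```

Under that assumption the three new-code runs take roughly 107, 113 and 111 msec, each a bit more than twice as fast as the ~253 msec baseline, and they differ from one another by only a few percent.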
> With IS3 based Qlogic switches, which do not handle DR packets forwarding
> in HW, with a fabric of ~1100 HCAs, ~280 switches:
>
> Current OSM configures LFTs in ~2 seconds.
> New algorithm does the same job in 1.4-1.6 seconds (30%-20% speed up),
> depending on the max_smps_per_node value.
>
> As in case of IS4 switches, the shortest config time was obtained with
> max_smps_per_node=0, which is unlimited pipeline.
>
>
> Signed-off-by: Hal Rosenstock <hal.rosenstock at gmail.com>
> ---
> Changes since v1:
> Added Yevgeny's performance data to patch description above
> No change to actual patch
>
> diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h
> index 0537002..617e8a9 100644
> --- a/opensm/include/opensm/osm_base.h
> +++ b/opensm/include/opensm/osm_base.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
> *
> @@ -449,6 +449,18 @@ BEGIN_C_DECLS
> */
> #define OSM_DEFAULT_SMP_MAX_ON_WIRE 4
> /***********/
> +/****d* OpenSM: Base/OSM_DEFAULT_SMP_MAX_PER_NODE
> +* NAME
> +* OSM_DEFAULT_SMP_MAX_PER_NODE
> +*
> +* DESCRIPTION
> +* Specifies the default number of VL15 SMP MADs allowed
> +* per node for certain attributes.
> +*
> +* SYNOPSIS
> +*/
> +#define OSM_DEFAULT_SMP_MAX_PER_NODE 4
> +/***********/
> /****d* OpenSM: Base/OSM_SM_DEFAULT_QP0_RCV_SIZE
> * NAME
> * OSM_SM_DEFAULT_QP0_RCV_SIZE
> diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h
> index cc8321d..1776380 100644
> --- a/opensm/include/opensm/osm_sm.h
> +++ b/opensm/include/opensm/osm_sm.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -130,6 +130,7 @@ typedef struct osm_sm {
> osm_sm_mad_ctrl_t mad_ctrl;
> osm_lid_mgr_t lid_mgr;
> osm_ucast_mgr_t ucast_mgr;
> + boolean_t lfts_updated;
The name is unclear - what it actually means is "update in progress".
> cl_disp_reg_handle_t sweep_fail_disp_h;
> cl_disp_reg_handle_t ni_disp_h;
> cl_disp_reg_handle_t pi_disp_h;
> @@ -524,6 +525,45 @@ osm_resp_send(IN osm_sm_t * sm,
> *
> *********/
>
> +/****f* OpenSM: SM/osm_sm_set_next_lft_block
> +* NAME
> +* osm_sm_set_next_lft_block
> +*
> +* DESCRIPTION
> +* Set the next LFT (LinearForwardingTable) block in the indicated switch.
> +*
> +* SYNOPSIS
> +*/
> +void
> +osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> + IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> + IN osm_madw_context_t *p_context);
Why should it be in osm_sm.[ch]? osm_ucast_mgr.c or osm_switch.c seems
a much more appropriate place for this.
> +/*
> +* PARAMETERS
> +* p_sm
> +* [in] Pointer to an osm_sm_t object.
> +*
> +* p_switch
> +* [in] Pointer to the switch object.
> +*
> +* p_block
> +* [in] Pointer to the forwarding table block.
> +*
> +* p_path
> +* [in] Pointer to a directed route path object.
> +*
> +* p_context
> +* [in] Mad wrapper context structure to be copied into the wrapper
> +* context, and thus visible to the recipient of the response.
> +*
> +* RETURN VALUES
> +* None
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/
> +
> /****f* OpenSM: SM/osm_sm_mcgrp_join
> * NAME
> * osm_sm_mcgrp_join
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
> index 59a32ad..f12afae 100644
> --- a/opensm/include/opensm/osm_subnet.h
> +++ b/opensm/include/opensm/osm_subnet.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
> *
> @@ -147,6 +147,7 @@ typedef struct osm_subn_opt {
> uint32_t sweep_interval;
> uint32_t max_wire_smps;
> uint32_t transaction_timeout;
> + uint32_t max_smps_per_node;
> uint8_t sm_priority;
> uint8_t lmc;
> boolean_t lmc_esp0;
> diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
> index 7ce28c5..e12113f 100644
> --- a/opensm/include/opensm/osm_switch.h
> +++ b/opensm/include/opensm/osm_switch.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -102,6 +102,7 @@ typedef struct osm_switch {
> osm_port_profile_t *p_prof;
> uint8_t *lft;
> uint8_t *new_lft;
> + uint16_t lft_block_id_ho;
> osm_mcast_tbl_t mcast_tbl;
> unsigned endport_links;
> unsigned need_update;
> diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h
> index a040476..fdea49a 100644
> --- a/opensm/include/opensm/osm_ucast_mgr.h
> +++ b/opensm/include/opensm/osm_ucast_mgr.h
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -233,17 +233,42 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN struct osm_sm * sm);
> * osm_ucast_mgr_destroy
> *********/
>
> -/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_table
> +/****f* OpenSM: Unicast Manager/osm_ucast_pipeline_tbl
> * NAME
> -* osm_ucast_mgr_set_fwd_table
> +* osm_ucast_pipeline_tbl
> *
> * DESCRIPTION
> -* Setup forwarding table for the switch (from prepared new_lft).
> +* The osm_ucast_pipeline_tbl function pipelines the LFT
> +* (LinearForwardingTable) sets across the switches
> +* (from prepared new_lft).
> *
> * SYNOPSIS
> */
> -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr,
> - IN osm_switch_t * const p_sw);
> +void osm_ucast_pipeline_tbl(IN osm_ucast_mgr_t * p_mgr);
> +/*
> +* PARAMETERS
> +* p_mgr
> +* [in] Pointer to an osm_ucast_mgr_t object.
> +*
> +* RETURN VALUES
> +* None.
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/
> +
> +/****f* OpenSM: Unicast Manager/osm_ucast_mgr_set_fwd_tbl_top
> +* NAME
> +* osm_ucast_mgr_set_fwd_tbl_top
> +*
> +* DESCRIPTION
> +* Setup LinearFDBTop for the switch.
> +*
> +* SYNOPSIS
> +*/
> +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * const p_mgr,
> + IN osm_switch_t * const p_sw);
I don't really like this separation (osm_ucast_mgr_set_fwd_tbl_top and
osm_ucast_pipeline_tbl). Why not use a single function and update all
routing engines appropriately (you need to do that anyway), so that they
only fill in the new_lft tables?
> /*
> * PARAMETERS
> * p_mgr
> diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c
> index 2edb8d3..cb131b4 100644
> --- a/opensm/opensm/osm_lin_fwd_rcv.c
> +++ b/opensm/opensm/osm_lin_fwd_rcv.c
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -36,7 +36,7 @@
> /*
> * Abstract:
> * Implementation of osm_lft_rcv_t.
> - * This object represents the NodeDescription Receiver object.
> + * This object represents the Linear Forwarding Table Receiver object.
> * This object is part of the opensm family of objects.
> */
>
> @@ -55,6 +55,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
> {
> osm_sm_t *sm = context;
> osm_madw_t *p_madw = data;
> + osm_dr_path_t *p_path;
> ib_smp_t *p_smp;
> uint32_t block_num;
> osm_switch_t *p_sw;
> @@ -62,6 +63,8 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
> uint8_t *p_block;
> ib_net64_t node_guid;
> ib_api_status_t status;
> + uint8_t block[IB_SMP_DATA_SIZE];
> + osm_madw_context_t mad_context;
>
> CL_ASSERT(sm);
>
> @@ -94,6 +97,16 @@ void osm_lft_rcv_process(IN void *context, IN void *data)
> "\n\t\t\t\tSwitch 0x%" PRIx64 "\n",
> ib_get_err_str(status), cl_ntoh64(node_guid));
> }
> +
> + p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> +
> + mad_context.lft_context.node_guid = node_guid;
> + mad_context.lft_context.set_method = TRUE;
> +
> + osm_sm_set_next_lft_block(sm, p_sw, &block[0], p_path,
> + &mad_context);
> +
> + p_sw->lft_block_id_ho++;
Wouldn't it be simpler to encode the block_id in the MAD context?
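As a rough sketch of what that could look like: the context carries the block number of the request it belongs to, and the response handler derives the next block from it instead of from a counter kept in the switch object. Everything here is hypothetical (`struct lft_ctx`, `lft_ctx_next()`, the `stride` parameter); the quoted patch only uses node_guid and set_method in the LFT context. The stride stands for the number of requests kept in flight per switch, and skipping of unchanged blocks is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical LFT MAD context with the block number added. */
struct lft_ctx {
	uint64_t node_guid;
	uint16_t block_id_ho;	/* block this request carried */
	bool set_method;
};

/*
 * Given the context of a completed request, fill in the context for
 * the follow-up request. With `stride` requests in flight per switch
 * (initially blocks 0 .. stride-1), the response for block b triggers
 * block b + stride. Returns false when no block is left to send.
 */
static bool lft_ctx_next(const struct lft_ctx *done, uint16_t n_blocks,
			 uint16_t stride, struct lft_ctx *next)
{
	uint32_t b = (uint32_t)done->block_id_ho + stride;

	if (stride == 0 || b >= n_blocks)
		return false;

	*next = *done;
	next->block_id_ho = (uint16_t)b;
	return true;
}
```

With this shape the receive path needs no per-switch send state: each response is self-describing, at the cost of two more bytes in the context.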
> }
>
> CL_PLOCK_RELEASE(sm->p_lock);
> diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c
> index daa60ff..4e0fd2a 100644
> --- a/opensm/opensm/osm_sm.c
> +++ b/opensm/opensm/osm_sm.c
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
> *
> @@ -441,6 +441,45 @@ Exit:
>
> /**********************************************************************
> **********************************************************************/
> +void osm_sm_set_next_lft_block(IN osm_sm_t *p_sm, IN osm_switch_t *p_sw,
> + IN uint8_t *p_block, IN osm_dr_path_t *p_path,
> + IN osm_madw_context_t *context)
> +{
> + ib_api_status_t status;
> +
> + for (;
> + osm_switch_get_lft_block(p_sw, p_sw->lft_block_id_ho, p_block);
> + p_sw->lft_block_id_ho++) {
> + if (!p_sw->need_update && !p_sm->p_subn->need_update &&
> + !memcmp(p_block,
> + p_sw->new_lft + p_sw->lft_block_id_ho * IB_SMP_DATA_SIZE,
> + IB_SMP_DATA_SIZE))
> + continue;
> +
> + p_sm->lfts_updated = 1;
> +
> + OSM_LOG(p_sm->p_log, OSM_LOG_DEBUG,
> + "Writing FT block %u to switch 0x%" PRIx64 "\n",
> + p_sw->lft_block_id_ho,
> + cl_ntoh64(context->lft_context.node_guid));
> +
> + status = osm_req_set(p_sm, p_path,
> + p_sw->new_lft +
> + p_sw->lft_block_id_ho * IB_SMP_DATA_SIZE,
> + IB_SMP_DATA_SIZE, IB_MAD_ATTR_LIN_FWD_TBL,
> + cl_hton32(p_sw->lft_block_id_ho),
> + CL_DISP_MSGID_NONE, context);
> +
> + if (status != IB_SUCCESS)
> + OSM_LOG(p_sm->p_log, OSM_LOG_ERROR, "ERR 2E11: "
> + "Sending linear fwd. tbl. block failed (%s)\n",
> + ib_get_err_str(status));
> + break;
> + }
> +}
> +
> +/**********************************************************************
> + **********************************************************************/
> static ib_api_status_t sm_mgrp_process(IN osm_sm_t * p_sm,
> IN osm_mgrp_t * p_mgrp)
> {
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index ec15f8a..1964b7f 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
> *
> @@ -295,6 +295,7 @@ static const opt_rec_t opt_tbl[] = {
> { "m_key_lease_period", OPT_OFFSET(m_key_lease_period), opts_parse_net16, NULL, 1 },
> { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32, NULL, 1 },
> { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32, NULL, 1 },
> + { "max_smps_per_node", OPT_OFFSET(max_smps_per_node), opts_parse_uint32, NULL, 1 },
> { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 },
> { "console_port", OPT_OFFSET(console_port), opts_parse_uint16, NULL, 0 },
> { "transaction_timeout", OPT_OFFSET(transaction_timeout), opts_parse_uint32, NULL, 1 },
> @@ -671,6 +672,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt)
> p_opt->m_key_lease_period = 0;
> p_opt->sweep_interval = OSM_DEFAULT_SWEEP_INTERVAL_SECS;
> p_opt->max_wire_smps = OSM_DEFAULT_SMP_MAX_ON_WIRE;
> + p_opt->max_smps_per_node = OSM_DEFAULT_SMP_MAX_PER_NODE;
> p_opt->console = strdup(OSM_DEFAULT_CONSOLE);
> p_opt->console_port = OSM_DEFAULT_CONSOLE_PORT;
> p_opt->transaction_timeout = OSM_DEFAULT_TRANS_TIMEOUT_MILLISEC;
> @@ -1461,6 +1463,10 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
> "max_wire_smps %u\n\n"
> "# The maximum time in [msec] allowed for a transaction to complete\n"
> "transaction_timeout %u\n\n"
> + "# Maximum number of SMPs per node sent in parallel\n"
> + "# (0 means unlimited)\n"
> + "# Only applies to certain attributes\n"
> + "max_smps_per_node %u\n\n"
> "# Maximal time in [msec] a message can stay in the incoming message queue.\n"
> "# If there is more than one message in the queue and the last message\n"
> "# stayed in the queue more than this value, any SA request will be\n"
> @@ -1470,6 +1476,7 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts)
> "single_thread %s\n\n",
> p_opts->max_wire_smps,
> p_opts->transaction_timeout,
> + p_opts->max_smps_per_node,
> p_opts->max_msg_fifo_timeout,
> p_opts->single_thread ? "TRUE" : "FALSE");
>
> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
> index 216b496..31c930b 100644
> --- a/opensm/opensm/osm_ucast_cache.c
> +++ b/opensm/opensm/osm_ucast_cache.c
> @@ -1,5 +1,5 @@
> /*
> - * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2008,2009 Mellanox Technologies LTD. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> * licenses. You may choose to be licensed under the terms of the GNU
> @@ -1085,9 +1085,11 @@ int osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr)
> memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
> }
>
> - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
> }
>
> + osm_ucast_pipeline_tbl(p_mgr);
> +
> return 0;
> }
>
> diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c
> index 2505c46..099e8ba 100644
> --- a/opensm/opensm/osm_ucast_file.c
> +++ b/opensm/opensm/osm_ucast_file.c
> @@ -168,8 +168,8 @@ static int do_ucast_file_load(void *context)
> "routing algorithm\n");
> } else if (!strncmp(p, "Unicast lids", 12)) {
> if (p_sw)
> - osm_ucast_mgr_set_fwd_table(&p_osm->sm.
> - ucast_mgr, p_sw);
> + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.
> + ucast_mgr, p_sw);
> q = strstr(p, " guid 0x");
> if (!q) {
> OSM_LOG(&p_osm->log, OSM_LOG_ERROR,
> @@ -247,7 +247,7 @@ static int do_ucast_file_load(void *context)
> }
>
> if (p_sw)
> - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
>
> fclose(file);
> return 0;
I suppose this breaks the 'file' routing engine (did you test it?):
instead of setting up the switch LFTs, it will only update their TOPs.
> diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
> index bde6dbd..d65c685 100644
> --- a/opensm/opensm/osm_ucast_ftree.c
> +++ b/opensm/opensm/osm_ucast_ftree.c
> @@ -2,7 +2,7 @@
> * Copyright (c) 2009 Simula Research Laboratory. All rights reserved.
> * Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -1905,8 +1905,8 @@ static void set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
> ftree_fabric_t *p_ftree = (ftree_fabric_t *) context;
>
> p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid;
> - osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr,
> - p_sw->p_osm_sw);
> + osm_ucast_mgr_set_fwd_tbl_top(&p_ftree->p_osm->sm.ucast_mgr,
> + p_sw->p_osm_sw);
> }
>
> /***************************************************
> @@ -4005,6 +4005,8 @@ static int do_routing(IN void *context)
> /* for each switch, set its fwd table */
> cl_qmap_apply_func(&p_ftree->sw_tbl, set_sw_fwd_table, (void *)p_ftree);
>
> + osm_ucast_pipeline_tbl(&p_ftree->p_osm->sm.ucast_mgr);
> +
> /* write out hca ordering file */
> fabric_dump_hca_ordering(p_ftree);
>
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index 12b5e34..adf5f6c 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> * Copyright (c) 2007 Simula Research Laboratory. All rights reserved.
> * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved.
> @@ -1045,8 +1045,11 @@ static void populate_fwd_tbls(lash_t * p_lash)
> physical_egress_port);
> }
> } /* for */
> - osm_ucast_mgr_set_fwd_table(&p_osm->sm.ucast_mgr, p_sw);
> + osm_ucast_mgr_set_fwd_tbl_top(&p_osm->sm.ucast_mgr, p_sw);
> }
> +
> + osm_ucast_pipeline_tbl(&p_osm->sm.ucast_mgr);
> +
> OSM_LOG_EXIT(p_log);
> }
>
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> index 78a7031..86d1c98 100644
> --- a/opensm/opensm/osm_ucast_mgr.c
> +++ b/opensm/opensm/osm_ucast_mgr.c
> @@ -1,6 +1,6 @@
> /*
> * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
> - * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
> + * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
> * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> @@ -315,16 +315,14 @@ Exit:
>
> /**********************************************************************
> **********************************************************************/
> -int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
> - IN osm_switch_t * p_sw)
> +int osm_ucast_mgr_set_fwd_tbl_top(IN osm_ucast_mgr_t * p_mgr,
> + IN osm_switch_t * p_sw)
> {
> osm_node_t *p_node;
> osm_dr_path_t *p_path;
> osm_madw_context_t context;
> ib_api_status_t status;
> ib_switch_info_t si;
> - uint16_t block_id_ho = 0;
> - uint8_t block[IB_SMP_DATA_SIZE];
> boolean_t set_swinfo_require = FALSE;
> uint16_t lin_top;
> uint8_t life_state;
> @@ -382,48 +380,8 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * p_mgr,
> ib_get_err_str(status));
> }
>
> - /*
> - Send linear forwarding table blocks to the switch
> - as long as the switch indicates it has blocks needing
> - configuration.
> - */
> -
> - context.lft_context.node_guid = osm_node_get_node_guid(p_node);
> - context.lft_context.set_method = TRUE;
> -
> - if (!p_sw->new_lft) {
> - /* any routing should provide the new_lft */
> - CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> - p_mgr->cache_valid && !p_sw->need_update);
> - goto Exit;
> - }
> -
> - for (block_id_ho = 0;
> - osm_switch_get_lft_block(p_sw, block_id_ho, block);
> - block_id_ho++) {
> - if (!p_sw->need_update && !p_mgr->p_subn->need_update &&
> - !memcmp(block,
> - p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE,
> - IB_SMP_DATA_SIZE))
> - continue;
> -
> - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> - "Writing FT block %u\n", block_id_ho);
> -
> - status = osm_req_set(p_mgr->sm, p_path,
> - p_sw->new_lft +
> - block_id_ho * IB_SMP_DATA_SIZE,
> - sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL,
> - cl_hton32(block_id_ho), CL_DISP_MSGID_NONE,
> - &context);
> + p_sw->lft_block_id_ho = 0;
>
> - if (status != IB_SUCCESS)
> - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
> - "Sending linear fwd. tbl. block failed (%s)\n",
> - ib_get_err_str(status));
> - }
> -
> -Exit:
> OSM_LOG_EXIT(p_mgr->p_log);
> return 0;
> }
> @@ -508,7 +466,7 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
> }
> }
>
> - osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
> + osm_ucast_mgr_set_fwd_tbl_top(p_mgr, p_sw);
>
> if (p_mgr->p_subn->opt.lmc)
> free_ports_priv(p_mgr);
> @@ -516,6 +474,47 @@ static void ucast_mgr_process_tbl(IN cl_map_item_t * p_map_item,
> OSM_LOG_EXIT(p_mgr->p_log);
> }
>
> +static void ucast_mgr_pipeline_tbl(IN osm_switch_t *p_sw,
> + IN osm_ucast_mgr_t *p_mgr)
> +{
> + osm_dr_path_t *p_path;
> + osm_madw_context_t mad_context;
> + uint8_t block[IB_SMP_DATA_SIZE];
> +
> + OSM_LOG_ENTER(p_mgr->p_log);
> +
> + CL_ASSERT(p_sw && p_sw->p_node);
> +
> + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
> + "Processing switch 0x%" PRIx64 "\n",
> + cl_ntoh64(osm_node_get_node_guid(p_sw->p_node)));
> +
> + /*
> + Send linear forwarding table blocks to the switch
> + as long as the switch indicates it has blocks needing
> + configuration.
> + */
> + if (!p_sw->new_lft) {
> + /* any routing should provide the new_lft */
> + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache &&
> + p_mgr->cache_valid && !p_sw->need_update);
> + goto Exit;
> + }
> +
> + p_path = osm_physp_get_dr_path_ptr(osm_node_get_physp_ptr(p_sw->p_node, 0));
> +
> + mad_context.lft_context.node_guid = osm_node_get_node_guid(p_sw->p_node);
> + mad_context.lft_context.set_method = TRUE;
> +
> + osm_sm_set_next_lft_block(p_mgr->sm, p_sw, &block[0], p_path,
> + &mad_context);
> +
> + p_sw->lft_block_id_ho++;
> +
> +Exit:
> + OSM_LOG_EXIT(p_mgr->p_log);
> +}
> +
> /**********************************************************************
> **********************************************************************/
> static void ucast_mgr_process_neighbors(IN cl_map_item_t * p_map_item,
> @@ -870,6 +869,28 @@ static void sort_ports_by_switch_load(osm_ucast_mgr_t * m)
> add_sw_endports_to_order_list(s[i], m);
> }
>
> +void osm_ucast_pipeline_tbl(osm_ucast_mgr_t * p_mgr)
> +{
> + cl_qmap_t *p_sw_tbl;
> + osm_switch_t *p_sw;
> + int i;
> +
> + for (i = 0;
> + !p_mgr->p_subn->opt.max_smps_per_node ||
> + i < p_mgr->p_subn->opt.max_smps_per_node;
> + i++) {
> + p_mgr->sm->lfts_updated = 0;
> + p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
> + p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
> + while (p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl)) {
> + ucast_mgr_pipeline_tbl(p_sw, p_mgr);
> + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item);
> + }
> + if (!p_mgr->sm->lfts_updated)
> + break;
> + }
> +}
Is it possible (for example, in case of send errors) that sending only
part of the LFT blocks will let wait_for_pending_transactions() complete?
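The concern can be shown with a toy model. It is not OpenSM code: `struct sm_model`, `struct sw_model` and the functions below are hypothetical, and the single `outstanding` counter merely stands in for whatever wait_for_pending_transactions() waits on. Each response chains the next block for its switch, and a failed send puts nothing on the wire, so the chain for that switch ends.

```c
#include <stdbool.h>

#define NO_FAILURE ((unsigned)-1)

struct sm_model { unsigned outstanding; };	/* requests on the wire */
struct sw_model { unsigned next_block; unsigned n_blocks; };

/* Try to send the switch's next block. `fail_at` simulates a send
 * error on that block number. Returns true if a request went out. */
static bool send_next(struct sm_model *sm, struct sw_model *sw,
		      unsigned fail_at)
{
	unsigned b;

	if (sw->next_block >= sw->n_blocks)
		return false;
	b = sw->next_block++;
	if (b == fail_at)
		return false;	/* error: nothing outstanding, chain ends */
	sm->outstanding++;
	return true;
}

/* Response handler: chain the next block, then retire the response. */
static void on_response(struct sm_model *sm, struct sw_model *sw,
			unsigned fail_at)
{
	send_next(sm, sw, fail_at);
	sm->outstanding--;
}

/* Drive one switch until nothing is outstanding, i.e. until the
 * waiter would wake up. Returns the number of blocks never attempted. */
static unsigned run_until_idle(unsigned n_blocks, unsigned fail_at)
{
	struct sm_model sm = { 0 };
	struct sw_model sw = { 0, n_blocks };

	send_next(&sm, &sw, fail_at);
	while (sm.outstanding)
		on_response(&sm, &sw, fail_at);
	return sw.n_blocks - sw.next_block;
}
```

In this model, run_until_idle(8, 3) goes idle with four blocks (4 through 7) never attempted, while run_until_idle(8, NO_FAILURE) returns 0. The previous per-switch loop kept sending the remaining blocks after a failed send.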
Sasha
> +
> static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
> {
> cl_qlist_init(&p_mgr->port_order_list);
> @@ -904,6 +925,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t * p_mgr)
> cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, ucast_mgr_process_tbl,
> p_mgr);
>
> + osm_ucast_pipeline_tbl(p_mgr);
> +
> cl_qlist_remove_all(&p_mgr->port_order_list);
>
> return 0;
>