From kliteyn at dev.mellanox.co.il Thu Jan 1 00:57:26 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 01 Jan 2009 10:57:26 +0200 Subject: [ofa-general] [PATCH] opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active Message-ID: <495C8576.9060004@dev.mellanox.co.il> Hi Sasha, When switch is coming up after reset, port 0 always reports logical state ACTIVE. OpenSM shouldn't clear sw->need_update flag because of port 0. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_port_info_rcv.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 8763b87..02ad586 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -317,7 +317,7 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, } if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw && - p_node->sw->need_update == 1) + p_node->sw->need_update == 1 && port_num != 0) p_node->sw->need_update = 0; if (p_physp->need_update) -- 1.5.1.4 From vlad at lists.openfabrics.org Thu Jan 1 03:12:07 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 1 Jan 2009 03:12:07 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090101-0200 daily build status Message-ID: <20090101111207.85F32E60E44@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with 
linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dorfman.eli at gmail.com Thu Jan 1 03:22:11 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 01 Jan 2009 13:22:11 +0200 Subject: [ofa-general] [PATCH] OpenSM: update osmeventplugin example for the new TRAP event. In-Reply-To: <495A20D9.1020509@ext.bull.net> References: <20081218164813.55696c45.weiny2@llnl.gov> <495A20D9.1020509@ext.bull.net> Message-ID: <495CA763.3010507@gmail.com> Nicolas Morey Chaisemartin wrote: > Hello, > > I was wondering if there is a doc somewhere with a list of the trap > codes (for generic traps) and what is stored into the associated > ib_mad_notice_attr_t structure? 
It is documented in the IB spec "14.2.5.1 NOTICES AND TRAPS". And in management/opensm/include/iba/ib_types.h > I'm a writing a perf manager plugin for OpenSM (originally based on > opensmskumme) and I'd like to handle TRAP events. > The problem is without the list of trap IDs and their meaning, I'm not > really sure how to handle them, and what to store in the database. > > Thanks > > Nicolas > (and by the way Happy New Year to everyone) > > Ira Weiny wrote: >> It turns out that I already was using the "OSM_EVENT_ID_TRAP" in the >> example >> plugin. >> >> This makes the use work, >> Ira >> >> >> >From 7b744c38fc2aad67586ade81d65326a139a85681 Mon Sep 17 00:00:00 2001 >> From: Ira Weiny >> Date: Thu, 18 Dec 2008 16:16:37 -0800 >> Subject: [PATCH] OpenSM: update osmeventplugin example for the new >> TRAP event. >> >> >> Signed-off-by: Ira Weiny >> --- >> opensm/include/opensm/osm_event_plugin.h | 12 ------------ >> opensm/osmeventplugin/src/osmeventplugin.c | 28 >> ++++++++++++++++++++-------- >> 2 files changed, 20 insertions(+), 20 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_event_plugin.h >> b/opensm/include/opensm/osm_event_plugin.h >> index 0922c65..41a5810 100644 >> --- a/opensm/include/opensm/osm_event_plugin.h >> +++ b/opensm/include/opensm/osm_event_plugin.h >> @@ -131,18 +131,6 @@ typedef struct osm_api_ps_event { >> } osm_epi_ps_event_t; >> >> /** >> ========================================================================= >> - * Trap events >> - */ >> -typedef struct osm_epi_trap_event { >> - osm_epi_port_id_t port_id; >> - uint8_t type; >> - uint32_t prod_type; >> - uint16_t trap_num; >> - uint16_t issuer_lid; >> - time_t time; >> -} osm_epi_trap_event_t; >> - >> -/** >> ========================================================================= >> * Plugin creators should allocate an object of this type >> * (named OSM_EVENT_PLUGIN_IMPL_NAME) >> * The version should be set to OSM_EVENT_PLUGIN_INTERFACE_VER >> diff --git 
a/opensm/osmeventplugin/src/osmeventplugin.c >> b/opensm/osmeventplugin/src/osmeventplugin.c >> index f0781eb..b4d9ce9 100644 >> --- a/opensm/osmeventplugin/src/osmeventplugin.c >> +++ b/opensm/osmeventplugin/src/osmeventplugin.c >> @@ -137,13 +137,21 @@ static void handle_port_select(_log_events_t * >> log, osm_epi_ps_event_t * ps) >> >> /** >> ========================================================================= >> */ >> -static void handle_trap_event(_log_events_t * log, >> osm_epi_trap_event_t * trap) >> +static void handle_trap_event(_log_events_t *log, >> ib_mad_notice_attr_t *p_ntc) >> { >> - fprintf(log->log_file, >> - "Trap event %d from 0x%" PRIx64 " (%s) port %d\n", >> - trap->trap_num, >> - trap->port_id.node_guid, >> - trap->port_id.node_name, trap->port_id.port_num); >> + if (ib_notice_is_generic(p_ntc)) { >> + fprintf(log->log_file, >> + "Generic trap type %d; event %d; from LID 0x%x\n", >> + ib_notice_get_type(p_ntc), >> + cl_ntoh16(p_ntc->g_or_v.generic.trap_num), >> + cl_ntoh16(p_ntc->issuer_lid)); >> + } else { >> + fprintf(log->log_file, >> + "Vendor trap type %d; from LID 0x%x\n", >> + ib_notice_get_type(p_ntc), >> + cl_ntoh16(p_ntc->issuer_lid)); >> + } >> + >> } >> >> /** >> ========================================================================= >> @@ -163,13 +171,17 @@ static void report(void *_log, >> osm_epi_event_id_t event_id, void *event_data) >> handle_port_select(log, (osm_epi_ps_event_t *) event_data); >> break; >> case OSM_EVENT_ID_TRAP: >> - handle_trap_event(log, (osm_epi_trap_event_t *) event_data); >> + handle_trap_event(log, (ib_mad_notice_attr_t *) event_data); >> + break; >> + case OSM_EVENT_ID_SUBNET_UP: >> + fprintf(log->log_file, "Subnet up reported\n"); >> break; >> case OSM_EVENT_ID_MAX: >> default: >> osm_log(log->osmlog, OSM_LOG_ERROR, >> - "Unknown event reported to plugin\n"); >> + "Unknown event (%d) reported to plugin\n", event_id); >> } >> + fflush(log->log_file); >> } >> >> /** >> 
========================================================================= >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From dorfman.eli at gmail.com Thu Jan 1 07:40:40 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 01 Jan 2009 17:40:40 +0200 Subject: [ofa-general] ***SPAM*** [PATCH] opensm: Add new partition keyword for all switches and hca. Message-ID: <495CE3F8.9080506@gmail.com> Add new partition keyword for all switches and hca. To allow firmware upgrade within managed switches we want all switch port 0 to have full membership. 'ALL_SWITCH' means all switch end ports in the subnet 'ALL_CA' means all CA end ports in the subnet. New default partition configuration will be: "Default=0x7fff,ipoib:ALL_CA, ALL_SWITCH=full, SELF=full;" Signed-off-by: Eli Dorfman --- opensm/opensm/osm_prtn.c | 15 +++++++++------ opensm/opensm/osm_prtn_config.c | 10 ++++++++-- 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/opensm/opensm/osm_prtn.c b/opensm/opensm/osm_prtn.c index be51410..8b9301e 100644 --- a/opensm/opensm/osm_prtn.c +++ b/opensm/opensm/osm_prtn.c @@ -135,7 +135,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t * p_log, osm_subn_t * p_subn, } ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, - osm_prtn_t * p, boolean_t full) + osm_prtn_t * p, uint8_t type, boolean_t full) { cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; cl_map_item_t *p_item; @@ -146,10 +146,13 @@ ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, while (p_item != cl_qmap_end(p_port_tbl)) { p_port = (osm_port_t *) p_item; p_item = cl_qmap_next(p_item); - status = osm_prtn_add_port(p_log, p_subn, p, - osm_port_get_guid(p_port), full); - if (status != IB_SUCCESS) - goto _err; + if (type == 0xff || + 
(osm_node_get_type(p_port->p_node) == type)) { + status = osm_prtn_add_port(p_log, p_subn, p, + osm_port_get_guid(p_port), full); + if (status != IB_SUCCESS) + goto _err; + } } _err: @@ -325,7 +328,7 @@ static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, IB_DEFAULT_PARTIAL_PKEY); if (!p) goto _err; - status = osm_prtn_add_all(p_log, p_subn, p, no_config); + status = osm_prtn_add_all(p_log, p_subn, p, 0xff, no_config); if (status != IB_SUCCESS) goto _err; cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c index 9511608..37f2bd6 100644 --- a/opensm/opensm/osm_prtn_config.c +++ b/opensm/opensm/osm_prtn_config.c @@ -64,7 +64,7 @@ extern osm_prtn_t *osm_prtn_make_new(osm_log_t * p_log, osm_subn_t * p_subn, const char *name, uint16_t pkey); extern ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, - osm_prtn_t * p, boolean_t full); + osm_prtn_t * p, uint8_t type, boolean_t full); extern ib_api_status_t osm_prtn_add_port(osm_log_t * p_log, osm_subn_t * p_subn, osm_prtn_t * p, ib_net64_t guid, boolean_t full); @@ -212,7 +212,13 @@ static int partition_add_port(unsigned lineno, struct part_conf *conf, if (!strncmp(name, "ALL", strlen(name))) { return osm_prtn_add_all(conf->p_log, conf->p_subn, p, - full) == IB_SUCCESS ? 0 : -1; + 0xff, full) == IB_SUCCESS ? 0 : -1; + } else if (!strncmp(name, "ALL_SWITCH", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + IB_NODE_TYPE_SWITCH, full) == IB_SUCCESS ? 0 : -1; + } else if (!strncmp(name, "ALL_CA", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + IB_NODE_TYPE_CA, full) == IB_SUCCESS ? 
0 : -1; } else if (!strncmp(name, "SELF", strlen(name))) { guid = cl_ntoh64(conf->p_subn->sm_port_guid); } else { -- 1.5.5 From dorfman.eli at gmail.com Thu Jan 1 08:25:36 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 01 Jan 2009 18:25:36 +0200 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_prtn.c set switch end ports to full member in default partition configuration Message-ID: <495CEE80.1090006@gmail.com> set switch end ports to full member in default partition configuration. Signed-off-by: Eli Dorfman --- opensm/opensm/osm_prtn.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_prtn.c b/opensm/opensm/osm_prtn.c index 8b9301e..21c7add 100644 --- a/opensm/opensm/osm_prtn.c +++ b/opensm/opensm/osm_prtn.c @@ -331,6 +331,9 @@ static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, status = osm_prtn_add_all(p_log, p_subn, p, 0xff, no_config); if (status != IB_SUCCESS) goto _err; + status = osm_prtn_add_all(p_log, p_subn, p, IB_NODE_TYPE_SWITCH, TRUE); + if (status != IB_SUCCESS) + goto _err; cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); status = osm_prtn_add_port(p_log, p_subn, p, p_subn->sm_port_guid, TRUE); -- 1.5.5 From vlad at lists.openfabrics.org Fri Jan 2 03:12:19 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 2 Jan 2009 03:12:19 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090102-0200 daily build status Message-ID: <20090102111219.626F5E60F34@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 
Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From alex.estrin at qlogic.com Fri Jan 2 08:39:47 2009 From: alex.estrin at qlogic.com (Alex Estrin) Date: Fri, 2 Jan 2009 10:39:47 -0600 Subject: [ofa-general] [PATCH] ipoib: failure during startup wiith non-default pkey set. Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624757@MNEXMB1.qlogic.org> It seems this patch was missed. Reposting. Ipoib uses the first pkey in the pkey table for its ib0 interface. 
Race is possible during bootup, when ipoib starts before the port is Active and has a default pkey table with 0xffff as the only pkey. SM can program the pkey table differently when moves port to Active, but at this point ipoib already started using default pkey. However there is no race, if ipoib started after the port is Active, then ipoib will find the first pkey as the SM programmed it. Proposed patch will delay ib0 interface initialization until port moved to Active state. After port is Active, interface will pickup correct pkey, then adjust broadcast gid before it joined broadcast group. Please note, the patch is not intended to touch sub-interfaces with locally programmed pkeys. Signed-off-by: Alex Estrin diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 784c291..459e2b9 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -719,7 +719,25 @@ int ipoib_ib_dev_open(struct net_device *dev) static void ipoib_pkey_dev_check_presence(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_port_attr port_attr; u16 pkey_index = 0; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + if (ib_query_port(priv->ca, priv->port, &port_attr)) { + ipoib_warn(priv, "Query port attrs failed\n"); + return; + } + if (port_attr.state != IB_PORT_ACTIVE) { + return; + } + if (ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey)) { + ipoib_warn(priv, "Query P_Key table entry 0 failed\n"); + return; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 016a057..4d270e2 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -556,6 +556,13 @@ void ipoib_mcast_join_task(struct work_struct *work) } spin_lock_irq(&priv->lock); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + /* fix broadcast gid in case if pkey was changed */ + priv->pkey |= 0x8000; + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + } memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid)); priv->broadcast = broadcast; -------------- next part -------------- A non-text attachment was scrubbed... Name: ipoib_pkey_bootup_race.patch Type: application/octet-stream Size: 1805 bytes Desc: ipoib_pkey_bootup_race.patch URL: From vlad at lists.openfabrics.org Sat Jan 3 03:12:02 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 3 Jan 2009 03:12:02 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090103-0200 daily build status Message-ID: <20090103111202.AF157E60908@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From fengfeixuewu at 163.com Sun Jan 4 00:03:48 2009
From: fengfeixuewu at 163.com (songzhonglei)
Date: Sun, 4 Jan 2009 16:03:48 +0800
Subject: [ofa-general] infiniband on debian
Message-ID: <200901041603447034776@163.com>

hi,
I wanted to enable IPoIB on Debian, so I recompiled the kernel with InfiniBand support. After rebooting I ran modprobe ib_ipoib.
lsmod shows: ib_ipoib, ib_sa, ib_mthca, ib_mad, ib_core.
Then I ran ifconfig ib0 x.x.x.x followed by ifconfig ib0 up, but it failed.
dmesg showed:
ADDRCONF(NETDEV_UP):ib0:link is not ready
ib0:multicast join failed for ff12:401b:ffff:0000:0000............................,status-22
What's wrong? And what should I do if I want to use IPoIB on Debian?
Thank you.

2009-01-04
songzhonglei
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From vlad at lists.openfabrics.org Sun Jan 4 03:17:51 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 4 Jan 2009 03:17:51 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090104-0200 daily build status Message-ID: <20090104111752.3DE10E60E7D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From fengfeixuewu at 163.com Sat Jan 3 23:54:12 2009
From: fengfeixuewu at 163.com (songzhonglei)
Date: Sun, 4 Jan 2009 15:54:12 +0800
Subject: [ofa-general] [openib-general] Infiniband on Debian etch RC1
Message-ID: <200901041554104215734@163.com>

hi,
I wanted to enable IPoIB on Debian, so I recompiled the kernel with InfiniBand support. After rebooting I ran modprobe ib_ipoib.
lsmod shows: ib_ipoib, ib_sa, ib_mthca, ib_mad, ib_core.
Then I ran ifconfig ib0 x.x.x.x followed by ifconfig ib0 up, but it failed.
dmesg showed:
ADDRCONF(NETDEV_UP):ib0:link is not ready
ib0:multicast join failed for ff12:401b:ffff:0000:0000............................,status-22
What's wrong? And what should I do if I want to use IPoIB on Debian?
Thank you.

2009-01-04
songzhonglei

From venkatvenkatsubra at yahoo.com Sun Jan 4 07:48:29 2009
From: venkatvenkatsubra at yahoo.com (Venkat Venkatsubra)
Date: Sun, 4 Jan 2009 07:48:29 -0800 (PST)
Subject: ***SPAM*** Re: [ofa-general] [RDMA CM IPv6 PATCHv7 2/2] RDMA CM
Message-ID: <579253.6715.qm@web58304.mail.re3.yahoo.com>

To return the error on the active side that "this iWARP device doesn't support IPv6", the following places are possibilities:
- rdma_connect()
- rdma_resolve_addr()
- qp setup time

rdma_resolve_addr() would probably be the earliest to handle this. For the driver to expose this lack of capability so that rdma_resolve_addr can figure it out, we have the following structures available at the rdma_resolve_addr time:
- ib_device (pointed to by cma_device)
- iw_cm_verbs (pointed to by ib_device)
- net_device

If we don’t want to touch net_device and if this information can be made available through ib_device it might be simpler.
If the IB devices won't need to differentiate between the availability of IPv4 and IPv6 capability and only iWARP devices might require it, then Roland was suggesting iw_cm_verbs might be one path out.

Seeking suggestions.

Thanks!

Venkat

________________________________
From: Venkat Venkatsubra
To: Roland Dreier ; Aleksey Senin
Cc: Olga Shern ; "general at lists.openfabrics.org"
Sent: Tuesday, December 30, 2008 10:24:35 AM
Subject: Re: [ofa-general] [RDMA CM IPv6 PATCHv7 2/2] RDMA CM

I had a couple of questions regarding RDMA CM supporting IPv6. When an iWARP NIC doesn't support IPv6, what is the earliest point at which an error could be returned saying the feature is unsupported? Any time sooner than rdma_connect() for the active connect side? And what about the passive side?

Venkat

From rdreier at cisco.com Sun Jan 4 20:49:39 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 04 Jan 2009 20:49:39 -0800
Subject: [ofa-general] I'm not going to fix stupid problems in patches any more
Message-ID:

Maybe I'm just in a cranky mood after the holidays, but I'm fed up with spending my time fixing stupid problems with patches sent to me. It seems the only way that I can teach people is to reject patches. So if you send me a patch that I can't apply because of some formatting problem that you should never have made, I'm just going to tell you so and ask you to resend the patch.

So if checkpatch.pl spots obvious problems with your patch, or if you format your email so that I have to edit out duplicate subject lines or other crap from the body, or if there are any of the innumerable other problems I complain about over and over, then that patch doesn't get applied.

I'm not talking about grammatical errors in the changelog or anything like that; a few honest mistakes here and there I can live with.
But if you don't even try to give a changelog that gives enough information for me to evaluate your patch, and also enough for someone reading the patch a few years from now to have a chance at understanding it, expect me to bounce it back to you.

If you learn to use tools like git-send-email or whatever other automation you prefer, sending patches properly should actually be less work for you too. So in addition to saving my time and improving my mood, this should also save your time as well (aside from the time spent learning the tools).

Thanks!
  Roland

From rdreier at cisco.com Sun Jan 4 21:00:11 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 04 Jan 2009 21:00:11 -0800
Subject: [ofa-general] Re: [PATCH v3] ipoib: do not join broadcast group if interface is brought down
In-Reply-To: <495B2C60.6020008@Voltaire.COM> (Yossi Etigin's message of "Wed, 31 Dec 2008 10:25:04 +0200")
References: <495B2C60.6020008@Voltaire.COM>
Message-ID:

So what protects priv->broadcast? It seems that the only lock taken when setting broadcast to NULL is priv->lock. But eg here:

 -	priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
 +	if (priv->broadcast)
 +		priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));

what prevents broadcast from becoming NULL right after it's tested?

Also

 +	spin_lock_irq(&priv->lock);
 +	if (priv->broadcast &&
 +	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
  		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
  			ipoib_mcast_join(dev, priv->broadcast, 0);

doesn't ipoib_mcast_join() do GFP_KERNEL stuff, which would be a problem inside a spinlock? (Have you tested this with lockdep turned on?)

 - R.
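
[Editorial sketch, not part of the original thread.] The two locking questions in the review above can be illustrated in userspace C. This is a minimal sketch under stated assumptions: `struct dev_priv`, `struct mcast_group` and all three function names are hypothetical stand-ins, and a pthread mutex plays the role of priv->lock. It is not the ipoib code and not the fix that was eventually applied. The first helper tests and dereferences the pointer under the same lock the clearing side takes, so the pointer cannot become NULL between the test and the use. The second only makes the decision under the lock and leaves work that may sleep to the caller, after the lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real ipoib structures. */
struct mcast_group {
	int mtu;
	int attached;
};

struct dev_priv {
	pthread_mutex_t lock;          /* plays the role of priv->lock */
	struct mcast_group *broadcast; /* another thread may set this to NULL */
	int mcast_mtu;
};

/* Clearing side: the pointer is only ever reset while holding the lock. */
static void clear_broadcast(struct dev_priv *priv)
{
	assert(priv != NULL);
	pthread_mutex_lock(&priv->lock);
	priv->broadcast = NULL;
	pthread_mutex_unlock(&priv->lock);
}

/*
 * Test and dereference under the same lock, so the pointer cannot be
 * cleared between the NULL check and the use.
 * Returns 1 if mcast_mtu was updated, 0 if there was no broadcast group.
 */
static int update_mcast_mtu(struct dev_priv *priv)
{
	int updated = 0;

	assert(priv != NULL);
	pthread_mutex_lock(&priv->lock);
	if (priv->broadcast) {
		priv->mcast_mtu = priv->broadcast->mtu;
		updated = 1;
	}
	pthread_mutex_unlock(&priv->lock);
	return updated;
}

/*
 * Only decide under the lock. Work that may sleep (in the kernel case,
 * a join that allocates with GFP_KERNEL) is left to the caller, after
 * the lock has been dropped.
 * Returns 1 if a join is needed, 0 otherwise.
 */
static int broadcast_needs_join(struct dev_priv *priv)
{
	int need;

	assert(priv != NULL);
	pthread_mutex_lock(&priv->lock);
	need = priv->broadcast != NULL && !priv->broadcast->attached;
	pthread_mutex_unlock(&priv->lock);
	return need;
}
```

Note that dropping the lock before the join reopens the lifetime question for the group itself: a real fix would need a reference count or some other guarantee that the group outlives the join, which this sketch does not attempt.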
From rdreier at cisco.com Sun Jan 4 21:03:20 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 04 Jan 2009 21:03:20 -0800
Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing)
In-Reply-To: (Roland Dreier's message of "Tue, 30 Dec 2008 15:00:02 -0800")
References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il>
Message-ID:

So do you think my patch (which avoids all the code duplication and goto) is OK, or is that still too much overhead? Keep in mind the global impact of a larger I-cache footprint because of the code duplication...

 - R.

From rdreier at cisco.com Sun Jan 4 20:39:22 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 04 Jan 2009 20:39:22 -0800
Subject: [ofa-general] [PATCH] ipoib: failure during startup wiith non-default pkey set.
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624757@MNEXMB1.qlogic.org>
Message-ID:

a) checkpatch.pl says:

ERROR: trailing whitespace
#11: FILE: drivers/infiniband/ulp/ipoib/ipoib_ib.c:724:
+^I$

ERROR: trailing whitespace
#13: FILE: drivers/infiniband/ulp/ipoib/ipoib_ib.c:726:
+^I^I$

WARNING: braces {} are not necessary for single statement blocks
#19: FILE: drivers/infiniband/ulp/ipoib/ipoib_ib.c:732:
+		if (port_attr.state != IB_PORT_ACTIVE) {
+			return;
+		}

ERROR: trailing whitespace
#21: FILE: drivers/infiniband/ulp/ipoib/ipoib_ib.c:734:
+^I^I} $

total: 3 errors, 1 warnings, 38 lines checked

and as far as I can see, all of the errors and warnings are valid. I'm tired of fixing this type of crap up by hand, so until you send me a clean patch, I'm not going to apply this.

b) I understand the issue you're trying to fix, but thinking about this, it seems that rather than picking whatever the first entry in the P_Key table happens to be for the main IPoIB interface, it would be simpler to understand if we just always used P_Key 0xffff for the main interface and let the user create whatever other interfaces desired for other P_Keys.
Then there wouldn't be any race, and the situation would be easy to understand and manage.

 - R.

From tziporet at mellanox.co.il Sun Jan 4 22:00:15 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 5 Jan 2009 08:00:15 +0200
Subject: [ofa-general] Agenda for the OFED meeting today (Jan 5, 09)
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com>

Hello all,

I hope we all had nice holidays and vacations, and now it's the time to get back to business.

Agenda for OFED meeting today:
1. Conclusions from OFED 1.4 release: Open discussion
2. Do we wish to have OFED 1.4.: Please send pros & cons before the meeting
3. OFED 1.5: Schedule and features.

This is what we presented in SC08 about 1.5:

Preliminary Schedule:
* Feature Freeze: 3/20/09
* Alpha Release: 3/20/09
* Beta Release: 4/20/09
* RC1: 5/5/09
* RC2-RCx: About every 2 weeks as needed
* Release: June 2009

Features:
* Kernel.org: 2.6.28 and 2.6.29
* Multiple Event Queues to support Multi-core CPUs
* NFS/RDMA - GA
* RDS support for iWARP
* OpenMPI 1.3
* Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11
* Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device
* more TBD...

We also presented the OS matrix but I suggest we will close this in the next meeting.

My proposal:
* Have the release in July and not June - so we will have more time for development
* Stick to one kernel version base and not change in the middle since we saw that changing the kernel base caused a delay. We need to decide in the meeting if it is 2.6.29 or we should wait for 2.6.30.
* Add IB over Eth - this is similar to iWARP but more like IB (e.g. including UD), and can work over ConnectX.

Please send your suggestions to the list before the meeting if possible

Tziporet
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From tziporet at dev.mellanox.co.il Mon Jan 5 02:57:52 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 05 Jan 2009 12:57:52 +0200 Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing) In-Reply-To: References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il> Message-ID: <4961E7B0.2050806@mellanox.co.il> Roland Dreier wrote: > So do you think my patch (which avoids all the code duplication and > goto) is OK, or is that still too much overhead? Keep in mind the > global impact of a larger I-cache footprint because of the code > duplication... > > > We didn't have time to measure it yet Will send an update once it will be done Tziporet From vlad at lists.openfabrics.org Mon Jan 5 03:16:15 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 5 Jan 2009 03:16:15 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090105-0200 daily build status Message-ID: <20090105111615.2FDD8E60E76@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with 
linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Mon Jan 5 04:34:22 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 5 Jan 2009 14:34:22 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <4950E07F.6090104@gmail.com> References: <4950E07F.6090104@gmail.com> Message-ID: <20090105123422.GD1494@sashak.voltaire.com> Hi Eli, On 14:58 Tue 23 Dec , Eli Dorfman (Voltaire) wrote: > Add support for PortXmitWait counter > Show PortCounters::PortXmitWait when this capability is supported by the firmware. > If not supported show this counter as 0. 
> > Signed-off-by: Eli Dorfman > --- > infiniband-diags/src/perfquery.c | 10 +++++++++- > libibmad/include/infiniband/mad.h | 1 + > libibmad/src/fields.c | 1 + > 3 files changed, 11 insertions(+), 1 deletions(-) > > diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c > index 7a53e92..4166fff 100644 > --- a/infiniband-diags/src/perfquery.c > +++ b/infiniband-diags/src/perfquery.c > @@ -68,6 +68,7 @@ struct perf_count { > uint32_t rcvdata; > uint32_t xmtpkts; > uint32_t rcvpkts; > + uint32_t xmtwait; > }; > > struct perf_count_ext { > @@ -210,6 +211,8 @@ static void aggregate_perfcounters(void) > aggregate_32bit(&perf_count.xmtpkts, val); > mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); > aggregate_32bit(&perf_count.rcvpkts, val); > + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); > + aggregate_32bit(&perf_count.xmtwait, val); > } Should XMT_WAIT support be added to output_aggregate_perfcounters(), reset and other places too? > > static void output_aggregate_perfcounters(ib_portid_t *portid) > @@ -299,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p > if (extended != 1) { > if (!port_performance_query(pc, portid, port, timeout)) > IBERROR("perfquery"); > + if (!(cap_mask & 0x1000)) { > + /* if PortCounters:PortXmitWait not suppported clear this counter */ > + perf_count.xmtwait = 0; > + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); > + } > if (aggregate) > aggregate_perfcounters(); > - else > + else > mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > } else { > if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index c2ad148..6c313f9 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -413,6 +413,7 @@ enum MAD_FIELDS { > IB_PC_RCV_BYTES_F, > IB_PC_XMT_PKTS_F, > IB_PC_RCV_PKTS_F, > + IB_PC_XMT_WAIT_F, > IB_PC_LAST_F, > > 
/* Basically I'm fine to have two separate patches - one to support XMT_WAIT in libibmad and another one for perfquery, this is a minor although. Sasha > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c > index 6942e85..116e432 100644 > --- a/libibmad/src/fields.c > +++ b/libibmad/src/fields.c > @@ -247,6 +247,7 @@ ib_field_t ib_mad_f [] = { > [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, > [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, > [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, > + [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, > > /* > * SMInfo > -- > 1.5.5 > From sashak at voltaire.com Mon Jan 5 04:42:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 5 Jan 2009 14:42:11 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c Fix memory leak for QOS string parameters. In-Reply-To: <494E2E5A.8050008@gmail.com> References: <494E2E5A.8050008@gmail.com> Message-ID: <20090105124201.GE1494@sashak.voltaire.com> On 13:54 Sun 21 Dec , Eli Dorfman (Voltaire) wrote: > Fix memory leak for QOS string parameters. > > Signed-off-by: Slava Strebkov Applied. Thanks. Some note about patch formatting. If Slava Strebkov is an actual patch's author this should be indicated in the patch comment body in "From:" line (and so 'git am' will detect the author automatically). Also I would expect your s-o-b as well - something like: From: Slava Strebkov Fix memory leak for QOS string parameters. Signed-off-by: Slava Strebkov Signed-off-by: Eli Dorfman For more details look at /usr/src/linux/Documentation/SubmittingPatches. 
Sasha From dorfman.eli at gmail.com Mon Jan 5 04:47:19 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 05 Jan 2009 14:47:19 +0200 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <20090105123422.GD1494@sashak.voltaire.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> Message-ID: <49620157.50003@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 14:58 Tue 23 Dec , Eli Dorfman (Voltaire) wrote: >> Add support for PortXmitWait counter >> Show PortCounters::PortXmitWait when this capability is supported by the firmware. >> If not supported show this counter as 0. >> >> Signed-off-by: Eli Dorfman >> --- >> infiniband-diags/src/perfquery.c | 10 +++++++++- >> libibmad/include/infiniband/mad.h | 1 + >> libibmad/src/fields.c | 1 + >> 3 files changed, 11 insertions(+), 1 deletions(-) >> >> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c >> index 7a53e92..4166fff 100644 >> --- a/infiniband-diags/src/perfquery.c >> +++ b/infiniband-diags/src/perfquery.c >> @@ -68,6 +68,7 @@ struct perf_count { >> uint32_t rcvdata; >> uint32_t xmtpkts; >> uint32_t rcvpkts; >> + uint32_t xmtwait; >> }; >> >> struct perf_count_ext { >> @@ -210,6 +211,8 @@ static void aggregate_perfcounters(void) >> aggregate_32bit(&perf_count.xmtpkts, val); >> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); >> aggregate_32bit(&perf_count.rcvpkts, val); >> + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); >> + aggregate_32bit(&perf_count.xmtwait, val); >> } > > Should XMT_WAIT support be added to output_aggregate_perfcounters(), > reset and other places too? reset is not supported by the firmware (at the moment). need to add xmitwait to output_aggregate_perfcounters() as well. 
> >> >> static void output_aggregate_perfcounters(ib_portid_t *portid) >> @@ -299,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p >> if (extended != 1) { >> if (!port_performance_query(pc, portid, port, timeout)) >> IBERROR("perfquery"); >> + if (!(cap_mask & 0x1000)) { >> + /* if PortCounters:PortXmitWait not suppported clear this counter */ >> + perf_count.xmtwait = 0; >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >> + } >> if (aggregate) >> aggregate_perfcounters(); >> - else >> + else >> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); >> } else { >> if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ >> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h >> index c2ad148..6c313f9 100644 >> --- a/libibmad/include/infiniband/mad.h >> +++ b/libibmad/include/infiniband/mad.h >> @@ -413,6 +413,7 @@ enum MAD_FIELDS { >> IB_PC_RCV_BYTES_F, >> IB_PC_XMT_PKTS_F, >> IB_PC_RCV_PKTS_F, >> + IB_PC_XMT_WAIT_F, >> IB_PC_LAST_F, >> >> /* > > Basically I'm fine to have two separate patches - one to support > XMT_WAIT in libibmad and another one for perfquery, this is a minor > although. it is all part of the same change. 
> > Sasha > >> diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c >> index 6942e85..116e432 100644 >> --- a/libibmad/src/fields.c >> +++ b/libibmad/src/fields.c >> @@ -247,6 +247,7 @@ ib_field_t ib_mad_f [] = { >> [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, >> [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, >> [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, >> + [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, >> >> /* >> * SMInfo >> -- >> 1.5.5 >> From john.russo at qlogic.com Mon Jan 5 08:31:14 2009 From: john.russo at qlogic.com (John Russo) Date: Mon, 5 Jan 2009 10:31:14 -0600 Subject: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> Message-ID: Another suggestion for 1.5 Implementation of SA queries for Path Records (using IBTA 1.2.1 ServiceId field) in all OFED ULPs, especially for MPI The IBTA standard defines that the proper way to establish a connection is to get a PathRecord from the SM/SA and use it to define all the attributes of the communication path. Ideally the IBTA CM should then be used to establish the connection and QPs as well. At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, but instead hard code attributes like the PKey, SL, etc. In some cases these hardcoded values can be overridden by configurable values such as PKey and SL, but such values must be uniform across all connections and must be provided per job (which can be error prone/tedious). At present opensm supports PKeys and SLs, however MPI cannot easily use these features. Other features, such as lash routing, in opensm do not work properly with MPI because the SL must be uniform across all connections, but for lash it will vary per route. 
Additionally, applications which do not use PathRecords will have difficulties with advanced features like IB routing, partitioning, etc. All of which are available or being worked on in opensm. ________________________________ From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Tziporet Koren Sent: Monday, January 05, 2009 1:00 AM To: ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ewg] Agenda for the OFED meeting today (Jan 5, 09) Hello all, I hope we all had nice holidays and vacations, and now it's the time to get back to business. Agenda for OFED meeting today: 1. Conclusions from OFED 1.4 release: Open discussion 2. Do we wish to have OFED 1.4.: Please send pros & cons before the meeting 3. OFED 1.5: Schedule and features. This is what we presented in SC08 about 1.5: Preliminary Schedule: * Feature Freeze: 3/20/09 * Alpha Release: 3/20/09 * Beta Release: 4/20/09 * RC1: 5/5/09 * RC2-RCx: About every 2 weeks as needed * Release: June 2009 Features: * Kernel.org: 2.6.28 and 2.6.29 * Multiple Event Queues to support Multi-core CPUs * NFS/RDMA - GA * RDS support for iWARP * OpenMPI 1.3 * Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11 * Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device * more TBD... We also presented the OS matrix but I suggest we will close this in the next meeting. My proposal: * Have the release in July and not June - so we will have more time for development * Stick to one kernel version base and not change in the middle since we saw that changing the kernel base caused a delay. We need to decide in the meeting if it is 2.6.29 or we should wait for 2.6.30. * Add IB over Eth - this is similar to iWARP but more like IB (e.g. including UD), and can work over ConnectX. Please send your suggestions to the list before the meeting if possible Tziporet -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jsquyres at cisco.com Mon Jan 5 08:39:03 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 5 Jan 2009 11:39:03 -0500 Subject: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> Message-ID: Would all MPI processes need to query the SM for each path that they want to use in a QP? On Jan 5, 2009, at 11:31 AM, John Russo wrote: > Another suggestion for 1.5 > > Implementation of SA queries for Path Records (using IBTA 1.2.1 > ServiceId field) in all OFED ULPs, especially for MPI > The IBTA standard defines that the proper way to > establish a connection is to get a PathRecord from the SM/SA and use > it to define all the attributes of the communication path. > Ideally the IBTA CM should then be used to establish the connection > and QPs as well. > > At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, > but instead hard code attributes like the PKey, SL, etc. > In some cases these hardcoded values can be overridden by > configurable values such as PKey and SL, but such values must be > uniform across all connections and must be provided per job (which > can be error prone/tedious). > > At present opensm supports PKeys and SLs, however MPI > cannot easily use these features. > Other features, such as lash routing, in opensm do not work properly > with MPI because the SL must be uniform across all connections, but > for lash it will vary per route. > > Additionally, applications which do not use PathRecords will have > difficulties with advanced features like IB routing, partitioning, > etc. All of which are available or being worked on in opensm. 
> > From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org > ] On Behalf Of Tziporet Koren > Sent: Monday, January 05, 2009 1:00 AM > To: ewg at lists.openfabrics.org > Cc: general at lists.openfabrics.org > Subject: [ewg] Agenda for the OFED meeting today (Jan 5, 09) > > > Hello all, > > I hope we all had nice holidays and vacations, and now it's the time > to get back to business. > > Agenda for OFED meeting today: > > 1. Conclusions from OFED 1.4 release: Open discussion > > 2. Do we wish to have OFED 1.4.: Please send pros & cons before the > meeting > > 3. OFED 1.5: Schedule and features. > > This is what we presented in SC08 about 1.5: > > Preliminary Schedule: > > * Feature Freeze: 3/20/09 > * Alpha Release: 3/20/09 > * Beta Release: 4/20/09 > * RC1: 5/5/09 > * RC2-RCx: About every 2 weeks as needed > * Release: June 2009 > Features: > > * Kernel.org: 2.6.28 and 2.6.29 > * Multiple Event Queues to support Multi-core CPUs > * NFS/RDMA - GA > * RDS support for iWARP > * OpenMPI 1.3 > * Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11 > * Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device > * more TBD... > > We also presented the OS matrix but I suggest we will close this in > the next meeting. > > My proposal: > > * Have the release in July and not June - so we will have more time > for development > * Stick to one kernel version base and not change in the middle > since we saw that changing the kernel base caused a delay. > We need to decide in the meeting if it is 2.6.29 or we should wait > for 2.6.30. > * Add IB over Eth - this is similar to iWARP but more like IB (e.g. > including UD), and can work over ConnectX. 
> > Please send your suggestions to the list before the meeting if > possible > > Tziporet > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jeff Squyres Cisco Systems From swise at opengridcomputing.com Mon Jan 5 08:43:21 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 05 Jan 2009 10:43:21 -0600 Subject: [ofa-general] ConnectX hca problem Message-ID: <496238A9.5010605@opengridcomputing.com> Mellanox experts: I'm having problems with one of my ConnectX cards. When I load mlx4_core, it logs the following error. Any suggestions on how to proceed? Steve ------- mlx4_core: Mellanox ConnectX core driver v0.01 (May 1, 2007) mlx4_core: Initializing 0000:0c:00.0 mlx4_core 0000:0c:00.0: PCI INT A -> GSI 33 (level, low) -> IRQ 33 mlx4_core 0000:0c:00.0: setting latency timer to 64 mlx4_core 0000:0c:00.0: RUN_FW command failed, aborting. mlx4_core 0000:0c:00.0: Failed to start FW, aborting. mlx4_core 0000:0c:00.0: PCI INT A disabled mlx4_core: probe of 0000:0c:00.0 failed with error -22 ------ Here is the lspci info: 0c:00.0 InfiniBand: Mellanox Technologies Unknown device 673c (rev a0) Subsystem: Mellanox Technologies Unknown device 673c Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> Message-ID: When using routing algorithms such as lash, the SL used per end to end connection will vary based on the route. lash uses multiple VLs to avoid credit loops. As such the SL reported will vary based on fabric topology and which pair of end nodes the path is being requested on behalf of. 
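The point about lash can be illustrated with a small sketch. This is only an illustration under stated assumptions, not code from OpenSM or from any MPI implementation: the struct path_entry type and the path_lookup() and sl_is_uniform() helpers are hypothetical names, and the table is assumed to have been filled from PathRecord responses by some other code. It shows a per-destination lookup of the SL and P_Key, and a check for whether a single job-wide SL setting could stand in for the table at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache entry: one resolved path per destination port. */
struct path_entry {
    uint16_t dlid;  /* destination LID of the remote HCA port */
    uint8_t  sl;    /* service level reported for this path */
    uint16_t pkey;  /* partition key reported for this path */
};

/* Return the entry for dlid, or NULL if no path has been resolved yet. */
const struct path_entry *path_lookup(const struct path_entry *tbl,
                                     size_t n, uint16_t dlid)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].dlid == dlid)
            return &tbl[i];
    return NULL;
}

/*
 * Return 1 if every cached path carries the same SL, i.e. a single
 * job-wide SL setting would be sufficient; return 0 otherwise.
 */
int sl_is_uniform(const struct path_entry *tbl, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (tbl[i].sl != tbl[0].sl)
            return 0;
    return 1;
}
```

With a routing engine that assigns SLs per route, sl_is_uniform() would be expected to return 0 on most fabrics, which is the case a hard-coded SL cannot express.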
-----Original Message----- From: Jeff Squyres [mailto:jsquyres at cisco.com] Sent: Monday, January 05, 2009 11:39 AM To: John Russo Cc: Tziporet Koren; ewg at lists.openfabrics.org; general at lists.openfabrics.org Subject: Re: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) Would all MPI processes need to query the SM for each path that they want to use in a QP? On Jan 5, 2009, at 11:31 AM, John Russo wrote: > Another suggestion for 1.5 > > Implementation of SA queries for Path Records (using IBTA 1.2.1 > ServiceId field) in all OFED ULPs, especially for MPI > The IBTA standard defines that the proper way to > establish a connection is to get a PathRecord from the SM/SA and use > it to define all the attributes of the communication path. > Ideally the IBTA CM should then be used to establish the connection > and QPs as well. > > At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, > but instead hard code attributes like the PKey, SL, etc. > In some cases these hardcoded values can be overridden by > configurable values such as PKey and SL, but such values must be > uniform across all connections and must be provided per job (which > can be error prone/tedious). > > At present opensm supports PKeys and SLs, however MPI > cannot easily use these features. > Other features, such as lash routing, in opensm do not work properly > with MPI because the SL must be uniform across all connections, but > for lash it will vary per route. > > Additionally, applications which do not use PathRecords will have > difficulties with advanced features like IB routing, partitioning, > etc. All of which are available or being worked on in opensm. 
> > From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org > ] On Behalf Of Tziporet Koren > Sent: Monday, January 05, 2009 1:00 AM > To: ewg at lists.openfabrics.org > Cc: general at lists.openfabrics.org > Subject: [ewg] Agenda for the OFED meeting today (Jan 5, 09) > > > Hello all, > > I hope we all had nice holidays and vacations, and now it's the time > to get back to business. > > Agenda for OFED meeting today: > > 1. Conclusions from OFED 1.4 release: Open discussion > > 2. Do we wish to have OFED 1.4.: Please send pros & cons before the > meeting > > 3. OFED 1.5: Schedule and features. > > This is what we presented in SC08 about 1.5: > > Preliminary Schedule: > > * Feature Freeze: 3/20/09 > * Alpha Release: 3/20/09 > * Beta Release: 4/20/09 > * RC1: 5/5/09 > * RC2-RCx: About every 2 weeks as needed > * Release: June 2009 > Features: > > * Kernel.org: 2.6.28 and 2.6.29 > * Multiple Event Queues to support Multi-core CPUs > * NFS/RDMA - GA > * RDS support for iWARP > * OpenMPI 1.3 > * Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11 > * Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device > * more TBD... > > We also presented the OS matrix but I suggest we will close this in > the next meeting. > > My proposal: > > * Have the release in July and not June - so we will have more time > for development > * Stick to one kernel version base and not change in the middle > since we saw that changing the kernel base caused a delay. > We need to decide in the meeting if it is 2.6.29 or we should wait > for 2.6.30. > * Add IB over Eth - this is similar to iWARP but more like IB (e.g. > including UD), and can work over ConnectX. 
> > Please send your suggestions to the list before the meeting if > possible > > Tziporet > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Mon Jan 5 09:06:56 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 5 Jan 2009 12:06:56 -0500 Subject: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> Message-ID: <5FD600CA-26E4-4D4C-B31D-574D79D56D40@cisco.com> Hmm. Perhaps I'm not grokking your answer -- did you answer my question? I'm indirectly asking about scalability of the SM to have hundreds/thousands of MPI processes simultaneously querying the SM. On Jan 5, 2009, at 11:47 AM, John Russo wrote: > When using routing algorithms such as lash, the SL used per end to > end connection will vary based on the route. lash uses multiple VLs > to avoid credit loops. As such the SL reported will vary based on > fabric topology and which pair of end nodes the path is being > requested on behalf of. > > -----Original Message----- > From: Jeff Squyres [mailto:jsquyres at cisco.com] > Sent: Monday, January 05, 2009 11:39 AM > To: John Russo > Cc: Tziporet Koren; ewg at lists.openfabrics.org; general at lists.openfabrics.org > Subject: Re: [ofa-general] RE: Agenda for the OFED meeting today > (Jan 5, 09) > > Would all MPI processes need to query the SM for each path that they > want to use in a QP? 
> > > On Jan 5, 2009, at 11:31 AM, John Russo wrote: > >> Another suggestion for 1.5 >> >> Implementation of SA queries for Path Records (using IBTA 1.2.1 >> ServiceId field) in all OFED ULPs, especially for MPI >> The IBTA standard defines that the proper way to >> establish a connection is to get a PathRecord from the SM/SA and use >> it to define all the attributes of the communication path. >> Ideally the IBTA CM should then be used to establish the connection >> and QPs as well. >> >> At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, >> but instead hard code attributes like the PKey, SL, etc. >> In some cases these hardcoded values can be overridden by >> configurable values such as PKey and SL, but such values must be >> uniform across all connections and must be provided per job (which >> can be error prone/tedious). >> >> At present opensm supports PKeys and SLs, however MPI >> cannot easily use these features. >> Other features, such as lash routing, in opensm do not work properly >> with MPI because the SL must be uniform across all connections, but >> for lash it will vary per route. >> >> Additionally, applications which do not use PathRecords will have >> difficulties with advanced features like IB routing, partitioning, >> etc. All of which are available or being worked on in opensm. >> >> From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org >> ] On Behalf Of Tziporet Koren >> Sent: Monday, January 05, 2009 1:00 AM >> To: ewg at lists.openfabrics.org >> Cc: general at lists.openfabrics.org >> Subject: [ewg] Agenda for the OFED meeting today (Jan 5, 09) >> >> >> Hello all, >> >> I hope we all had nice holidays and vacations, and now it's the time >> to get back to business. >> >> Agenda for OFED meeting today: >> >> 1. Conclusions from OFED 1.4 release: Open discussion >> >> 2. Do we wish to have OFED 1.4.: Please send pros & cons before the >> meeting >> >> 3. OFED 1.5: Schedule and features. 
>> >> This is what we presented in SC08 about 1.5: >> >> Preliminary Schedule: >> >> * Feature Freeze: 3/20/09 >> * Alpha Release: 3/20/09 >> * Beta Release: 4/20/09 >> * RC1: 5/5/09 >> * RC2-RCx: About every 2 weeks as needed >> * Release: June 2009 >> Features: >> >> * Kernel.org: 2.6.28 and 2.6.29 >> * Multiple Event Queues to support Multi-core CPUs >> * NFS/RDMA - GA >> * RDS support for iWARP >> * OpenMPI 1.3 >> * Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11 >> * Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device >> * more TBD... >> >> We also presented the OS matrix but I suggest we will close this in >> the next meeting. >> >> My proposal: >> >> * Have the release in July and not June - so we will have more time >> for development >> * Stick to one kernel version base and not change in the middle >> since we saw that changing the kernel base caused a delay. >> We need to decide in the meeting if it is 2.6.29 or we should wait >> for 2.6.30. >> * Add IB over Eth - this is similar to iWARP but more like IB (e.g. >> including UD), and can work over ConnectX. >> >> Please send your suggestions to the list before the meeting if >> possible >> >> Tziporet >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > -- > Jeff Squyres > Cisco Systems > -- Jeff Squyres Cisco Systems From john.russo at qlogic.com Mon Jan 5 09:17:41 2009 From: john.russo at qlogic.com (John Russo) Date: Mon, 5 Jan 2009 11:17:41 -0600 Subject: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: <5FD600CA-26E4-4D4C-B31D-574D79D56D40@cisco.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> <5FD600CA-26E4-4D4C-B31D-574D79D56D40@cisco.com> Message-ID: Sorry Jeff... 
I am playing "middle man" with another engineer here for this information... The query does not have to be per QP, but it does have to be per communication path between a pair of IB HCA ports. For example, if a node has 64 CPUs, it could do the query once for each other node on behalf of all 64 processes. It's still an N^2 set of queries, but at least N can be reduced to the number of end-node IB ports as opposed to the number of processes. -----Original Message----- From: Jeff Squyres [mailto:jsquyres at cisco.com] Sent: Monday, January 05, 2009 12:07 PM To: John Russo Cc: Tziporet Koren; ewg at lists.openfabrics.org; general at lists.openfabrics.org Subject: Re: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) Hmm. Perhaps I'm not grokking your answer -- did you answer my question? I'm indirectly asking about scalability of the SM to have hundreds/thousands of MPI processes simultaneously querying the SM. On Jan 5, 2009, at 11:47 AM, John Russo wrote: > When using routing algorithms such as lash, the SL used per end to > end connection will vary based on the route. lash uses multiple VLs > to avoid credit loops. As such the SL reported will vary based on > fabric topology and which pair of end nodes the path is being > requested on behalf of. > > -----Original Message----- > From: Jeff Squyres [mailto:jsquyres at cisco.com] > Sent: Monday, January 05, 2009 11:39 AM > To: John Russo > Cc: Tziporet Koren; ewg at lists.openfabrics.org; general at lists.openfabrics.org > Subject: Re: [ofa-general] RE: Agenda for the OFED meeting today > (Jan 5, 09) > > Would all MPI processes need to query the SM for each path that they > want to use in a QP? 
> > > On Jan 5, 2009, at 11:31 AM, John Russo wrote: > >> Another suggestion for 1.5 >> >> Implementation of SA queries for Path Records (using IBTA 1.2.1 >> ServiceId field) in all OFED ULPs, especially for MPI >> The IBTA standard defines that the proper way to >> establish a connection is to get a PathRecord from the SM/SA and use >> it to define all the attributes of the communication path. >> Ideally the IBTA CM should then be used to establish the connection >> and QPs as well. >> >> At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, >> but instead hard code attributes like the PKey, SL, etc. >> In some cases these hardcoded values can be overridden by >> configurable values such as PKey and SL, but such values must be >> uniform across all connections and must be provided per job (which >> can be error prone/tedious). >> >> At present opensm supports PKeys and SLs, however MPI >> cannot easily use these features. >> Other features, such as lash routing, in opensm do not work properly >> with MPI because the SL must be uniform across all connections, but >> for lash it will vary per route. >> >> Additionally, applications which do not use PathRecords will have >> difficulties with advanced features like IB routing, partitioning, >> etc. All of which are available or being worked on in opensm. >> >> From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org >> ] On Behalf Of Tziporet Koren >> Sent: Monday, January 05, 2009 1:00 AM >> To: ewg at lists.openfabrics.org >> Cc: general at lists.openfabrics.org >> Subject: [ewg] Agenda for the OFED meeting today (Jan 5, 09) >> >> >> Hello all, >> >> I hope we all had nice holidays and vacations, and now it's the time >> to get back to business. >> >> Agenda for OFED meeting today: >> >> 1. Conclusions from OFED 1.4 release: Open discussion >> >> 2. Do we wish to have OFED 1.4.: Please send pros & cons before the >> meeting >> >> 3. OFED 1.5: Schedule and features. 
>> >> This is what we presented in SC08 about 1.5: >> >> Preliminary Schedule: >> >> * Feature Freeze: 3/20/09 >> * Alpha Release: 3/20/09 >> * Beta Release: 4/20/09 >> * RC1: 5/5/09 >> * RC2-RCx: About every 2 weeks as needed >> * Release: June 2009 >> Features: >> >> * Kernel.org: 2.6.28 and 2.6.29 >> * Multiple Event Queues to support Multi-core CPUs >> * NFS/RDMA - GA >> * RDS support for iWARP >> * OpenMPI 1.3 >> * Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11 >> * Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device >> * more TBD... >> >> We also presented the OS matrix but I suggest we will close this in >> the next meeting. >> >> My proposal: >> >> * Have the release in July and not June - so we will have more time >> for development >> * Stick to one kernel version base and not change in the middle >> since we saw that changing the kernel base caused a delay. >> We need to decide in the meeting if it is 2.6.29 or we should wait >> for 2.6.30. >> * Add IB over Eth - this is similar to iWARP but more like IB (e.g. >> including UD), and can work over ConnectX. 
>> >> Please send your suggestions to the list before the meeting if >> possible >> >> Tziporet >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > -- > Jeff Squyres > Cisco Systems > -- Jeff Squyres Cisco Systems From jgunthorpe at obsidianresearch.com Mon Jan 5 10:42:19 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 5 Jan 2009 11:42:19 -0700 Subject: [ofa-general] [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <49620157.50003@gmail.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> Message-ID: <20090105184219.GN31213@obsidianresearch.com> On Mon, Jan 05, 2009 at 02:47:19PM +0200, Eli Dorfman (Voltaire) wrote: > > Should XMT_WAIT support be added to output_aggregate_perfcounters(), > > reset and other places too? > > reset is not supported by the firmware (at the moment). > need to add xmitwait to output_aggregate_perfcounters() as well. Yeah, but other devices that support this counter do support reset :) Jason From arlin.r.davis at intel.com Mon Jan 5 11:01:22 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Mon, 5 Jan 2009 11:01:22 -0800 Subject: [ofa-general] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> Message-ID: There are scaling issues with SA path-record queries. We attempted to be good citizens with Intel MPI using the rdma_cm agent (via uDAPL) but was forced to build hard-coded RC QP support in OFED 1.4 (uDAPL scm) to avoid the many scaling and configuration problems that came with IPoIB requirements, ARP storms, rdma_cm timers, and SA path record query/caching. 
If someone wants to sign up to design and implement a scalable SA query caching agent we would be happy to look at path record queries again. -arlin Another suggestion for 1.5 Implementation of SA queries for Path Records (using IBTA 1.2.1 ServiceId field) in all OFED ULPs, especially for MPI The IBTA standard defines that the proper way to establish a connection is to get a PathRecord from the SM/SA and use it to define all the attributes of the communication path. Ideally the IBTA CM should then be used to establish the connection and QPs as well. At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, but instead hard code attributes like the PKey, SL, etc. In some cases these hardcoded values can be overridden by configurable values such as PKey and SL, but such values must be uniform across all connections and must be provided per job (which can be error prone/tedious). At present opensm supports PKeys and SLs, however MPI cannot easily use these features. Other features, such as lash routing, in opensm do not work properly with MPI because the SL must be uniform across all connections, but for lash it will vary per route. Additionally, applications which do not use PathRecords will have difficulties with advanced features like IB routing, partitioning, etc. All of which are available or being worked on in opensm. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hal.rosenstock at gmail.com Mon Jan 5 12:01:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 5 Jan 2009 15:01:56 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <4950E07F.6090104@gmail.com> References: <4950E07F.6090104@gmail.com> Message-ID: Eli, On Tue, Dec 23, 2008 at 7:58 AM, Eli Dorfman (Voltaire) wrote: > Add support for PortXmitWait counter > Show PortCounters::PortXmitWait when this capability is supported by the firmware. 
> If not supported show this counter as 0. IMO better would be to either not show this counter when not supported so this "difference" can be seen easier or indicate that it's not valid. The latter approach is probably easier to implement. -- Hal From hal.rosenstock at gmail.com Mon Jan 5 12:03:55 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 5 Jan 2009 15:03:55 -0500 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <49620157.50003@gmail.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> Message-ID: Eli, On Mon, Jan 5, 2009 at 7:47 AM, Eli Dorfman (Voltaire) wrote: > Sasha Khapyorsky wrote: >> Hi Eli, >> >> On 14:58 Tue 23 Dec , Eli Dorfman (Voltaire) wrote: >>> Add support for PortXmitWait counter >>> Show PortCounters::PortXmitWait when this capability is supported by the firmware. >>> If not supported show this counter as 0. >>> >>> Signed-off-by: Eli Dorfman >>> --- >>> infiniband-diags/src/perfquery.c | 10 +++++++++- >>> libibmad/include/infiniband/mad.h | 1 + >>> libibmad/src/fields.c | 1 + >>> 3 files changed, 11 insertions(+), 1 deletions(-) >>> >>> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c >>> index 7a53e92..4166fff 100644 >>> --- a/infiniband-diags/src/perfquery.c >>> +++ b/infiniband-diags/src/perfquery.c >>> @@ -68,6 +68,7 @@ struct perf_count { >>> uint32_t rcvdata; >>> uint32_t xmtpkts; >>> uint32_t rcvpkts; >>> + uint32_t xmtwait; >>> }; >>> >>> struct perf_count_ext { >>> @@ -210,6 +211,8 @@ static void aggregate_perfcounters(void) >>> aggregate_32bit(&perf_count.xmtpkts, val); >>> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); >>> aggregate_32bit(&perf_count.rcvpkts, val); >>> + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); >>> + aggregate_32bit(&perf_count.xmtwait, val); >>> } >> >> Should XMT_WAIT support be added to output_aggregate_perfcounters(), >> reset 
and other places too? > > reset is not supported by the firmware (at the moment). What is the firmware response to a reset ? -- Hal > need to add xmitwait to output_aggregate_perfcounters() as well. > >> >>> >>> static void output_aggregate_perfcounters(ib_portid_t *portid) >>> @@ -299,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p >>> if (extended != 1) { >>> if (!port_performance_query(pc, portid, port, timeout)) >>> IBERROR("perfquery"); >>> + if (!(cap_mask & 0x1000)) { >>> + /* if PortCounters:PortXmitWait not suppported clear this counter */ >>> + perf_count.xmtwait = 0; >>> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >>> + } >>> if (aggregate) >>> aggregate_perfcounters(); >>> - else >>> + else >>> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); >>> } else { >>> if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ >>> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h >>> index c2ad148..6c313f9 100644 >>> --- a/libibmad/include/infiniband/mad.h >>> +++ b/libibmad/include/infiniband/mad.h >>> @@ -413,6 +413,7 @@ enum MAD_FIELDS { >>> IB_PC_RCV_BYTES_F, >>> IB_PC_XMT_PKTS_F, >>> IB_PC_RCV_PKTS_F, >>> + IB_PC_XMT_WAIT_F, >>> IB_PC_LAST_F, >>> >>> /* >> >> Basically I'm fine to have two separate patches - one to support >> XMT_WAIT in libibmad and another one for perfquery, this is a minor >> although. > > it is all part of the same change. 
> >> >> Sasha >> >>> diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c >>> index 6942e85..116e432 100644 >>> --- a/libibmad/src/fields.c >>> +++ b/libibmad/src/fields.c >>> @@ -247,6 +247,7 @@ ib_field_t ib_mad_f [] = { >>> [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, >>> [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, >>> [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, >>> + [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, >>> >>> /* >>> * SMInfo >>> -- >>> 1.5.5 >>> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Mon Jan 5 12:21:17 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 5 Jan 2009 15:21:17 -0500 Subject: [ofa-general] infiniband on debian In-Reply-To: <200901041603447034776@163.com> References: <200901041603447034776@163.com> Message-ID: On Sun, Jan 4, 2009 at 3:03 AM, songzhonglei wrote: > > hi, > i wanted to enable ipoib on debian,so recompiled kernel with infiniband > supported. > after reboot i modprobe ib_ipoib > lsmod : ib_ipoib,ib_sa,ib_mthca,ib_mad,ib_core. > > and then ifconfig ib0 x.x.x.x.->ifconfig ib0 up,but failed. > dmesg showed: > ADDRCONF(NETDEV_UP):ib0:link is not ready > ib0:multicast join failed for > ff12:401b:ffff:0000:0000............................,status-22 Does ..... mean more 0000s and 0000s only ? Do you have an SM running somewhere in your subnet ? If so, what SM ? -- Hal > what's wrong?and what i should do if i want to use ipoib on debian? > > thank you. 
> 2009-01-04 > ________________________________ > songzhonglei > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From jsquyres at cisco.com Mon Jan 5 12:26:36 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 5 Jan 2009 15:26:36 -0500 Subject: [ofa-general] Re: [ewg] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> Message-ID: <7B9683B6-C37A-4143-A9FD-456686CD448A@cisco.com> I chatted with John and Todd from QL on the phone today -- we basically came to the same conclusion: - need to beef-up opensm to be able to scalably handle lots of incoming path record lookups - need to beef-up the CM clients on the host (maybe; this work might already be done?) - need to see the current status of the SA caching stuff / re-open that discussion to see if the work can be completed, etc. It might also be worthwhile to start a whole new discussion about making a better CM (at least from the ULP perspective). One that offers simple mechanisms for those who don't need/care about the details, but also offers complex/detailed mechanisms (perhaps remarkably like today's mechanisms). On Jan 5, 2009, at 2:01 PM, Davis, Arlin R wrote: > > There are scaling issues with SA path-record queries. We attempted > to be good citizens with Intel MPI using the rdma_cm agent (via > uDAPL) but was forced to build hard-coded RC QP support in OFED 1.4 > (uDAPL scm) to avoid the many scaling and configuration problems > that came with IPoIB requirements, ARP storms, rdma_cm timers, and > SA path record query/caching. > If someone wants to sign up to design and implement a scalable SA > query caching agent we would be happy to look at path record queries > again. 
> > -arlin > > Another suggestion for 1.5 > > Implementation of SA queries for Path Records (using IBTA 1.2.1 > ServiceId field) in all OFED ULPs, especially for MPI > The IBTA standard defines that the proper way to > establish a connection is to get a PathRecord from the SM/SA and use > it to define all the attributes of the communication path. > Ideally the IBTA CM should then be used to establish the connection > and QPs as well. > > At present, openmpi, mvapich1 and mvapich2 do not use PathRecords, > but instead hard code attributes like the PKey, SL, etc. > In some cases these hardcoded values can be overridden by > configurable values such as PKey and SL, but such values must be > uniform across all connections and must be provided per job (which > can be error prone/tedious). > > At present opensm supports PKeys and SLs, however MPI > cannot easily use these features. > Other features, such as lash routing, in opensm do not work properly > with MPI because the SL must be uniform across all connections, but > for lash it will vary per route. > > Additionally, applications which do not use PathRecords will have > difficulties with advanced features like IB routing, partitioning, > etc. All of which are available or being worked on in opensm. > > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Jeff Squyres Cisco Systems From shemminger at vyatta.com Mon Jan 5 12:14:49 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Mon, 05 Jan 2009 12:14:49 -0800 Subject: [ofa-general] [PATCH 7/8] infiniband: driver API update References: <20090105201442.749889072@vyatta.com> Message-ID: <20090105201514.828840735@vyatta.com> An embedded and charset-unspecified text was scrubbed... 
Name: ib.patch
URL:

From rdreier at cisco.com Mon Jan 5 13:02:15 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 05 Jan 2009 13:02:15 -0800
Subject: [ofa-general] Re: [PATCH 7/8] infiniband: driver API update
In-Reply-To: <20090105201514.828840735@vyatta.com> (Stephen Hemminger's message of "Mon, 05 Jan 2009 12:14:49 -0800")
References: <20090105201442.749889072@vyatta.com> <20090105201514.828840735@vyatta.com>
Message-ID:

Looks good enough, so Dave if you want to merge it, you can add

Acked-by: Roland Dreier

or let me know if you want me to pull it in through my tree.

A couple of nits that are probably not worth fixing. First, globally, it might be slightly nicer to merge this as one patch per module, rather than all lumped together. And also:

 > +static const struct net_device_ops c2_netdev_ops = {
 > +	.ndo_open		= c2_pseudo_up,
 > +	.ndo_stop		= c2_pseudo_down,
 > +	.ndo_start_xmit		= c2_pseudo_xmit_frame,
 > +	.ndo_change_mtu		= c2_pseudo_change_mtu,
 > +};
 > +
 > +

would have preferred only one empty line here.

 > @@ -735,7 +737,6 @@ static void setup(struct net_device *net
 >  	netdev->addr_len = ETH_ALEN;
 >  	netdev->tx_queue_len = 0;
 >  	netdev->flags |= IFF_NOARP;
 > -	return;
 >  }
 >
 >  static struct net_device *c2_pseudo_netdev_init(struct c2_dev *c2dev)

would have preferred to leave out unrelated changes.

 > +static const struct net_device_ops nes_netdev_ops = {
 > +	.ndo_open		= nes_netdev_open,
 > +	.ndo_stop		= nes_netdev_stop,
 > +	.ndo_start_xmit		= nes_netdev_start_xmit,
 > +	.ndo_get_stats		= nes_netdev_get_stats,
 > +	.ndo_tx_timeout		= nes_netdev_tx_timeout,
 > +	.ndo_validate_addr	= eth_validate_addr,
 > +	.ndo_set_mac_address	= nes_netdev_set_mac_address,
 > +	.ndo_set_multicast_list	= nes_netdev_set_multicast_list,
 > +	.ndo_change_mtu		= nes_netdev_change_mtu,
 > +	.ndo_vlan_rx_register	= nes_netdev_vlan_rx_register,
 > +};
 > +
 >
 >  /**

extra blank line here too.

 - R.
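
For readers following the review above: the patch moves each driver from assigning callbacks one by one on the device to pointing the device at a single shared, read-only table of callbacks. The block below is only a userspace sketch of that pattern, not kernel code. Every toy_* name is hypothetical, and the minimum MTU of 68 is an arbitrary assumption made for the illustration.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct toy_netdev;

/* One table of callbacks, analogous in shape to a net_device_ops table. */
struct toy_netdev_ops {
	int (*ndo_open)(struct toy_netdev *dev);
	int (*ndo_stop)(struct toy_netdev *dev);
	int (*ndo_change_mtu)(struct toy_netdev *dev, int new_mtu);
};

struct toy_netdev {
	const struct toy_netdev_ops *netdev_ops; /* single pointer per device */
	int mtu;
	int up;
};

static int toy_open(struct toy_netdev *dev)
{
	dev->up = 1;
	return 0;
}

static int toy_stop(struct toy_netdev *dev)
{
	dev->up = 0;
	return 0;
}

static int toy_change_mtu(struct toy_netdev *dev, int new_mtu)
{
	if (new_mtu < 68)	/* assumed lower bound, illustration only */
		return -1;
	dev->mtu = new_mtu;
	return 0;
}

/* Shared and const: every device of this type points at the same table. */
static const struct toy_netdev_ops toy_ops = {
	.ndo_open	= toy_open,
	.ndo_stop	= toy_stop,
	.ndo_change_mtu	= toy_change_mtu,
};

/* Setup shrinks to one assignment instead of one per callback. */
void toy_setup(struct toy_netdev *dev)
{
	dev->netdev_ops = &toy_ops;
	dev->mtu = 1500;
	dev->up = 0;
}
```

A caller would run toy_setup(&dev) once and then dispatch through the table, for example dev.netdev_ops->ndo_open(&dev). Callbacks that are left out of the initializer are simply NULL, which is why the converted setup functions no longer need explicit NULL assignments.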
From shemminger at vyatta.com Mon Jan 5 13:23:48 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Mon, 5 Jan 2009 13:23:48 -0800 Subject: [ofa-general] [PATCH 3/3] infiniband: ipoib convert to net_device_ops In-Reply-To: <20090105132254.052633d7@extreme> References: <20090105201442.749889072@vyatta.com> <20090105201514.828840735@vyatta.com> <20090105132211.70baefb9@extreme> <20090105132254.052633d7@extreme> Message-ID: <20090105132348.7cac98cb@extreme> Signed-off-by: Stephen Hemminger --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2009-01-05 13:14:20.970318942 -0800 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2009-01-05 13:19:46.657819176 -0800 @@ -1013,18 +1013,22 @@ static void ipoib_lro_setup(struct ipoib priv->lro.lro_mgr.ip_summed_aggr = CHECKSUM_UNNECESSARY; } +static const struct net_device_ops ipoib_netdev_ops = { + .ndo_open = ipoib_open, + .ndo_stop = ipoib_stop, + .ndo_change_mtu = ipoib_change_mtu, + .ndo_start_xmit = ipoib_start_xmit, + .ndo_tx_timeout = ipoib_timeout, + .ndo_set_multicast_list = ipoib_set_mcast_list, + .ndo_neigh_setup = ipoib_neigh_setup_dev, +}; + static void ipoib_setup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - dev->open = ipoib_open; - dev->stop = ipoib_stop; - dev->change_mtu = ipoib_change_mtu; - dev->hard_start_xmit = ipoib_start_xmit; - dev->tx_timeout = ipoib_timeout; - dev->header_ops = &ipoib_header_ops; - dev->set_multicast_list = ipoib_set_mcast_list; - dev->neigh_setup = ipoib_neigh_setup_dev; + dev->netdev_ops = &ipoib_netdev_ops; + dev->header_ops = &ipoib_header_ops; ipoib_set_ethtool_ops(dev); From shemminger at vyatta.com Mon Jan 5 13:22:11 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Mon, 5 Jan 2009 13:22:11 -0800 Subject: [ofa-general] [PATCH 1/3] infiniband: amso100 convert to net_device_ops In-Reply-To: References: <20090105201442.749889072@vyatta.com> <20090105201514.828840735@vyatta.com> Message-ID: <20090105132211.70baefb9@extreme> Convert 
to net_device_ops and remove leftover last_rx. Signed-off-by: Stephen Hemminger --- a/drivers/infiniband/hw/amso1100/c2.c 2009-01-05 13:14:20.990323888 -0800 +++ b/drivers/infiniband/hw/amso1100/c2.c 2009-01-05 13:16:48.581570680 -0800 @@ -76,7 +76,6 @@ static irqreturn_t c2_interrupt(int irq, static void c2_tx_timeout(struct net_device *netdev); static int c2_change_mtu(struct net_device *netdev, int new_mtu); static void c2_reset(struct c2_port *c2_port); -static struct net_device_stats *c2_get_stats(struct net_device *netdev); static struct pci_device_id c2_pci_table[] = { { PCI_DEVICE(0x18b8, 0xb001) }, @@ -531,7 +530,6 @@ static void c2_rx_interrupt(struct net_d netif_rx(skb); - netdev->last_rx = jiffies; c2_port->netstats.rx_packets++; c2_port->netstats.rx_bytes += buflen; } @@ -880,6 +878,17 @@ static int c2_change_mtu(struct net_devi return ret; } +static const struct net_device_ops c2_netdev_ops = { + .ndo_open = c2_up, + .ndo_stop = c2_down, + .ndo_start_xmit = c2_xmit_frame, + .ndo_get_stats = c2_get_stats, + .ndo_tx_timeout = c2_tx_timeout, + .ndo_change_mtu = c2_change_mtu, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, +}; + /* Initialize network device */ static struct net_device *c2_devinit(struct c2_dev *c2dev, void __iomem * mmio_addr) @@ -894,12 +903,7 @@ static struct net_device *c2_devinit(str SET_NETDEV_DEV(netdev, &c2dev->pcidev->dev); - netdev->open = c2_up; - netdev->stop = c2_down; - netdev->hard_start_xmit = c2_xmit_frame; - netdev->get_stats = c2_get_stats; - netdev->tx_timeout = c2_tx_timeout; - netdev->change_mtu = c2_change_mtu; + netdev->netdev_ops = &c2_netdev_ops; netdev->watchdog_timeo = C2_TX_TIMEOUT; netdev->irq = c2dev->pcidev->irq; --- a/drivers/infiniband/hw/amso1100/c2_provider.c 2009-01-05 13:14:20.978323767 -0800 +++ b/drivers/infiniband/hw/amso1100/c2_provider.c 2009-01-05 13:17:09.911803913 -0800 @@ -719,15 +719,16 @@ static int c2_pseudo_change_mtu(struct n return ret; } +static const 
struct net_device_ops c2_netdev_ops = { + .ndo_open = c2_pseudo_up, + .ndo_stop = c2_pseudo_down, + .ndo_start_xmit = c2_pseudo_xmit_frame, + .ndo_change_mtu = c2_pseudo_change_mtu, +}; + static void setup(struct net_device *netdev) { - netdev->open = c2_pseudo_up; - netdev->stop = c2_pseudo_down; - netdev->hard_start_xmit = c2_pseudo_xmit_frame; - netdev->get_stats = NULL; - netdev->tx_timeout = NULL; - netdev->set_mac_address = NULL; - netdev->change_mtu = c2_pseudo_change_mtu; + netdev->netdev_ops = &c2_netdev_ops; netdev->watchdog_timeo = 0; netdev->type = ARPHRD_ETHER; netdev->mtu = 1500; From shemminger at vyatta.com Mon Jan 5 13:22:54 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Mon, 5 Jan 2009 13:22:54 -0800 Subject: [ofa-general] [PATCH 2/3] infiniband: nes_nic convert to net_device_ops In-Reply-To: <20090105132211.70baefb9@extreme> References: <20090105201442.749889072@vyatta.com> <20090105201514.828840735@vyatta.com> <20090105132211.70baefb9@extreme> Message-ID: <20090105132254.052633d7@extreme> Signed-off-by: Stephen Hemminger --- a/drivers/infiniband/hw/nes/nes_nic.c 2009-01-05 13:14:21.022323208 -0800 +++ b/drivers/infiniband/hw/nes/nes_nic.c 2009-01-05 13:18:32.554124317 -0800 @@ -1568,6 +1568,18 @@ static void nes_netdev_vlan_rx_register( spin_unlock_irqrestore(&nesadapter->phy_lock, flags); } +static const struct net_device_ops nes_netdev_ops = { + .ndo_open = nes_netdev_open, + .ndo_stop = nes_netdev_stop, + .ndo_start_xmit = nes_netdev_start_xmit, + .ndo_get_stats = nes_netdev_get_stats, + .ndo_tx_timeout = nes_netdev_tx_timeout, + .ndo_validate_addr = eth_validate_addr, + .ndo_set_mac_address = nes_netdev_set_mac_address, + .ndo_set_multicast_list = nes_netdev_set_multicast_list, + .ndo_change_mtu = nes_netdev_change_mtu, + .ndo_vlan_rx_register = nes_netdev_vlan_rx_register, +}; /** * nes_netdev_init - initialize network device @@ -1596,14 +1608,7 @@ struct net_device *nes_netdev_init(struc nesvnic = netdev_priv(netdev); 
memset(nesvnic, 0, sizeof(*nesvnic)); - netdev->open = nes_netdev_open; - netdev->stop = nes_netdev_stop; - netdev->hard_start_xmit = nes_netdev_start_xmit; - netdev->get_stats = nes_netdev_get_stats; - netdev->tx_timeout = nes_netdev_tx_timeout; - netdev->set_mac_address = nes_netdev_set_mac_address; - netdev->set_multicast_list = nes_netdev_set_multicast_list; - netdev->change_mtu = nes_netdev_change_mtu; + netdev->netdev_ops = &nes_netdev_ops; netdev->watchdog_timeo = NES_TX_TIMEOUT; netdev->irq = nesdev->pcidev->irq; netdev->mtu = ETH_DATA_LEN; @@ -1615,7 +1620,6 @@ struct net_device *nes_netdev_init(struc netif_napi_add(netdev, &nesvnic->napi, nes_netdev_poll, 128); nes_debug(NES_DBG_INIT, "Enabling VLAN Insert/Delete.\n"); netdev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; - netdev->vlan_rx_register = nes_netdev_vlan_rx_register; netdev->features |= NETIF_F_LLTX; /* Fill in the port structure */ From hal.rosenstock at gmail.com Mon Jan 5 14:16:52 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 5 Jan 2009 17:16:52 -0500 Subject: [ofa-general] Re: [ewg] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: <7B9683B6-C37A-4143-A9FD-456686CD448A@cisco.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> <7B9683B6-C37A-4143-A9FD-456686CD448A@cisco.com> Message-ID: Jeff, On Mon, Jan 5, 2009 at 3:26 PM, Jeff Squyres wrote: > I chatted with John and Todd from QL on the phone today -- we basically came > to the same conclusion: > > - need to beef-up opensm to be able to scalably handle lots of incoming path > record lookups This is the most obvious SA scalability issue but there are some others which may be important (related to SA caching rather than SA distribution as an approach). > - need to beef-up the CM clients on the host (maybe; this work might already > be done?) > - need to see the current status of the SA caching stuff / re-open that > discussion to see if the work can be completed, etc. 
IMO this will aggravate other SA scalability issues, and this approach has other limitations as well.

Don't get me wrong; I'm all for improving the SA scalability; there's no quick solution to this AFAIK.

It would be interesting to see an apples to apples comparison of OpenSM and proprietary SMs in terms of running on the same hardware and the transaction rate for various things.

I think this warrants an open discussion if people are serious about working on this issue.

> It might also be worthwhile to start a whole new discussion about making a
> better CM (at least from the ULP perspective). One that offers simple
> mechanisms for those who don't need/care about the details, but also offers
> complex/detailed mechanisms (perhaps remarkably like today's mechanisms).

I've heard similar comments before but this too will take significant wherewithal IMO.

-- Hal

From hal.rosenstock at gmail.com Mon Jan 5 14:19:40 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 5 Jan 2009 17:19:40 -0500
Subject: [ofa-general] [PATCH] ipoib: failure during startup wiith non-default pkey set.
In-Reply-To:
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624757@MNEXMB1.qlogic.org>
Message-ID:

On Sun, Jan 4, 2009 at 11:39 PM, Roland Dreier wrote:
> b) I understand the issue you're trying to fix, but thinking about this,
> it seems that rather than picking the first entry in the p_key table
> happens to be for the main IPoIB interface, it would be simpler to
> understand if we just always used P_Key 0xffff for the main interface

or PKey 0x7fff, right ?

-- Hal

> and let the user create whatever other interfaces desired for other
> P_Keys. Then there wouldn't be any race, and the situation would be
> easy to understand and manage.
>
> - R.
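
As background to the 0xffff versus 0x7fff remark above: a P_Key is 16 bits, where the top bit is the membership bit (1 = full member, 0 = limited member) and the low 15 bits are the partition's base value. So 0xffff and 0x7fff name the same default partition with different membership. The following is a small sketch of that arithmetic; the macro and function names are made up for the illustration and are not taken from any OFED header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not from an existing header. */
#define PKEY_MEMBERSHIP_BIT 0x8000u	/* 1 = full member, 0 = limited */
#define PKEY_BASE_MASK      0x7fffu	/* low 15 bits identify the partition */

static inline uint16_t pkey_base(uint16_t pkey)
{
	return pkey & PKEY_BASE_MASK;
}

static inline bool pkey_is_full_member(uint16_t pkey)
{
	return (pkey & PKEY_MEMBERSHIP_BIT) != 0;
}

/*
 * Two P_Keys allow communication when they name the same partition
 * and at least one side is a full member; two limited members of the
 * same partition cannot talk to each other.
 */
static inline bool pkey_match(uint16_t a, uint16_t b)
{
	if (pkey_base(a) != pkey_base(b))
		return false;
	return pkey_is_full_member(a) || pkey_is_full_member(b);
}
```

Under this sketch, pkey_match(0xffff, 0x7fff) holds while pkey_match(0x7fff, 0x7fff) does not, which is one reason a port holding only the limited default key behaves differently from one holding 0xffff.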
From jsquyres at cisco.com Mon Jan 5 14:34:34 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 5 Jan 2009 17:34:34 -0500 Subject: [ofa-general] Re: [ewg] RE: Agenda for the OFED meeting today (Jan 5, 09) In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> <7B9683B6-C37A-4143-A9FD-456686CD448A@cisco.com> Message-ID: <07F7C35D-C608-4213-98ED-9E13D78EBA0E@cisco.com> On Jan 5, 2009, at 5:16 PM, Hal Rosenstock wrote: > Don't get me wrong; I'm all for improving the SA scalability; there's > no quick solution to this AFAIK. > > It would be interesting to see an apples to apples comparison of > OpenSM and proprietary SMs in terms of running on the same hardware > and the transaction rate for various things. > > I think this warrants an open discussion if people are serious about > working on this issue. Agreed. I agree that this set of issues has come up many times before on the list; it will be interesting to see if anyone will *do* anything about it this time. :-) (obviously, I'm only interested as a consumer of the end result) >> It might also be worthwhile to start a whole new discussion about >> making a >> better CM (at least from the ULP perspective). One that offers simple >> mechanisms for those who don't need/care about the details, but >> also offers >> complex/detailed mechanisms (perhaps remarkably like today's >> mechanisms). > > I've heard similar comments before but this too will take significant > where-with-all IMO. Ditto my above remarks. 
:-) -- Jeff Squyres Cisco Systems From rdreier at cisco.com Mon Jan 5 21:07:55 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 05 Jan 2009 21:07:55 -0800 Subject: [ofa-general] Re: [PATCH 3/3] infiniband: ipoib convert to net_device_ops In-Reply-To: <20090105132348.7cac98cb@extreme> (Stephen Hemminger's message of "Mon, 5 Jan 2009 13:23:48 -0800") References: <20090105201442.749889072@vyatta.com> <20090105201514.828840735@vyatta.com> <20090105132211.70baefb9@extreme> <20090105132254.052633d7@extreme> <20090105132348.7cac98cb@extreme> Message-ID: thanks a lot for bothering with all my annoying complaints, applied all 3 patches... From jackm at dev.mellanox.co.il Mon Jan 5 22:20:45 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 6 Jan 2009 08:20:45 +0200 Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing) In-Reply-To: References: <200812291223.11753.jackm@dev.mellanox.co.il> Message-ID: <200901060820.45531.jackm@dev.mellanox.co.il> On Monday 05 January 2009 07:03, Roland Dreier wrote: > So do you think my patch (which avoids all the code duplication and > goto) is OK, or is that still too much overhead? Keep in mind the > global impact of a larger I-cache footprint because of the code > duplication... > Sorry for not responding sooner. As Tziporet indicated, we've not yet had a chance to compare performance of the two patches. I'll try to get to this soon. 
- Jack From dorfman.eli at gmail.com Mon Jan 5 23:42:23 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 06 Jan 2009 09:42:23 +0200 Subject: [ofa-general] [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <20090105184219.GN31213@obsidianresearch.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <20090105184219.GN31213@obsidianresearch.com> Message-ID: <49630B5F.8050100@gmail.com> Jason Gunthorpe wrote: > On Mon, Jan 05, 2009 at 02:47:19PM +0200, Eli Dorfman (Voltaire) wrote: > >>> Should XMT_WAIT support be added to output_aggregate_perfcounters(), >>> reset and other places too? >> reset is not supported by the firmware (at the moment). >> need to add xmitwait to output_aggregate_perfcounters() as well. > > Yeah, but other devices that support this counter do support reset :) > In that case it should be implemented. Since I don't have these devices it would be difficult for me to test this functionality. From dorfman.eli at gmail.com Mon Jan 5 23:46:15 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 06 Jan 2009 09:46:15 +0200 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> Message-ID: <49630C47.4000302@gmail.com> Hal Rosenstock wrote: > Eli, > > On Mon, Jan 5, 2009 at 7:47 AM, Eli Dorfman (Voltaire) > wrote: >> Sasha Khapyorsky wrote: >>> Hi Eli, >>> >>> On 14:58 Tue 23 Dec , Eli Dorfman (Voltaire) wrote: >>>> Add support for PortXmitWait counter >>>> Show PortCounters::PortXmitWait when this capability is supported by the firmware. >>>> If not supported show this counter as 0. 
>>>> >>>> Signed-off-by: Eli Dorfman >>>> --- >>>> infiniband-diags/src/perfquery.c | 10 +++++++++- >>>> libibmad/include/infiniband/mad.h | 1 + >>>> libibmad/src/fields.c | 1 + >>>> 3 files changed, 11 insertions(+), 1 deletions(-) >>>> >>>> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c >>>> index 7a53e92..4166fff 100644 >>>> --- a/infiniband-diags/src/perfquery.c >>>> +++ b/infiniband-diags/src/perfquery.c >>>> @@ -68,6 +68,7 @@ struct perf_count { >>>> uint32_t rcvdata; >>>> uint32_t xmtpkts; >>>> uint32_t rcvpkts; >>>> + uint32_t xmtwait; >>>> }; >>>> >>>> struct perf_count_ext { >>>> @@ -210,6 +211,8 @@ static void aggregate_perfcounters(void) >>>> aggregate_32bit(&perf_count.xmtpkts, val); >>>> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); >>>> aggregate_32bit(&perf_count.rcvpkts, val); >>>> + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); >>>> + aggregate_32bit(&perf_count.xmtwait, val); >>>> } >>> Should XMT_WAIT support be added to output_aggregate_perfcounters(), >>> reset and other places too? >> reset is not supported by the firmware (at the moment). > > What is the firmware response to a reset ? i didn't try this since mellanox say they don't support this. > > -- Hal > >> need to add xmitwait to output_aggregate_perfcounters() as well. 
>> >>>> static void output_aggregate_perfcounters(ib_portid_t *portid) >>>> @@ -299,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p >>>> if (extended != 1) { >>>> if (!port_performance_query(pc, portid, port, timeout)) >>>> IBERROR("perfquery"); >>>> + if (!(cap_mask & 0x1000)) { >>>> + /* if PortCounters:PortXmitWait not suppported clear this counter */ >>>> + perf_count.xmtwait = 0; >>>> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >>>> + } >>>> if (aggregate) >>>> aggregate_perfcounters(); >>>> - else >>>> + else >>>> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); >>>> } else { >>>> if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ >>>> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h >>>> index c2ad148..6c313f9 100644 >>>> --- a/libibmad/include/infiniband/mad.h >>>> +++ b/libibmad/include/infiniband/mad.h >>>> @@ -413,6 +413,7 @@ enum MAD_FIELDS { >>>> IB_PC_RCV_BYTES_F, >>>> IB_PC_XMT_PKTS_F, >>>> IB_PC_RCV_PKTS_F, >>>> + IB_PC_XMT_WAIT_F, >>>> IB_PC_LAST_F, >>>> >>>> /* >>> Basically I'm fine to have two separate patches - one to support >>> XMT_WAIT in libibmad and another one for perfquery, this is a minor >>> although. >> it is all part of the same change. 
>> >>> Sasha >>> >>>> diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c >>>> index 6942e85..116e432 100644 >>>> --- a/libibmad/src/fields.c >>>> +++ b/libibmad/src/fields.c >>>> @@ -247,6 +247,7 @@ ib_field_t ib_mad_f [] = { >>>> [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, >>>> [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, >>>> [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, >>>> + [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, >>>> >>>> /* >>>> * SMInfo >>>> -- >>>> 1.5.5 >>>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From nicolas.morey-chaisemartin at ext.bull.net Tue Jan 6 00:37:01 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Tue, 06 Jan 2009 09:37:01 +0100 Subject: [ofa-general] Question about rdma_cm Message-ID: <4963182D.5020105@ext.bull.net> Hello, I'm trying to understand how rdma_cm works. More precisely, how it finds which device (local) and guid/lid (remote) to use. For the local device if I have understood well: -Using ip_dev_find, we get an IP (IPoIB or not) device which has a route to the destination IP. -We use the first port of the first device which has an IP matching. For the remote HCA, I can't workout how is the transition done between the destination IP and the lid. Can anyone enlighten me? 
Best Regards Nicolas From tziporet at dev.mellanox.co.il Tue Jan 6 02:48:55 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 06 Jan 2009 12:48:55 +0200 Subject: [ofa-general] ConnectX hca problem In-Reply-To: <496238A9.5010605@opengridcomputing.com> References: <496238A9.5010605@opengridcomputing.com> Message-ID: <49633717.7090708@mellanox.co.il> Steve Wise wrote: > Mellanox experts: > > I'm having problems with one of my ConnectX cards. When I load > mlx4_core, it logs the following error. > Any suggestions on how to proceed? > > > Steve > > ------- > > mlx4_core: Mellanox ConnectX core driver v0.01 (May 1, 2007) > mlx4_core: Initializing 0000:0c:00.0 > mlx4_core 0000:0c:00.0: PCI INT A -> GSI 33 (level, low) -> IRQ 33 > mlx4_core 0000:0c:00.0: setting latency timer to 64 > mlx4_core 0000:0c:00.0: RUN_FW command failed, aborting. > mlx4_core 0000:0c:00.0: Failed to start FW, aborting. > mlx4_core 0000:0c:00.0: PCI INT A disabled > mlx4_core: probe of 0000:0c:00.0 failed with error -22 > Which FW version are you running? 
In latest kernel and OFED 1.4 when RUN_FW fails it should print the error number returned by RUN_FW and in this way we can tell what is the problem Tziporet From vlad at lists.openfabrics.org Tue Jan 6 03:18:20 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 6 Jan 2009 03:18:20 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090106-0200 daily build status Message-ID: <20090106111820.B3482E60C93@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed 
on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From hal.rosenstock at gmail.com Tue Jan 6 07:25:15 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 6 Jan 2009 10:25:15 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <49630C47.4000302@gmail.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <49630C47.4000302@gmail.com> Message-ID: On Tue, Jan 6, 2009 at 2:46 AM, Eli Dorfman (Voltaire) wrote: >>> reset is not supported by the firmware (at the moment). >> >> What is the firmware response to a reset ? > > i didn't try this since mellanox say they don't support this. Which Mellanox device (and fw version) ? From tzahio at voltaire.com Tue Jan 6 08:34:43 2009 From: tzahio at voltaire.com (Tzahi Oved) Date: Tue, 6 Jan 2009 18:34:43 +0200 Subject: [ofa-general] ***SPAM*** rdma write to pci address Message-ID: <3290ba6b0901060834w35980b51rcde36dde50e11701@mail.gmail.com> Hi, I'm trying to RDMA write from one server directly to a pci bar address in a remote server. I'm using ConnectX HCA with OFED 1.4 on a DL160 platform. Basically I'm using the following calls: ib_get_dma_mr() - get a general mr for physical bus memory space dma_map_single() - map my ioremap()ed kernel virtual address of the io device to the HCA bus address space Then with mr->rkey and the bus address I'm trying to rdma write from a remote server. The RDMA operation returns with cq.status 10 (rem op err). Any ideas why it fails? should I take a different approach? 
Many thanks,
Tzahi

From tzahio at voltaire.com Tue Jan 6 08:42:22 2009
From: tzahio at voltaire.com (Tzahi Oved)
Date: Tue, 6 Jan 2009 18:42:22 +0200
Subject: [ofa-general] ***SPAM*** rdma write to pci address
In-Reply-To: <3290ba6b0901060834w35980b51rcde36dde50e11701@mail.gmail.com>
References: <3290ba6b0901060834w35980b51rcde36dde50e11701@mail.gmail.com>
Message-ID: <3290ba6b0901060842q3b77ea52x6a413b5e945a4b1d@mail.gmail.com>

Adding a correction: the RDMA operation returns with cq.status 10 (rem access err).

On Tue, Jan 6, 2009 at 6:34 PM, Tzahi Oved wrote:
> Hi,
> I'm trying to RDMA write from one server directly to a pci bar address in a
> remote server.
> I'm using ConnectX HCA with OFED 1.4 on a DL160 platform.
> Basically I'm using the following calls:
> ib_get_dma_mr() - get a general mr for physical bus memory space
> dma_map_single() - map my ioremap()ed kernel virtual address of the io
> device to the HCA bus address space
> Then with mr->rkey and the bus address I'm trying to rdma write from a
> remote server.
> The RDMA operation returns with cq.status 10 (rem op err).
> Any ideas why it fails? should I take a different approach?
> Many thanks,
> Tzahi

From sean.hefty at intel.com Tue Jan 6 10:37:21 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 Jan 2009 10:37:21 -0800
Subject: [ofa-general] Question about rdma_cm
In-Reply-To: <4963182D.5020105@ext.bull.net>
References: <4963182D.5020105@ext.bull.net>
Message-ID: <000001c9702d$cfe79630$355b180a@amr.corp.intel.com>

>For the remote HCA, I can't workout how is the transition done between
>the destination IP and the lid.
>
>Can anyone enlighten me?

The remote IP address is mapped to what IP considers to be the L2 address
(typically using ARP). The 'L2' address contains the destination GID. The
rdma_resolve_route() call performs a path record query to the SM using the local
address (SGID and PKey) and remote address (DGID) to obtain a usable path,
including the LIDs.

- Sean

From davem at davemloft.net Tue Jan 6 10:45:40 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 06 Jan 2009 10:45:40 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 7/8] infiniband: driver API update
In-Reply-To:
References: <20090105201442.749889072@vyatta.com> <20090105201514.828840735@vyatta.com>
Message-ID: <20090106.104540.111119292.davem@davemloft.net>

From: Roland Dreier
Date: Mon, 05 Jan 2009 13:02:15 -0800

> Looks good enough, so Dave if you want to merge it, you can add
>
> Acked-by: Roland Dreier
>
> or let me know if you want me to pull it in through my tree. A couple
> of nits that are probably not worth fixing. First, globally, it might
> be slightly nicer to merge this as one patch per module, rather than all
> lumped together.
And also:

This is superseded by the 3 patch set Roland took into his tree
so this patch doesn't need to be applied :-)

From davem at davemloft.net Tue Jan 6 10:46:45 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 06 Jan 2009 10:46:45 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 3/3] infiniband: ipoib convert to net_device_ops
In-Reply-To:
References: <20090105132254.052633d7@extreme> <20090105132348.7cac98cb@extreme>
Message-ID: <20090106.104645.165832310.davem@davemloft.net>

From: Roland Dreier
Date: Mon, 05 Jan 2009 21:07:55 -0800

> thanks a lot for bothering with all my annoying complaints, applied all
> 3 patches...

Thanks for taking these Roland.

From brian at sun.com Tue Jan 6 11:47:21 2009
From: brian at sun.com (Brian J. Murrell)
Date: Tue, 06 Jan 2009 14:47:21 -0500
Subject: [ofa-general] building just kernel-ib{-devel} and not being root
Message-ID: <1231271241.6441.52.camel@pc.interlinx.bc.ca>

I am wondering if there is a generally supported (or otherwise) way of
taking a pristine source tarball (i.e. OFED-1.4.tgz) and unpacking and
building just the kernel-ib{-devel} packages with a set of option
selections.

The idea here is to drop something into an automated build system that
builds these packages. Previously I have just called rpmbuild on the
ofa_kernel.spec file directly with a boatload of options. I'm feeling
like this is not quite future-proof and looking for a more supported
method of achieving this.

One of the caveats is that the normal build and install process of the
install.pl won't work for me. I cannot be root on the build system (and
therefore cannot install packages).

Any ideas?

Thanx,
b.

From devel at morey-chaisemartin.com Tue Jan 6 12:04:35 2009
From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Tue, 06 Jan 2009 21:04:35 +0100
Subject: [ofa-general] Question about rdma_cm
In-Reply-To: <000001c9702d$cfe79630$355b180a@amr.corp.intel.com>
References: <4963182D.5020105@ext.bull.net> <000001c9702d$cfe79630$355b180a@amr.corp.intel.com>
Message-ID: <4963B953.1030007@morey-chaisemartin.com>

Sean Hefty wrote:
>> For the remote HCA, I can't workout how is the transition done between
>> the destination IP and the lid.
>>
>> Can anyone enlighten me?
>>
>
> The remote IP address is mapped to what IP considers to be the L2 address
> (typically using ARP). The 'L2' address contains the destination GID. The
> rdma_resolve_route() call performs a path record query to the SM using the local
> address (SGID and PKey) and remote address (DGID) to obtain a usable path,
> including the LIDs.
>
> - Sean
>
>
Hi,

Thanks for your answer.
I saw the DGID to path/LID earlier, but I didn't see where and how
exactly the GID is obtained.
I also saw how the GID is obtained from the L2-like address, but I'm
still not sure how the L2-like address is obtained.
Is there an ARP-like protocol implemented on IB?

Anyway, thanks for your answers.

Best Regards
Nicolas Morey-Chaisemartin

From sean.hefty at intel.com Tue Jan 6 12:19:00 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 Jan 2009 12:19:00 -0800
Subject: [ofa-general] Question about rdma_cm
In-Reply-To: <4963B953.1030007@morey-chaisemartin.com>
References: <4963182D.5020105@ext.bull.net> <000001c9702d$cfe79630$355b180a@amr.corp.intel.com> <4963B953.1030007@morey-chaisemartin.com>
Message-ID: <000101c9703c$031a1740$355b180a@amr.corp.intel.com>

>Thanks for your answer.
>I saw the DGID to path/LID earlier, but I didn't see where and how
>exactly the GID is obtained.

If you're tracing through the code, look at rdma_resolve_addr() calling
rdma_resolve_ip().
The latter is in the ib_addr module (addr.c).
rdma_resolve_ip() calls addr_resolve_remote() and addr_send_arp().

>I also saw how the GID is obtained from the L2-like address, but I'm
>still not sure how the L2-like address is obtained.
>Is there an ARP like protocol implemented on IB?

yes - For IPoIB to communicate with a remote system, it needs to map the
destination IP address to a usable IB address (DGID, PKey, QPN), plus perform a
path record query to get the LIDs. The rdma cm follows the same concept and
relies on the network stack to provide it with the correct information.

- Sean

From tziporet at mellanox.co.il Tue Jan 6 13:41:31 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 6 Jan 2009 23:41:31 +0200
Subject: [ofa-general] OFED Jan 5, 2009 meeting minutes on OFED plans
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0164F1F0@mtlexch01.mtl.com>

OFED Jan 5, 2009 meeting minutes on future plans
=====================================

Meeting minutes on the web:
http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes

Meeting Summary:
==============
1. OFED 1.4.1 release: We will look into it in the middle of February
2. OFED 1.5:
   - Release target date is July 09
   - Kernel base will be 2.6.29
   - Features list - not closed yet

Details:
======
1. Conclusions from OFED 1.4 release:
We got several comments from Doug (Redhat):
- Ammasso driver is not compiling - we should remove it from the kernel
  sources since it does not compile
- IBM ehca driver - does not compile on RHEL4
- Improvements for the build system - grep for undefined functions and
  report this since they will fail when trying to load the module

Other conclusions:
- Real feature freeze is only after we are on the right kernel base
- We usually have at least 6 RCs - need to plan for this on the schedule

2. Do we wish to have OFED 1.4.1:
Pros: We can add the following features:
- RDS iWARP - Steve Wise
- NFS/RDMA backports - Steve Wise
- Open MPI 1.3 - Jeff S.
- Supporting new OSes: RHEL 5.3, SLES 11
- Fixes of fatal bugs - if found
Cons:
- Having 1.4.1 release will delay 1.5 schedule for one month or more.
- A large QA effort.

Decision: Since no fatal bugs were reported so far, we decided to revisit
this in the middle of February.
Meanwhile Steve can work on RDS iWARP support and NFS/RDMA backports on
the 1.4.1 branch.
People that wish to work with Open MPI 1.3 can download it from its site.

3. OFED 1.5: Schedule and features.

Schedule:
Release is planned for July (assuming no 1.4.1 release).
Feature Freeze: 4/20/09
Alpha Release: 4/20/09
Beta Release: 5/20/09
RC1: 5/05/09
RC2: 5/19/09
RC3: 6/02/09
RC4: 6/16/09
RC5: 6/30/09
RC6: 7/14/09
Release: 7/28/09

Features:
* Kernel.org: 2.6.28 and 2.6.29
* Multiple Event Queues to support Multi-core CPUs
* NFS/RDMA - GA
* RDS support for iWARP
* OpenMPI 1.3
* Add support/backports for RedHat EL 5.3 and EL 4.8, SLES 11
* Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device
* SDP - performance improvements
* Mellanox suggested to add IB over Eth - this is similar to iWARP but
  more like IB (e.g. including UD), and can work over ConnectX.
  A concern was raised by Intel (Dave Sommers) since it is not a standard
  transport.
  Decision: This request will be raised in the MWG, and they should decide
  if OFA can support it.

Further discussion of 1.5 features will take place at the next meeting.

Tziporet

From sean.hefty at intel.com Tue Jan 6 13:54:28 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 Jan 2009 13:54:28 -0800
Subject: [ofa-general] RE: OFED Jan 5, 2009 meeting minutes on OFED plans
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0164F1F0@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> <5D49E7A8952DC44FB38C38FA0D758EAD0164F1F0@mtlexch01.mtl.com>
Message-ID: <000201c97049$59960fe0$355b180a@amr.corp.intel.com>

>* Mellanox suggested to add IB over Eth - this is similar to iWARP but
>more like IB (e.g. including UD), and can work over ConnectX.
>A concern was raised by Intel (Dave Sommers) since it is not a standard
>transport.
>Decision: This request will be raised in the MWG, and they should decide
>if OFA can support it.

This is just my opinion, but in the past, OFED has included non-standard
features, like extended connected mode, that are still not part of the IBTA
spec.

Do we know if such a feature would be accepted into the Linux kernel? I think
OFED should base their decision more on the answer to that question than IBTA
approval.

- Sean

From sean.hefty at intel.com Tue Jan 6 13:59:38 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 6 Jan 2009 13:59:38 -0800
Subject: ***SPAM*** Re: [ofa-general] [RDMA CM IPv6 PATCHv7 2/2] RDMA CM
In-Reply-To: <579253.6715.qm@web58304.mail.re3.yahoo.com>
References: <579253.6715.qm@web58304.mail.re3.yahoo.com>
Message-ID: <000301c9704a$11fb6a30$355b180a@amr.corp.intel.com>

>To return the error on the active side that "this iWARP device doesn't support
>IPv6",
>
>the following places are the possibility
>
>-rdma_connect()
>
>-rdma_resolve_addr()
>
>-qp setup time
>
>
>
>rdma_resolve_addr() would probably be the earliest to handle this.

I believe this is the proper place to check for this. Does the code with the
IPv6 support changes not handle this?

- Sean

From jim.ryan at intel.com Tue Jan 6 14:00:55 2009
From: jim.ryan at intel.com (Ryan, Jim)
Date: Tue, 6 Jan 2009 14:00:55 -0800
Subject: [ofa-general] RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans
In-Reply-To: <000201c97049$59960fe0$355b180a@amr.corp.intel.com>
Message-ID: <3F6F638B8D880340AB536D29CD4C1E192F7F0BDF@orsmsx501.amr.corp.intel.com>

Sean, I think that's a good point. What it suggests to me is asking, when
someone proposes a "non-standard" feature, what process, procedures,
documentation, support, etc., if any, should be made available by the entity
making the proposal?

It seems to me asking the same questions of all proposed features is fair and
reasonable, and shouldn't represent an unreasonable barrier to progress.

Thoughts? If this already exists, it's my ignorance and I will apologize in
advance.

Thanks again, Jim

-----Original Message-----
From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty
Sent: Tuesday, January 06, 2009 1:54 PM
To: 'Tziporet Koren'; ewg at lists.openfabrics.org
Cc: general at lists.openfabrics.org
Subject: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans

>* Mellanox suggested to add IB over Eth - this is similar to iWARP but
>more like IB (e.g. including UD), and can work over ConnectX.
>A concern was raised by Intel (Dave Sommers) since it is not a standard
>transport.
>Decision: This request will be raised in the MWG, and they should decide
>if OFA can support it.

This is just my opinion, but in the past, OFED has included non-standard
features, like extended connected mode, that are still not part of the IBTA
spec.

Do we know if such a feature would be accepted into the Linux kernel? I think
OFED should base their decision more on the answer to that question than IBTA
approval.

- Sean _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From sfr at canb.auug.org.au Tue Jan 6 16:35:09 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Wed, 7 Jan 2009 11:35:09 +1100 Subject: [ofa-general] [PATCH] powerpc: cleanup from powerpc l64 to ll64 change: drivers/infiniband Message-ID: <20090107113509.14496aa9.sfr@canb.auug.org.au> This is a powerpc specific driver. Signed-off-by: Stephen Rothwell --- drivers/infiniband/hw/ehca/ehca_cq.c | 16 ++-- drivers/infiniband/hw/ehca/ehca_hca.c | 2 +- drivers/infiniband/hw/ehca/ehca_irq.c | 18 ++-- drivers/infiniband/hw/ehca/ehca_main.c | 6 +- drivers/infiniband/hw/ehca/ehca_mcast.c | 4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 144 +++++++++++++++--------------- drivers/infiniband/hw/ehca/ehca_qp.c | 32 ++++---- drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +- drivers/infiniband/hw/ehca/ehca_sqp.c | 2 +- drivers/infiniband/hw/ehca/ehca_tools.h | 2 +- drivers/infiniband/hw/ehca/ehca_uverbs.c | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 30 +++--- 12 files changed, 130 insertions(+), 130 deletions(-) This patch on its own will generate lot of warnings - it depends on the powerpc architecture changing from l64 to ll64 i.e. u64 becomes an "unsigned long long" instead of "unsigned long". That patch is pending in the powerpc arch queue. It might be easier for someone to just ack this patch and it be routed through the powerpc tree. 
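
For readers following the l64 to ll64 change described above, here is a minimal userspace sketch of why the format strings in the diff move from `%lx`/`%li` to `%llx`/`%lli`. It assumes stand-in typedefs rather than the kernel's own headers, and the helper `format_status` with its arguments is hypothetical, not part of the ehca driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: userspace stand-in for the kernel's u64 after the ll64 change. */
typedef unsigned long long u64;

/*
 * Hypothetical helper (not from the ehca driver): formats a return code and
 * a handle with the ll-width specifiers the patch switches to.
 * Returns the number of characters snprintf would have written.
 */
int format_status(char *buf, size_t len, long long h_ret, u64 handle)
{
	assert(buf != NULL && len > 0);
	/* %lli and %llx match long long / unsigned long long arguments. */
	return snprintf(buf, len, "h_ret=%lli hca_hndl=%llx", h_ret, handle);
}
```

With `u64` defined as `unsigned long long`, passing it to `%lx` is a type mismatch that a compiler's format checking (for example gcc's `-Wformat`) should flag, which is the class of warning the description above refers to.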
diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 2f4c28a..26efbad 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -196,7 +196,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret != H_SUCCESS) { ehca_err(device, "hipz_h_alloc_resource_cq() failed " - "h_ret=%li device=%p", h_ret, device); + "h_ret=%lli device=%p", h_ret, device); cq = ERR_PTR(ehca2ib_return_code(h_ret)); goto create_cq_exit2; } @@ -232,7 +232,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret < H_SUCCESS) { ehca_err(device, "hipz_h_register_rpage_cq() failed " - "ehca_cq=%p cq_num=%x h_ret=%li counter=%i " + "ehca_cq=%p cq_num=%x h_ret=%lli counter=%i " "act_pages=%i", my_cq, my_cq->cq_number, h_ret, counter, param.act_pages); cq = ERR_PTR(-EINVAL); @@ -244,7 +244,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if ((h_ret != H_SUCCESS) || vpage) { ehca_err(device, "Registration of pages not " "complete ehca_cq=%p cq_num=%x " - "h_ret=%li", my_cq, my_cq->cq_number, + "h_ret=%lli", my_cq, my_cq->cq_number, h_ret); cq = ERR_PTR(-EAGAIN); goto create_cq_exit4; @@ -252,7 +252,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(device, "Registration of page failed " - "ehca_cq=%p cq_num=%x h_ret=%li " + "ehca_cq=%p cq_num=%x h_ret=%lli " "counter=%i act_pages=%i", my_cq, my_cq->cq_number, h_ret, counter, param.act_pages); @@ -266,7 +266,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, gal = my_cq->galpas.kernel; cqx_fec = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_fec)); - ehca_dbg(device, "ehca_cq=%p cq_num=%x CQX_FEC=%lx", + ehca_dbg(device, "ehca_cq=%p cq_num=%x CQX_FEC=%llx", my_cq, my_cq->cq_number, cqx_fec); my_cq->ib_cq.cqe = my_cq->nr_of_entries = @@ -307,7 +307,7 
@@ create_cq_exit3: h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); if (h_ret != H_SUCCESS) ehca_err(device, "hipz_h_destroy_cq() failed ehca_cq=%p " - "cq_num=%x h_ret=%li", my_cq, my_cq->cq_number, h_ret); + "cq_num=%x h_ret=%lli", my_cq, my_cq->cq_number, h_ret); create_cq_exit2: write_lock_irqsave(&ehca_cq_idr_lock, flags); @@ -355,7 +355,7 @@ int ehca_destroy_cq(struct ib_cq *cq) h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); if (h_ret == H_R_STATE) { /* cq in err: read err data and destroy it forcibly */ - ehca_dbg(device, "ehca_cq=%p cq_num=%x ressource=%lx in err " + ehca_dbg(device, "ehca_cq=%p cq_num=%x ressource=%llx in err " "state. Try to delete it forcibly.", my_cq, cq_num, my_cq->ipz_cq_handle.handle); ehca_error_data(shca, my_cq, my_cq->ipz_cq_handle.handle); @@ -365,7 +365,7 @@ int ehca_destroy_cq(struct ib_cq *cq) cq_num); } if (h_ret != H_SUCCESS) { - ehca_err(device, "hipz_h_destroy_cq() failed h_ret=%li " + ehca_err(device, "hipz_h_destroy_cq() failed h_ret=%lli " "ehca_cq=%p cq_num=%x", h_ret, my_cq, cq_num); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 4628822..9209c53 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -393,7 +393,7 @@ int ehca_modify_port(struct ib_device *ibdev, hret = hipz_h_modify_port(shca->ipz_hca_handle, port, cap, props->init_type, port_modify_mask); if (hret != H_SUCCESS) { - ehca_err(&shca->ib_device, "Modify port failed h_ret=%li", + ehca_err(&shca->ib_device, "Modify port failed h_ret=%lli", hret); ret = -EINVAL; } diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 757035e..9719e1a 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -99,7 +99,7 @@ static void print_error_data(struct ehca_shca *shca, void *data, return; ehca_err(&shca->ib_device, - "QP 0x%x (resource=%lx) has 
errors.", + "QP 0x%x (resource=%llx) has errors.", qp->ib_qp.qp_num, resource); break; } @@ -108,21 +108,21 @@ static void print_error_data(struct ehca_shca *shca, void *data, struct ehca_cq *cq = (struct ehca_cq *)data; ehca_err(&shca->ib_device, - "CQ 0x%x (resource=%lx) has errors.", + "CQ 0x%x (resource=%llx) has errors.", cq->cq_number, resource); break; } default: ehca_err(&shca->ib_device, - "Unknown error type: %lx on %s.", + "Unknown error type: %llx on %s.", type, shca->ib_device.name); break; } - ehca_err(&shca->ib_device, "Error data is available: %lx.", resource); + ehca_err(&shca->ib_device, "Error data is available: %llx.", resource); ehca_err(&shca->ib_device, "EHCA ----- error data begin " "---------------------------------------------------"); - ehca_dmp(rblock, length, "resource=%lx", resource); + ehca_dmp(rblock, length, "resource=%llx", resource); ehca_err(&shca->ib_device, "EHCA ----- error data end " "----------------------------------------------------"); @@ -152,7 +152,7 @@ int ehca_error_data(struct ehca_shca *shca, void *data, if (ret == H_R_STATE) ehca_err(&shca->ib_device, - "No error data is available: %lx.", resource); + "No error data is available: %llx.", resource); else if (ret == H_SUCCESS) { int length; @@ -164,7 +164,7 @@ int ehca_error_data(struct ehca_shca *shca, void *data, print_error_data(shca, data, rblock, length); } else ehca_err(&shca->ib_device, - "Error data could not be fetched: %lx", resource); + "Error data could not be fetched: %llx", resource); ehca_free_fw_ctrlblock(rblock); @@ -514,7 +514,7 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe) struct ehca_cq *cq; eqe_value = eqe->entry; - ehca_dbg(&shca->ib_device, "eqe_value=%lx", eqe_value); + ehca_dbg(&shca->ib_device, "eqe_value=%llx", eqe_value); if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) { ehca_dbg(&shca->ib_device, "Got completion event"); token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value); @@ -603,7 +603,7 @@ void 
ehca_process_eq(struct ehca_shca *shca, int is_irq) ret = hipz_h_eoi(eq->ist); if (ret != H_SUCCESS) ehca_err(&shca->ib_device, - "bad return code EOI -rc = %ld\n", ret); + "bad return code EOI -rc = %lld\n", ret); ehca_dbg(&shca->ib_device, "deadman found %x eqe", eqe_cnt); } if (unlikely(eqe_cnt == EHCA_EQE_CACHE_SIZE)) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index c7b8a50..368311c 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -304,7 +304,7 @@ static int ehca_sense_attributes(struct ehca_shca *shca) h_ret = hipz_h_query_hca(shca->ipz_hca_handle, rblock); if (h_ret != H_SUCCESS) { - ehca_gen_err("Cannot query device properties. h_ret=%li", + ehca_gen_err("Cannot query device properties. h_ret=%lli", h_ret); ret = -EPERM; goto sense_attributes1; @@ -391,7 +391,7 @@ static int ehca_sense_attributes(struct ehca_shca *shca) port = (struct hipz_query_port *)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { - ehca_gen_err("Cannot query port properties. h_ret=%li", + ehca_gen_err("Cannot query port properties. 
h_ret=%lli", h_ret); ret = -EPERM; goto sense_attributes1; @@ -682,7 +682,7 @@ static ssize_t ehca_show_adapter_handle(struct device *dev, { struct ehca_shca *shca = dev->driver_data; - return sprintf(buf, "%lx\n", shca->ipz_hca_handle.handle); + return sprintf(buf, "%llx\n", shca->ipz_hca_handle.handle); } static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); diff --git a/drivers/infiniband/hw/ehca/ehca_mcast.c b/drivers/infiniband/hw/ehca/ehca_mcast.c index e3ef026..120aedf 100644 --- a/drivers/infiniband/hw/ehca/ehca_mcast.c +++ b/drivers/infiniband/hw/ehca/ehca_mcast.c @@ -88,7 +88,7 @@ int ehca_attach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) if (h_ret != H_SUCCESS) ehca_err(ibqp->device, "ehca_qp=%p qp_num=%x hipz_h_attach_mcqp() failed " - "h_ret=%li", my_qp, ibqp->qp_num, h_ret); + "h_ret=%lli", my_qp, ibqp->qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -125,7 +125,7 @@ int ehca_detach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) if (h_ret != H_SUCCESS) ehca_err(ibqp->device, "ehca_qp=%p qp_num=%x hipz_h_detach_mcqp() failed " - "h_ret=%li", my_qp, ibqp->qp_num, h_ret); + "h_ret=%lli", my_qp, ibqp->qp_num, h_ret); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index f974367..72f83f7 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -204,7 +204,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, } if ((size == 0) || (((u64)iova_start + size) < (u64)iova_start)) { - ehca_err(pd->device, "bad input values: size=%lx iova_start=%p", + ehca_err(pd->device, "bad input values: size=%llx iova_start=%p", size, iova_start); ib_mr = ERR_PTR(-EINVAL); goto reg_phys_mr_exit0; @@ -309,8 +309,8 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, } if (length == 0 || virt + length < virt) { - ehca_err(pd->device, "bad input values: length=%lx " - "virt_base=%lx", 
length, virt); + ehca_err(pd->device, "bad input values: length=%llx " + "virt_base=%llx", length, virt); ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } @@ -373,7 +373,7 @@ reg_user_mr_fallback: &e_mr->ib.ib_mr.rkey); if (ret == -EINVAL && pginfo.hwpage_size > PAGE_SIZE) { ehca_warn(pd->device, "failed to register mr " - "with hwpage_size=%lx", hwpage_size); + "with hwpage_size=%llx", hwpage_size); ehca_info(pd->device, "try to register mr with " "kpage_size=%lx", PAGE_SIZE); /* @@ -509,7 +509,7 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, goto rereg_phys_mr_exit1; if ((new_size == 0) || (((u64)iova_start + new_size) < (u64)iova_start)) { - ehca_err(mr->device, "bad input values: new_size=%lx " + ehca_err(mr->device, "bad input values: new_size=%llx " "iova_start=%p", new_size, iova_start); ret = -EINVAL; goto rereg_phys_mr_exit1; @@ -580,8 +580,8 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) h_ret = hipz_h_query_mr(shca->ipz_hca_handle, e_mr, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(mr->device, "hipz_mr_query failed, h_ret=%li mr=%p " - "hca_hndl=%lx mr_hndl=%lx lkey=%x", + ehca_err(mr->device, "hipz_mr_query failed, h_ret=%lli mr=%p " + "hca_hndl=%llx mr_hndl=%llx lkey=%x", h_ret, mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); ret = ehca2ib_return_code(h_ret); @@ -630,8 +630,8 @@ int ehca_dereg_mr(struct ib_mr *mr) /* TODO: BUSY: MR still has bound window(s) */ h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { - ehca_err(mr->device, "hipz_free_mr failed, h_ret=%li shca=%p " - "e_mr=%p hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + ehca_err(mr->device, "hipz_free_mr failed, h_ret=%lli shca=%p " + "e_mr=%p hca_hndl=%llx mr_hndl=%llx mr->lkey=%x", h_ret, shca, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); ret = ehca2ib_return_code(h_ret); @@ -671,8 +671,8 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) h_ret = 
hipz_h_alloc_resource_mw(shca->ipz_hca_handle, e_mw, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%li " - "shca=%p hca_hndl=%lx mw=%p", + ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%lli " + "shca=%p hca_hndl=%llx mw=%p", h_ret, shca, shca->ipz_hca_handle.handle, e_mw); ib_mw = ERR_PTR(ehca2ib_return_code(h_ret)); goto alloc_mw_exit1; @@ -713,8 +713,8 @@ int ehca_dealloc_mw(struct ib_mw *mw) h_ret = hipz_h_free_resource_mw(shca->ipz_hca_handle, e_mw); if (h_ret != H_SUCCESS) { - ehca_err(mw->device, "hipz_free_mw failed, h_ret=%li shca=%p " - "mw=%p rkey=%x hca_hndl=%lx mw_hndl=%lx", + ehca_err(mw->device, "hipz_free_mw failed, h_ret=%lli shca=%p " + "mw=%p rkey=%x hca_hndl=%llx mw_hndl=%llx", h_ret, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, e_mw->ipz_mw_handle.handle); return ehca2ib_return_code(h_ret); @@ -840,7 +840,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, goto map_phys_fmr_exit0; if (iova % e_fmr->fmr_page_size) { /* only whole-numbered pages */ - ehca_err(fmr->device, "bad iova, iova=%lx fmr_page_size=%x", + ehca_err(fmr->device, "bad iova, iova=%llx fmr_page_size=%x", iova, e_fmr->fmr_page_size); ret = -EINVAL; goto map_phys_fmr_exit0; @@ -878,7 +878,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, map_phys_fmr_exit0: if (ret) ehca_err(fmr->device, "ret=%i fmr=%p page_list=%p list_len=%x " - "iova=%lx", ret, fmr, page_list, list_len, iova); + "iova=%llx", ret, fmr, page_list, list_len, iova); return ret; } /* end ehca_map_phys_fmr() */ @@ -964,8 +964,8 @@ int ehca_dealloc_fmr(struct ib_fmr *fmr) h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); if (h_ret != H_SUCCESS) { - ehca_err(fmr->device, "hipz_free_mr failed, h_ret=%li e_fmr=%p " - "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", + ehca_err(fmr->device, "hipz_free_mr failed, h_ret=%lli e_fmr=%p " + "hca_hndl=%llx fmr_hndl=%llx fmr->lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, 
fmr->lkey); ret = ehca2ib_return_code(h_ret); @@ -1007,8 +1007,8 @@ int ehca_reg_mr(struct ehca_shca *shca, (u64)iova_start, size, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%li " - "hca_hndl=%lx", h_ret, shca->ipz_hca_handle.handle); + ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%lli " + "hca_hndl=%llx", h_ret, shca->ipz_hca_handle.handle); ret = ehca2ib_return_code(h_ret); goto ehca_reg_mr_exit0; } @@ -1033,9 +1033,9 @@ int ehca_reg_mr(struct ehca_shca *shca, ehca_reg_mr_exit1: h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "h_ret=%li shca=%p e_mr=%p " - "iova_start=%p size=%lx acl=%x e_pd=%p lkey=%x " - "pginfo=%p num_kpages=%lx num_hwpages=%lx ret=%i", + ehca_err(&shca->ib_device, "h_ret=%lli shca=%p e_mr=%p " + "iova_start=%p size=%llx acl=%x e_pd=%p lkey=%x " + "pginfo=%p num_kpages=%llx num_hwpages=%llx ret=%i", h_ret, shca, e_mr, iova_start, size, acl, e_pd, hipzout.lkey, pginfo, pginfo->num_kpages, pginfo->num_hwpages, ret); @@ -1045,8 +1045,8 @@ ehca_reg_mr_exit1: ehca_reg_mr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p " - "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " - "num_kpages=%lx num_hwpages=%lx", + "iova_start=%p size=%llx acl=%x e_pd=%p pginfo=%p " + "num_kpages=%llx num_hwpages=%llx", ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; @@ -1116,8 +1116,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, */ if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "last " - "hipz_reg_rpage_mr failed, h_ret=%li " - "e_mr=%p i=%x hca_hndl=%lx mr_hndl=%lx" + "hipz_reg_rpage_mr failed, h_ret=%lli " + "e_mr=%p i=%x hca_hndl=%llx mr_hndl=%llx" " lkey=%x", h_ret, e_mr, i, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, @@ -1128,8 +1128,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = 0; } else if (h_ret != 
H_PAGE_REGISTERED) { ehca_err(&shca->ib_device, "hipz_reg_rpage_mr failed, " - "h_ret=%li e_mr=%p i=%x lkey=%x hca_hndl=%lx " - "mr_hndl=%lx", h_ret, e_mr, i, + "h_ret=%lli e_mr=%p i=%x lkey=%x hca_hndl=%llx " + "mr_hndl=%llx", h_ret, e_mr, i, e_mr->ib.ib_mr.lkey, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle); @@ -1145,7 +1145,7 @@ ehca_reg_mr_rpages_exit1: ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p pginfo=%p " - "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr, + "num_kpages=%llx num_hwpages=%llx", ret, shca, e_mr, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; } /* end ehca_reg_mr_rpages() */ @@ -1184,7 +1184,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, ret = ehca_set_pagebuf(pginfo, pginfo->num_hwpages, kpage); if (ret) { ehca_err(&shca->ib_device, "set pagebuf failed, e_mr=%p " - "pginfo=%p type=%x num_kpages=%lx num_hwpages=%lx " + "pginfo=%p type=%x num_kpages=%llx num_hwpages=%llx " "kpage=%p", e_mr, pginfo, pginfo->type, pginfo->num_kpages, pginfo->num_hwpages, kpage); goto ehca_rereg_mr_rereg1_exit1; @@ -1205,13 +1205,13 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, * (MW bound or MR is shared) */ ehca_warn(&shca->ib_device, "hipz_h_reregister_pmr failed " - "(Rereg1), h_ret=%li e_mr=%p", h_ret, e_mr); + "(Rereg1), h_ret=%lli e_mr=%p", h_ret, e_mr); *pginfo = pginfo_save; ret = -EAGAIN; } else if ((u64 *)hipzout.vaddr != iova_start) { ehca_err(&shca->ib_device, "PHYP changed iova_start in " - "rereg_pmr, iova_start=%p iova_start_out=%lx e_mr=%p " - "mr_handle=%lx lkey=%x lkey_out=%x", iova_start, + "rereg_pmr, iova_start=%p iova_start_out=%llx e_mr=%p " + "mr_handle=%llx lkey=%x lkey_out=%x", iova_start, hipzout.vaddr, e_mr, e_mr->ipz_mr_handle.handle, e_mr->ib.ib_mr.lkey, hipzout.lkey); ret = -EFAULT; @@ -1235,7 +1235,7 @@ ehca_rereg_mr_rereg1_exit1: ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%i lkey=%x rkey=%x 
" - "pginfo=%p num_kpages=%lx num_hwpages=%lx", + "pginfo=%p num_kpages=%llx num_hwpages=%llx", ret, *lkey, *rkey, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; @@ -1263,7 +1263,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, (e_mr->num_hwpages > MAX_RPAGES) || (pginfo->num_hwpages > e_mr->num_hwpages)) { ehca_dbg(&shca->ib_device, "Rereg3 case, " - "pginfo->num_hwpages=%lx e_mr->num_hwpages=%x", + "pginfo->num_hwpages=%llx e_mr->num_hwpages=%x", pginfo->num_hwpages, e_mr->num_hwpages); rereg_1_hcall = 0; rereg_3_hcall = 1; @@ -1295,7 +1295,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_free_mr failed, " - "h_ret=%li e_mr=%p hca_hndl=%lx mr_hndl=%lx " + "h_ret=%lli e_mr=%p hca_hndl=%llx mr_hndl=%llx " "mr->lkey=%x", h_ret, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, @@ -1328,8 +1328,8 @@ int ehca_rereg_mr(struct ehca_shca *shca, ehca_rereg_mr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p " - "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " - "num_kpages=%lx lkey=%x rkey=%x rereg_1_hcall=%x " + "iova_start=%p size=%llx acl=%x e_pd=%p pginfo=%p " + "num_kpages=%llx lkey=%x rkey=%x rereg_1_hcall=%x " "rereg_3_hcall=%x", ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, pginfo->num_kpages, *lkey, *rkey, rereg_1_hcall, rereg_3_hcall); @@ -1371,8 +1371,8 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, * FMRs are not shared and no MW bound to FMRs */ ehca_err(&shca->ib_device, "hipz_reregister_pmr failed " - "(Rereg1), h_ret=%li e_fmr=%p hca_hndl=%lx " - "mr_hndl=%lx lkey=%x lkey_out=%x", + "(Rereg1), h_ret=%lli e_fmr=%p hca_hndl=%llx " + "mr_hndl=%llx lkey=%x lkey_out=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, e_fmr->ib.ib_fmr.lkey, hipzout.lkey); @@ -1383,7 +1383,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, h_ret = 
hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_free_mr failed, " - "h_ret=%li e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "h_ret=%lli e_fmr=%p hca_hndl=%llx mr_hndl=%llx " "lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, @@ -1447,9 +1447,9 @@ int ehca_reg_smr(struct ehca_shca *shca, (u64)iova_start, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%li " + ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%lli " "shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x " - "e_pd=%p hca_hndl=%lx mr_hndl=%lx lkey=%x", + "e_pd=%p hca_hndl=%llx mr_hndl=%llx lkey=%x", h_ret, shca, e_origmr, e_newmr, iova_start, acl, e_pd, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, @@ -1527,7 +1527,7 @@ int ehca_reg_internal_maxmr( &e_mr->ib.ib_mr.rkey); if (ret) { ehca_err(&shca->ib_device, "reg of internal max MR failed, " - "e_mr=%p iova_start=%p size_maxmr=%lx num_kpages=%x " + "e_mr=%p iova_start=%p size_maxmr=%llx num_kpages=%x " "num_hwpages=%x", e_mr, iova_start, size_maxmr, num_kpages, num_hwpages); goto ehca_reg_internal_maxmr_exit1; @@ -1573,8 +1573,8 @@ int ehca_reg_maxmr(struct ehca_shca *shca, (u64)iova_start, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%li " - "e_origmr=%p hca_hndl=%lx mr_hndl=%lx lkey=%x", + ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%lli " + "e_origmr=%p hca_hndl=%llx mr_hndl=%llx lkey=%x", h_ret, e_origmr, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, e_origmr->ib.ib_mr.lkey); @@ -1651,28 +1651,28 @@ int ehca_mr_chk_buf_and_calc_size(struct ib_phys_buf *phys_buf_array, /* check first buffer */ if (((u64)iova_start & ~PAGE_MASK) != (pbuf->addr & ~PAGE_MASK)) { ehca_gen_err("iova_start/addr mismatch, iova_start=%p " - "pbuf->addr=%lx pbuf->size=%lx", + "pbuf->addr=%llx 
pbuf->size=%llx", iova_start, pbuf->addr, pbuf->size); return -EINVAL; } if (((pbuf->addr + pbuf->size) % PAGE_SIZE) && (num_phys_buf > 1)) { - ehca_gen_err("addr/size mismatch in 1st buf, pbuf->addr=%lx " - "pbuf->size=%lx", pbuf->addr, pbuf->size); + ehca_gen_err("addr/size mismatch in 1st buf, pbuf->addr=%llx " + "pbuf->size=%llx", pbuf->addr, pbuf->size); return -EINVAL; } for (i = 0; i < num_phys_buf; i++) { if ((i > 0) && (pbuf->addr % PAGE_SIZE)) { - ehca_gen_err("bad address, i=%x pbuf->addr=%lx " - "pbuf->size=%lx", + ehca_gen_err("bad address, i=%x pbuf->addr=%llx " + "pbuf->size=%llx", i, pbuf->addr, pbuf->size); return -EINVAL; } if (((i > 0) && /* not 1st */ (i < (num_phys_buf - 1)) && /* not last */ (pbuf->size % PAGE_SIZE)) || (pbuf->size == 0)) { - ehca_gen_err("bad size, i=%x pbuf->size=%lx", + ehca_gen_err("bad size, i=%x pbuf->size=%llx", i, pbuf->size); return -EINVAL; } @@ -1705,7 +1705,7 @@ int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, page = page_list; for (i = 0; i < list_len; i++) { if (*page % e_fmr->fmr_page_size) { - ehca_gen_err("bad page, i=%x *page=%lx page=%p fmr=%p " + ehca_gen_err("bad page, i=%x *page=%llx page=%p fmr=%p " "fmr_page_size=%x", i, *page, page, e_fmr, e_fmr->fmr_page_size); return -EINVAL; @@ -1743,9 +1743,9 @@ static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, (pginfo->next_hwpage * pginfo->hwpage_size)); if ( !(*kpage) ) { - ehca_gen_err("pgaddr=%lx " - "chunk->page_list[i]=%lx " - "i=%x next_hwpage=%lx", + ehca_gen_err("pgaddr=%llx " + "chunk->page_list[i]=%llx " + "i=%x next_hwpage=%llx", pgaddr, (u64)sg_dma_address( &chunk->page_list[i]), i, pginfo->next_hwpage); @@ -1795,11 +1795,11 @@ static int ehca_check_kpages_per_ate(struct scatterlist *page_list, for (t = start_idx; t <= end_idx; t++) { u64 pgaddr = page_to_pfn(sg_page(&page_list[t])) << PAGE_SHIFT; if (ehca_debug_level >= 3) - ehca_gen_dbg("chunk_page=%lx value=%016lx", pgaddr, + ehca_gen_dbg("chunk_page=%llx value=%016llx", pgaddr, 
*(u64 *)abs_to_virt(phys_to_abs(pgaddr))); if (pgaddr - PAGE_SIZE != *prev_pgaddr) { - ehca_gen_err("uncontiguous page found pgaddr=%lx " - "prev_pgaddr=%lx page_list_i=%x", + ehca_gen_err("uncontiguous page found pgaddr=%llx " + "prev_pgaddr=%llx page_list_i=%x", pgaddr, *prev_pgaddr, t); return -EINVAL; } @@ -1833,7 +1833,7 @@ static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, << PAGE_SHIFT ); *kpage = phys_to_abs(pgaddr); if ( !(*kpage) ) { - ehca_gen_err("pgaddr=%lx i=%x", + ehca_gen_err("pgaddr=%llx i=%x", pgaddr, i); ret = -EFAULT; return ret; @@ -1846,8 +1846,8 @@ static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, if (pginfo->hwpage_cnt) { ehca_gen_err( "invalid alignment " - "pgaddr=%lx i=%x " - "mr_pgsize=%lx", + "pgaddr=%llx i=%x " + "mr_pgsize=%llx", pgaddr, i, pginfo->hwpage_size); ret = -EFAULT; @@ -1866,8 +1866,8 @@ static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, if (ehca_debug_level >= 3) { u64 val = *(u64 *)abs_to_virt( phys_to_abs(pgaddr)); - ehca_gen_dbg("kpage=%lx chunk_page=%lx " - "value=%016lx", + ehca_gen_dbg("kpage=%llx chunk_page=%llx " + "value=%016llx", *kpage, pgaddr, val); } prev_pgaddr = pgaddr; @@ -1944,9 +1944,9 @@ static int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, if ((pginfo->kpage_cnt >= pginfo->num_kpages) || (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { ehca_gen_err("kpage_cnt >= num_kpages, " - "kpage_cnt=%lx num_kpages=%lx " - "hwpage_cnt=%lx " - "num_hwpages=%lx i=%x", + "kpage_cnt=%llx num_kpages=%llx " + "hwpage_cnt=%llx " + "num_hwpages=%llx i=%x", pginfo->kpage_cnt, pginfo->num_kpages, pginfo->hwpage_cnt, @@ -1957,8 +1957,8 @@ static int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, (pbuf->addr & ~(pginfo->hwpage_size - 1)) + (pginfo->next_hwpage * pginfo->hwpage_size)); if ( !(*kpage) && pbuf->addr ) { - ehca_gen_err("pbuf->addr=%lx pbuf->size=%lx " - "next_hwpage=%lx", pbuf->addr, + ehca_gen_err("pbuf->addr=%llx pbuf->size=%llx " + "next_hwpage=%llx", 
pbuf->addr, pbuf->size, pginfo->next_hwpage); return -EFAULT; } @@ -1996,8 +1996,8 @@ static int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, *kpage = phys_to_abs((*fmrlist & ~(pginfo->hwpage_size - 1)) + pginfo->next_hwpage * pginfo->hwpage_size); if ( !(*kpage) ) { - ehca_gen_err("*fmrlist=%lx fmrlist=%p " - "next_listelem=%lx next_hwpage=%lx", + ehca_gen_err("*fmrlist=%llx fmrlist=%p " + "next_listelem=%llx next_hwpage=%llx", *fmrlist, fmrlist, pginfo->u.fmr.next_listelem, pginfo->next_hwpage); @@ -2025,7 +2025,7 @@ static int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, ~(pginfo->hwpage_size - 1)); if (prev + pginfo->u.fmr.fmr_pgsize != p) { ehca_gen_err("uncontiguous fmr pages " - "found prev=%lx p=%lx " + "found prev=%llx p=%llx " "idx=%x", prev, p, i + j); return -EINVAL; } diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index f161cf1..00c1081 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -331,7 +331,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, if (cnt == (nr_q_pages - 1)) { /* last page! 
*/ if (h_ret != expected_hret) { ehca_err(ib_dev, "hipz_qp_register_rpage() " - "h_ret=%li", h_ret); + "h_ret=%lli", h_ret); ret = ehca2ib_return_code(h_ret); goto init_qp_queue1; } @@ -345,7 +345,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(ib_dev, "hipz_qp_register_rpage() " - "h_ret=%li", h_ret); + "h_ret=%lli", h_ret); ret = ehca2ib_return_code(h_ret); goto init_qp_queue1; } @@ -709,7 +709,7 @@ static struct ehca_qp *internal_create_qp( h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms); if (h_ret != H_SUCCESS) { - ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%li", + ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%lli", h_ret); ret = ehca2ib_return_code(h_ret); goto create_qp_exit1; @@ -1010,7 +1010,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not modify SRQ to INIT " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -1024,7 +1024,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not enable SRQ " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -1038,7 +1038,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not modify SRQ to RTR " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -1078,7 +1078,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, &bad_send_wqe_p, NULL, 2); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_h_disable_and_get_wqe() failed" - " ehca_qp=%p qp_num=%x h_ret=%li", + " ehca_qp=%p qp_num=%x h_ret=%lli", 
my_qp, qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -1134,7 +1134,7 @@ static int calc_left_cqes(u64 wqe_p, struct ipz_queue *ipz_queue, if (ipz_queue_abs_to_offset(ipz_queue, wqe_p, &q_ofs)) { ehca_gen_err("Invalid offset for calculating left cqes " - "wqe_p=%#lx wqe_v=%p\n", wqe_p, wqe_v); + "wqe_p=%#llx wqe_v=%p\n", wqe_p, wqe_v); return -EFAULT; } @@ -1168,7 +1168,7 @@ static int check_for_left_cqes(struct ehca_qp *my_qp, struct ehca_shca *shca) &send_wqe_p, &recv_wqe_p, 4); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "disable_and_get_wqe() " - "failed ehca_qp=%p qp_num=%x h_ret=%li", + "failed ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -1261,7 +1261,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, mqpcb, my_qp->galpas.kernel); if (h_ret != H_SUCCESS) { ehca_err(ibqp->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, ibqp->qp_num, h_ret); ret = ehca2ib_return_code(h_ret); goto modify_qp_exit1; @@ -1690,7 +1690,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(ibqp->device, "hipz_h_modify_qp() failed h_ret=%li " + ehca_err(ibqp->device, "hipz_h_modify_qp() failed h_ret=%lli " "ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1723,7 +1723,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, ret = ehca2ib_return_code(h_ret); ehca_err(ibqp->device, "ENABLE in context of " "RESET_2_INIT failed! 
Maybe you didn't get " - "a LID h_ret=%li ehca_qp=%p qp_num=%x", + "a LID h_ret=%lli ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1909,7 +1909,7 @@ int ehca_query_qp(struct ib_qp *qp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(qp->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, qp->qp_num, h_ret); goto query_qp_exit1; } @@ -2074,7 +2074,7 @@ int ehca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(ibsrq->device, "hipz_h_modify_qp() failed h_ret=%li " + ehca_err(ibsrq->device, "hipz_h_modify_qp() failed h_ret=%lli " "ehca_qp=%p qp_num=%x", h_ret, my_qp, my_qp->real_qp_num); } @@ -2108,7 +2108,7 @@ int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr) if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(srq->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, h_ret); goto query_srq_exit1; } @@ -2179,7 +2179,7 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { - ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%lli " "ehca_qp=%p qp_num=%x", h_ret, my_qp, qp_num); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index c711268..5a3d96f 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -822,7 +822,7 @@ static int generate_flush_cqes(struct ehca_qp *my_qp, struct ib_cq *cq, offset = qmap->next_wqe_idx * ipz_queue->qe_size; wqe = (struct ehca_wqe *)ipz_qeit_calc(ipz_queue, offset); if (!wqe) { - ehca_err(cq->device, "Invalid wqe offset=%#lx on " + 
ehca_err(cq->device, "Invalid wqe offset=%#llx on " "qp_num=%#x", offset, my_qp->real_qp_num); return nr; } diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c index 706d97a..44447aa 100644 --- a/drivers/infiniband/hw/ehca/ehca_sqp.c +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c @@ -85,7 +85,7 @@ u64 ehca_define_sqp(struct ehca_shca *shca, if (ret != H_SUCCESS) { ehca_err(&shca->ib_device, - "Can't define AQP1 for port %x. h_ret=%li", + "Can't define AQP1 for port %x. h_ret=%lli", port, ret); return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 21f7d06..f09914c 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -116,7 +116,7 @@ extern int ehca_debug_level; unsigned char *deb = (unsigned char *)(adr); \ for (x = 0; x < l; x += 16) { \ printk(KERN_INFO "EHCA_DMP:%s " format \ - " adr=%p ofs=%04x %016lx %016lx\n", \ + " adr=%p ofs=%04x %016llx %016llx\n", \ __func__, ##args, deb, x, \ *((u64 *)&deb[0]), *((u64 *)&deb[8])); \ deb += 16; \ diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index e43ed8f..3cb688d 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -114,7 +114,7 @@ static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, physical = galpas->user.fw_handle; vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); - ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical); + ehca_gen_dbg("vsize=%llx physical=%llx", vsize, physical); /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ ret = remap_4k_pfn(vma, vma->vm_start, physical >> EHCA_PAGESHIFT, vma->vm_page_prot); diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 79d95d9..d0ab0c0 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -249,7 +249,7 @@ u64 
hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, *eq_ist = (u32)outs[5]; if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resource - ret=%li ", ret); + ehca_gen_err("Not enough resource - ret=%lli ", ret); return ret; } @@ -287,7 +287,7 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, hcp_galpas_ctor(&cq->galpas, outs[5], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%li", ret); + ehca_gen_err("Not enough resources. ret=%lli", ret); return ret; } @@ -362,7 +362,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, hcp_galpas_ctor(&parms->galpas, outs[6], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%li", ret); + ehca_gen_err("Not enough resources. ret=%lli", ret); return ret; } @@ -454,7 +454,7 @@ u64 hipz_h_register_rpage_eq(const struct ipz_adapter_handle adapter_handle, const u64 count) { if (count != 1) { - ehca_gen_err("Ppage counter=%lx", count); + ehca_gen_err("Ppage counter=%llx", count); return H_PARAMETER; } return hipz_h_register_rpage(adapter_handle, @@ -489,7 +489,7 @@ u64 hipz_h_register_rpage_cq(const struct ipz_adapter_handle adapter_handle, const struct h_galpa gal) { if (count != 1) { - ehca_gen_err("Page counter=%lx", count); + ehca_gen_err("Page counter=%llx", count); return H_PARAMETER; } @@ -508,7 +508,7 @@ u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, const struct h_galpa galpa) { if (count > 1) { - ehca_gen_err("Page counter=%lx", count); + ehca_gen_err("Page counter=%llx", count); return H_PARAMETER; } @@ -557,7 +557,7 @@ u64 hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Insufficient resources ret=%li", ret); + ehca_gen_err("Insufficient resources ret=%lli", ret); return ret; } @@ -593,7 +593,7 @@ u64 hipz_h_destroy_qp(const struct ipz_adapter_handle 
adapter_handle, qp->ipz_qp_handle.handle, /* r6 */ 0, 0, 0, 0, 0, 0); if (ret == H_HARDWARE) - ehca_gen_err("HCA not operational. ret=%li", ret); + ehca_gen_err("HCA not operational. ret=%lli", ret); ret = ehca_plpar_hcall_norets(H_FREE_RESOURCE, adapter_handle.handle, /* r4 */ @@ -601,7 +601,7 @@ u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("Resource still in use. ret=%li", ret); + ehca_gen_err("Resource still in use. ret=%lli", ret); return ret; } @@ -636,7 +636,7 @@ u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, *bma_qp_nr = (u32)outs[1]; if (ret == H_ALIAS_EXIST) - ehca_gen_err("AQP1 already exists. ret=%li", ret); + ehca_gen_err("AQP1 already exists. ret=%lli", ret); return ret; } @@ -658,7 +658,7 @@ u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, 0, 0); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%li", ret); + ehca_gen_err("Not enough resources. ret=%lli", ret); return ret; } @@ -697,7 +697,7 @@ u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("H_FREE_RESOURCE failed ret=%li ", ret); + ehca_gen_err("H_FREE_RESOURCE failed ret=%lli ", ret); return ret; } @@ -719,7 +719,7 @@ u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("Resource in use. ret=%li ", ret); + ehca_gen_err("Resource in use. 
ret=%lli ", ret); return ret; } @@ -774,9 +774,9 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, if ((count > 1) && (logical_address_of_page & (EHCA_PAGESIZE-1))) { ehca_gen_err("logical_address_of_page not on a 4k boundary " - "adapter_handle=%lx mr=%p mr_handle=%lx " + "adapter_handle=%llx mr=%p mr_handle=%llx " "pagesize=%x queue_type=%x " - "logical_address_of_page=%lx count=%lx", + "logical_address_of_page=%llx count=%llx", adapter_handle.handle, mr, mr->ipz_mr_handle.handle, pagesize, queue_type, logical_address_of_page, count); -- 1.6.0.5 -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ From Shainer at Mellanox.com Tue Jan 6 18:12:39 2009 From: Shainer at Mellanox.com (Gilad Shainer) Date: Tue, 6 Jan 2009 18:12:39 -0800 Subject: [ofa-general] RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F0183F28A@mtiexch01> We need to look at this from the right angle. This is not a "feature" but rather a core component that adds support for a new adapter/NIC. This is the same as the core drivers for the other adapters that are supported already. In general we need to look not only at spec-related features, but also cover features that can benefit OFED and WinOF users (such as IPoIB connected mode or WinVerbs). Gilad. -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Ryan, Jim Sent: Tuesday, January 06, 2009 2:01 PM To: Hefty, Sean; Tziporet Koren; ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans Sean, I think that's a good point. What it suggests to me is asking when someone proposes a "non-standard" feature, what process, procedures, documentation, support, etc. if any, should be made available by the entity making the proposal?
It seems to me asking the same questions of all proposed features is fair and reasonable, and shouldn't represent an unreasonable barrier to progress. Thoughts? If this already exists, it's my ignorance and I will apologize in advance. Thanks again, Jim -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty Sent: Tuesday, January 06, 2009 1:54 PM To: 'Tziporet Koren'; ewg at lists.openfabrics.org Cc: general at lists.openfabrics.org Subject: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans >* Mellanox suggested to add IB over Eth - this is similar to iWARP but >more like IB (e.g. including UD), and can work over ConnectX. >A concern was raised by Intel (Dave Sommers) since it is not a standard >transport. >Decision: This request will be raised in the MWG, and they should >decide if OFA can support it. This is just my opinion, but in the past, OFED has included non-standard features, like extended connected mode, that are still not part of the IBTA spec. Do we know if such a feature would be accepted into the Linux kernel? I think OFED should base their decision more on the answer to that question than IBTA approval. - Sean _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From jeff at splitrockpr.com Tue Jan 6 20:05:40 2009 From: jeff at splitrockpr.com (Jeffrey Scott) Date: Tue, 06 Jan 2009 20:05:40 -0800 Subject: [ofa-general] Register Today! OFA's 5th Annual International Sonoma Workshop Message-ID: <7B2A390B191A40C1A6CA7B419DFE6782@Gaucho> How Will the Winds of Change Affect Datacenters, HPC and You? Find out by attending OFA's 5th Annual International Sonoma Workshop, March 22-25. Click here to register .
Change wasn't only a political theme in 2008; change came to enterprise datacenters, application implementations and high-performance computing (HPC) with the deployment of low-latency switches and server adapters running OpenFabrics software in Linux and Windows production environments. Even bigger changes are in the wind as 2009 begins. OpenFabrics software is already in production at over 50 percent of the world's top 100 HPC sites, including the #1 site. And now, mainstream IT organizations are deploying OpenFabrics software to boost application performance and take their datacenter operations to the next level with 10 Gigabit Ethernet and 40 Gigabit InfiniBand. This year's Sonoma Workshop will focus on the implications of this sea change for datacenter management, HPC productivity, business innovation, and your career. You should attend if you want to understand how OpenFabrics software can help: - Increase the performance of enterprise applications for financial services, engineering, manufacturing, and content delivery in web and cloud environments; - Enable IT cost reduction through green computing and virtualization; - Future-proof and stabilize your applications in the face of huge changes taking place in memory, processors, storage, interconnects, and networking; - Maximize the flexibility of your enterprise datacenter to more effectively support and enable business innovation, productivity growth and cost containment; and - Transform HPC environments by eliminating barriers to the economical deployment of petascale ecosystems. The OpenFabrics Alliance is the worldwide focal point for high-performance, connectivity-related software development. This year's Sonoma workshop will not only help attendees understand the impact that widespread adoption of OpenFabrics software will have, it will also give attendees practical knowledge about how to deploy and use the software in standard Linux and Windows environments. 
Attendees will have the opportunity to talk with vendors as well as OpenFabrics software developers, and discuss real-world requirements for applications that use technologies such as datacenter Ethernet, Fibre Channel, MPI, NAS, SAN, and sockets as well as databases and cluster file systems. The Sonoma agenda will include presentations from end users, application and hardware vendors, OS providers, developers, and much more. This year's agenda is about helping you prepare yourself and your organization for the exciting changes that are coming. Register now for the 5th Annual International Sonoma Workshop. The Early Bird rate of $495 is available through February 23. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Wed Jan 7 03:23:49 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 7 Jan 2009 03:23:49 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090107-0200 daily build status Message-ID: <20090107112349.2C990E6110E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with 
linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From tziporet at dev.mellanox.co.il Wed Jan 7 05:31:01 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 07 Jan 2009 15:31:01 +0200 Subject: [ofa-general] Re: [ewg] OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <1231298564.3466.30.camel@sarium.pathscale.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD0164F1F0@mtlexch01.mtl.com> <1231298564.3466.30.camel@sarium.pathscale.com> Message-ID: <4964AE95.70602@mellanox.co.il> Betsy Zeller wrote: > Given that we GA'd OFED 1.3 on Dec 10/08, I'm pretty uncomfortable with > scheduling our OFED 1.4 release almost a full 8 months later. I guess you meant 1.4 and 1.5 and not 1.3 and 1.4 The reason was - we wanted 7 months between the releases. Since 1.4 was pushed to Dec no new development as actually started before the new year. So I wanted more time for development before the feature freeze. > If I remember right, our goal is to have two OFED releases a year. 
> When we decided this, OFED was not that stable and had fewer components. So now the QA cycle is long and I don't see a way to shorten it :-( Maybe we can change this decision to have a release every 8 months. I will put it on the agenda for the next meeting. > How about a schedule as follows: > Feature Freeze: 3/25/09 > Alpha Release: 3/25/09 > Beta Release: 4/20/09 > RC1: 4/06/09 > RC2: 4/20/09 > RC3: 5/04/09 > RC4: 5/18/09 > RC5: 6/01/09 > RC6: 6/15/09 > Release: 6/28/09 > > Or, we could be somewhat more firm about what constitutes a feature > freeze, make that later, and go with fewer RCs. But, one way or another, > I think we really need to tighten up the schedule. > > > The suggested schedule is almost the same as we presented in SC08 when we thought that 1.4 would be in Nov. One thing we did was to decide to have kernel base 2.6.29 - meaning by March we should have it ready and we will not need to change backports all the time. Another thing that Woody suggested - assume in advance that we will have 6 RCs and allocate time to all of them (so we will not be surprised). So maybe a compromise can be to shift it all by 2 weeks, something like: Feature freeze on 4/01/09 ... Release on 7/13/09 Tziporet From amar.mudrankit at qlogic.com Wed Jan 7 06:15:08 2009 From: amar.mudrankit at qlogic.com (Amar Mudrankit) Date: Wed, 7 Jan 2009 19:45:08 +0530 Subject: [ofa-general] RESEND: net.ipv4.tcp_timestamps In-Reply-To: <99863D2ED484D449811D97A4C44C9CBDA6D3F5@EPEXCH2.qlogic.org> References: <99863D2ED484D449811D97A4C44C9CBDA6D3F5@EPEXCH2.qlogic.org> Message-ID: Any update / clarification on this? It looks like this is being reset as an mlx4_en tuning parameter (ofed_scripts/ofa_kernel.spec). On Mon, Dec 1, 2008 at 9:03 PM, John Russo wrote: > > Does anyone know why this value is being altered by OFED? > > "net.ipv4.tcp_timestamps" is being set to 0 during OFED installation.
> > Default value of this parameter is set to 1 on standard RHEL/SLES distros which OFED installation script modifies to 0. Also, when OFED is uninstalled, it does not reset these sysctl parameters to their original values. > > This parameter is specifically recommended to be turned ON for High performance network. This is a TCP option that can be used to calculate the Round Trip Measurement in a better and more accurate way than the retransmission timeout method can. Accurate value of retransmission timeout should be determined to avoid unnecessary retransmissions and hence to improve TCP performance. RFC 1323 talks about this TCP extension for High Performance. > > When the parameter net.ipv4.tcp_timestamps=1, then it adds extra 12 bytes into TCP header increasing its size. This has an obvious effect of decrease in bandwidth as we have some extra data flowing. Is this the reason why OFED turns it OFF to net.ipv4.tcp_timestamps=0. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From anuj01 at gmail.com Wed Jan 7 06:33:07 2009 From: anuj01 at gmail.com (=?UTF-8?B?4KSF4KSo4KWB4KSc?=) Date: Wed, 7 Jan 2009 20:03:07 +0530 Subject: [ofa-general] ***SPAM*** building libmthca-1.0.4.tar for libmthca-rdmav2.so Message-ID: Hi I tried to build libmthca-1.0.4 (OFED - 1.2) against libibverbs-1.1.1. [anuj at in03 libmthca-1.0.4]$ ./configure --prefix=/scratch/anuj/my_ofed/ --libdir=/scratch/anuj/my_ofed/lib64 checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes checking for gawk... gawk checking whether make sets $(MAKE)... yes checking build system type... x86_64-redhat-linux-gnu checking host system type... x86_64-redhat-linux-gnu checking for style of include used by make... GNU . . . checking for long... 
yes checking size of long... 8 checking for ibv_read_sysfs_file... yes checking for ibv_dontfork_range... yes checking for ibv_dofork_range... yes checking for ibv_register_driver... yes // it also finds ibv_register_driver checking whether ld accepts --version-script... yes configure: creating ./config.status config.status: creating Makefile config.status: creating libmthca.spec config.status: creating config.h config.status: config.h is unchanged config.status: executing depfiles commands [anuj at in03 libmthca-1.0.4]$ make && make install Build and installation are successful, but libmthca-rdmav2.so is not generated at PREFIX = /home/anuj/my_ofed/lib64/. Rather the traditional mthca.so is generated at PREFIX = /home/anuj/my_ofed/lib64/infiniband/. Should I rename mthca.so to libmthca-rdmav2.so and put it in the required path, i.e. /home/anuj/my_ofed/lib64/? Or am I doing something wrong? Thanks a lot -- Anuj Aggarwal .''`. : :Ⓐ : # apt-get install hakuna-matata `. `'` `- From cbuchibabu at gmail.com Wed Jan 7 06:54:52 2009 From: cbuchibabu at gmail.com (Buchibabu Chennupati) Date: Wed, 7 Jan 2009 20:24:52 +0530 Subject: [ofa-general] ***SPAM*** Query regarding smpdump and smpquery tools Message-ID: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> Hello, I am a newbie and playing around with infiniband-diags code. I tried implementing smpdump similar to the way smpquery is implemented but without any success. Can smpdump be implemented the same way as smpquery is implemented? Also I would like to know the differences between the current implementations of smpquery and smpdump. Thanks in advance Buchibabu. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From dledford at redhat.com Wed Jan 7 07:07:23 2009 From: dledford at redhat.com (Doug Ledford) Date: Wed, 07 Jan 2009 10:07:23 -0500 Subject: [ofa-general] RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <3F6F638B8D880340AB536D29CD4C1E192F7F0BDF@orsmsx501.amr.corp.intel.com> References: <3F6F638B8D880340AB536D29CD4C1E192F7F0BDF@orsmsx501.amr.corp.intel.com> Message-ID: <1231340843.32405.872.camel@firewall.xsintricity.com> On Tue, 2009-01-06 at 14:00 -0800, Ryan, Jim wrote: > Sean, I think that's a good point. What it suggests to me is asking when someone proposes a "non-standard" feature, what process, procedures, documentation, support, etc. if any, should be made available by the entity making the proposal? > > It seems to me asking the same questions of all proposed features is fair and reasonable, and shouldn't represent an unreasonable barrier to progress. > > Thoughts? If this already exists, it's my ignorance and I will apologize in advance > > Thanks again, Jim > > -----Original Message----- > From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty > Sent: Tuesday, January 06, 2009 1:54 PM > To: 'Tziporet Koren'; ewg at lists.openfabrics.org > Cc: general at lists.openfabrics.org > Subject: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans > > >* Mellanox suggested to add IB over Eth - this is similar to iWARP but > >more like IB (e.g. including UD), and can work over ConnectX. > >A concern was raised by Intel (Dave Sommers) since it is not a standard > >transport. > >Decision: This request will be raised in the MWG, and they should decide > >if OFA can support it. > > Just is just my opinion, but in the past, OFED has included non-standard > features, like extended connected mode, that are still not part of the IBTA > spec. > > Do we know if such a feature would be accepted into the Linux kernel? 
I think > OFED should base their decision more on the answer to that question than IBTA > approval. FWIW, this is the question I ask before accepting OFED kernel patches into our kernel. With the exception of SDP (which was intentionally allowed) and qlgc_vnic (which was unintentionally allowed), if it's not either in the upstream linux kernel, or slated for inclusion, then I don't include it in our kernel. Hence why xrc and rds support still isn't in our products. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From sashak at voltaire.com Wed Jan 7 07:12:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 7 Jan 2009 17:12:41 +0200 Subject: [ofa-general] ***SPAM*** Query regarding smpdump and smpquery tools In-Reply-To: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> References: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> Message-ID: <20090107151241.GD11759@sashak.voltaire.com> Hi, On 20:24 Wed 07 Jan , Buchibabu Chennupati wrote: > > I am a newbie and playing around with infiniband-diags code. > I tried implementing smpdump similar to the way smpquery is implemented but > without any success. > Can smpdump be implemented the same way as smpquery is implemented? What do you mean by "the same way"? Sasha > Also I would like to know the differences between the current > implementations of smpquery and smpdump. > > > Thanks in advance > Buchibabu. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dledford at redhat.com Wed Jan 7 07:11:05 2009 From: dledford at redhat.com (Doug Ledford) Date: Wed, 07 Jan 2009 10:11:05 -0500 Subject: [ofa-general] RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F0183F28A@mtiexch01> References: <9FA59C95FFCBB34EA5E42C1A8573784F0183F28A@mtiexch01> Message-ID: <1231341065.32405.876.camel@firewall.xsintricity.com> On Tue, 2009-01-06 at 18:12 -0800, Gilad Shainer wrote: > We need to look at this from the right angle. This is not a "feature" > but rather a core component that adds support for a new adapter/NIC. > This is the same as the core drivers for the other adapters that are > supported already. In all fairness, the comment below was to implement IB over eth. Nothing today does that. iWARP is not IB and has unique requirements. Running full IB over eth is different. Saying it's not a new feature is like saying that iSCSI over TCP wasn't a new feature when it first came out. Sure, we had SCSI and we had TCP, but we didn't have SCSI over TCP, so adding it *was* a new feature. > In general we need to look not only at spec-related features, but also > to cover features that can benefit OFED and WinOF users (such as IPoIB > connected mode or WinVerbs). I'm not so much concerned over IBTA standards. I'm concerned over what makes it into the upstream Linux kernels. How much OFED's kernel differs from the upstream kernel directly impacts supportability of the OFED stack in our products. The more it diverges, the higher the support load. We actively control that divergence as a result. > Gilad.
> > > -----Original Message----- > From: ewg-bounces at lists.openfabrics.org > [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Ryan, Jim > Sent: Tuesday, January 06, 2009 2:01 PM > To: Hefty, Sean; Tziporet Koren; ewg at lists.openfabrics.org > Cc: general at lists.openfabrics.org > Subject: RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans > > Sean, I think that's a good point. What it suggests to me is asking when > someone proposes a "non-standard" feature, what process, procedures, > documentation, support, etc. if any, should be made available by the > entity making the proposal? > > It seems to me asking the same questions of all proposed features is > fair and reasonable, and shouldn't represent an unreasonable barrier to > progress. > > Thoughts? If this already exists, it's my ignorance and I will apologize > in advance > > Thanks again, Jim > > -----Original Message----- > From: ewg-bounces at lists.openfabrics.org > [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty > Sent: Tuesday, January 06, 2009 1:54 PM > To: 'Tziporet Koren'; ewg at lists.openfabrics.org > Cc: general at lists.openfabrics.org > Subject: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans > > >* Mellanox suggested to add IB over Eth - this is similar to iWARP but > >more like IB (e.g. including UD), and can work over ConnectX. > >A concern was raised by Intel (Dave Sommers) since it is not a standard > > >transport. > >Decision: This request will be raised in the MWG, and they should > >decide if OFA can support it. > > Just is just my opinion, but in the past, OFED has included non-standard > features, like extended connected mode, that are still not part of the > IBTA spec. > > Do we know if such a feature would be accepted into the Linux kernel? I > think OFED should base their decision more on the answer to that > question than IBTA approval. 
> > - Sean > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From cbuchibabu at gmail.com Wed Jan 7 07:25:33 2009 From: cbuchibabu at gmail.com (Buchibabu Chennupati) Date: Wed, 7 Jan 2009 20:55:33 +0530 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Query regarding smpdump and smpquery tools In-Reply-To: <20090107151241.GD11759@sashak.voltaire.com> References: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> <20090107151241.GD11759@sashak.voltaire.com> Message-ID: <63b790b10901070725x3d9bf961q5642bf2c4b9b2bb9@mail.gmail.com> The current smpquery uses madrpc_init and calls smpquery with the necessary attribute. Basically I want to use madrpc_init in the smpdump implementation as well. Regards, Buchibabu. On Wed, Jan 7, 2009 at 8:42 PM, Sasha Khapyorsky wrote: > Hi, > > On 20:24 Wed 07 Jan , Buchibabu Chennupati wrote: > > > > I am a newbie and playing around with infiniband-diags code. > > I tried implementing smpdump similar to the way smpquery is implemented > but > > without any success. > > Can smpdump be implemented the same way as smpquery is implemented? > > What do you mean by "the same way"?
> > Sasha > > > Also I would like to know the differences between the current > > implementations of smpquery and smpdump. > > > > > > Thanks in advance > > Buchibabu. > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Wed Jan 7 08:03:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 7 Jan 2009 18:03:45 +0200 Subject: [ofa-general] ***SPAM*** Query regarding smpdump and smpquery tools In-Reply-To: <63b790b10901070725x3d9bf961q5642bf2c4b9b2bb9@mail.gmail.com> References: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> <20090107151241.GD11759@sashak.voltaire.com> <63b790b10901070725x3d9bf961q5642bf2c4b9b2bb9@mail.gmail.com> Message-ID: <20090107160345.GE11759@sashak.voltaire.com> On 20:55 Wed 07 Jan , Buchibabu Chennupati wrote: > Current smpquery uses madrpc_init and calls smpquery with necessary > attribute. It just adds another layer, finally madrpc* uses umad*() calls. > Basically I want use madrpc_init in the smpdump implementation as well. What are you trying to achieve this way (technically it should be possible)? 
Sasha From sashak at voltaire.com Wed Jan 7 08:16:44 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 7 Jan 2009 18:16:44 +0200 Subject: [ofa-general] Re: [PATCH] libibumad: Add sysfs_*() functions to libibumad.map In-Reply-To: <20081229113805.GA25616@mellanox.co.il> References: <20081229113805.GA25616@mellanox.co.il> Message-ID: <20090107161644.GF11759@sashak.voltaire.com> Hi Vladimir, On 13:38 Mon 29 Dec , Vladimir Sokolovsky wrote: > Signed-off-by: Vladimir Sokolovsky > --- > libibumad/src/libibumad.map | 5 +++++ > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/libibumad/src/libibumad.map b/libibumad/src/libibumad.map > index 0154b7f..ea8999e 100644 > --- a/libibumad/src/libibumad.map > +++ b/libibumad/src/libibumad.map > @@ -30,5 +30,10 @@ IBUMAD_1.0 { > umad_debug; > umad_addr_dump; > umad_dump; > + sys_read_gid; > + sys_read_guid; > + sys_read_string; > + sys_read_uint; > + sys_read_uint64; > local: *; > }; I don't think we should expose those functions in libibumad (btw there are no prototypes for them in umad.h). It would be better to reimplement the related stuff in srptools - the simplest workaround could be just copying the needed sys_*() functions, but it would be better to use libibverbs calls (as Sean suggested) and to drop the libibcommon dependency (note that I'm planning to remove this library completely soon, the patch was on the list already). Sasha From sashak at voltaire.com Wed Jan 7 08:25:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 7 Jan 2009 18:25:14 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active In-Reply-To: <495C8576.9060004@dev.mellanox.co.il> References: <495C8576.9060004@dev.mellanox.co.il> Message-ID: <20090107162507.GG11759@sashak.voltaire.com> On 10:57 Thu 01 Jan , Yevgeny Kliteynik wrote: > > When switch is coming up after reset, port 0 always reports > logical state ACTIVE.
> OpenSM shouldn't clear sw->need_update flag because of port 0. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. BTW is it legal from IBA point of view for switch to setup logical state of port 0 without SM intervention after reset? Sasha > --- > opensm/opensm/osm_port_info_rcv.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c > index 8763b87..02ad586 100644 > --- a/opensm/opensm/osm_port_info_rcv.c > +++ b/opensm/opensm/osm_port_info_rcv.c > @@ -317,7 +317,7 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, > } > > if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw && > - p_node->sw->need_update == 1) > + p_node->sw->need_update == 1 && port_num != 0) > p_node->sw->need_update = 0; > > if (p_physp->need_update) > -- > 1.5.1.4 > From sashak at voltaire.com Wed Jan 7 08:35:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 7 Jan 2009 18:35:21 +0200 Subject: [ofa-general] Re: [PATCH] opensm: Add new partition keyword for all switches and hca. In-Reply-To: <495CE3F8.9080506@gmail.com> References: <495CE3F8.9080506@gmail.com> Message-ID: <20090107163521.GH11759@sashak.voltaire.com> Hi Eli, On 17:40 Thu 01 Jan , Eli Dorfman (Voltaire) wrote: > Add new partition keyword for all switches and hca. > To allow firmware upgrade within managed switches we > want all switch port 0 to have full membership. > 'ALL_SWITCH' means all switch end ports in the subnet > 'ALL_CA' means all CA end ports in the subnet. And then we likely will want to extend this set to have all ib node types by adding ALL_ROUTER(S). Also I think better keywords would be ALL_SWITCHES and ALL_CAS - parser cares about abbreviations, so the shorter keyword version will work too - for example 'ALL_CA' will be interpreted as 'ALL_CAS'. And of course related addition to OpenSM man page and doc/partition-config.txt will be useful too. 
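The remark about abbreviations can be sketched as follows. This is a minimal illustration only, assuming the parser keeps comparing with `strncmp(name, KEYWORD, strlen(name))` as in the quoted patch and that the longer spellings ALL_SWITCHES and ALL_CAS are adopted. The helpers `keyword_matches()` and `classify()` are hypothetical names, not OpenSM functions, and the empty-name guard is an assumption that the quoted code does not have.

```c
#include <string.h>

/*
 * Hypothetical helper (not an OpenSM function): returns 1 when `name` is a
 * non-empty prefix of `keyword`, mirroring the
 * !strncmp(name, KEYWORD, strlen(name)) test in the quoted patch.
 * The len > 0 guard is an assumption added for this sketch.
 */
static int keyword_matches(const char *name, const char *keyword)
{
    size_t len = strlen(name);

    return len > 0 && strncmp(name, keyword, len) == 0;
}

/*
 * Hypothetical classifier: keywords are tried in the same order as in the
 * patch, so any prefix of "ALL" is claimed by the first branch.
 * Anything that matches no keyword is treated as a port GUID.
 */
static const char *classify(const char *name)
{
    if (keyword_matches(name, "ALL"))
        return "ALL";
    if (keyword_matches(name, "ALL_SWITCHES"))
        return "ALL_SWITCHES";
    if (keyword_matches(name, "ALL_CAS"))
        return "ALL_CAS";
    if (keyword_matches(name, "SELF"))
        return "SELF";
    return "GUID";
}
```

With this ordering, `classify("ALL_CA")` and `classify("ALL_CAS")` both select the ALL_CAS branch, and `classify("ALL_SWITCH")` selects ALL_SWITCHES, so existing configuration files written with the shorter spellings would keep working under these assumptions.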
Sasha > New default partition configuration will be: > "Default=0x7fff,ipoib:ALL_CA, ALL_SWITCH=full, SELF=full;" > > Signed-off-by: Eli Dorfman > --- > opensm/opensm/osm_prtn.c | 15 +++++++++------ > opensm/opensm/osm_prtn_config.c | 10 ++++++++-- > 2 files changed, 17 insertions(+), 8 deletions(-) > > diff --git a/opensm/opensm/osm_prtn.c b/opensm/opensm/osm_prtn.c > index be51410..8b9301e 100644 > --- a/opensm/opensm/osm_prtn.c > +++ b/opensm/opensm/osm_prtn.c > @@ -135,7 +135,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t * p_log, osm_subn_t * p_subn, > } > > ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, > - osm_prtn_t * p, boolean_t full) > + osm_prtn_t * p, uint8_t type, boolean_t full) > { > cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; > cl_map_item_t *p_item; > @@ -146,10 +146,13 @@ ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, > while (p_item != cl_qmap_end(p_port_tbl)) { > p_port = (osm_port_t *) p_item; > p_item = cl_qmap_next(p_item); > - status = osm_prtn_add_port(p_log, p_subn, p, > - osm_port_get_guid(p_port), full); > - if (status != IB_SUCCESS) > - goto _err; > + if (type == 0xff || > + (osm_node_get_type(p_port->p_node) == type)) { > + status = osm_prtn_add_port(p_log, p_subn, p, > + osm_port_get_guid(p_port), full); > + if (status != IB_SUCCESS) > + goto _err; > + } > } > > _err: > @@ -325,7 +328,7 @@ static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, > IB_DEFAULT_PARTIAL_PKEY); > if (!p) > goto _err; > - status = osm_prtn_add_all(p_log, p_subn, p, no_config); > + status = osm_prtn_add_all(p_log, p_subn, p, 0xff, no_config); > if (status != IB_SUCCESS) > goto _err; > cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); > diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c > index 9511608..37f2bd6 100644 > --- a/opensm/opensm/osm_prtn_config.c > +++ b/opensm/opensm/osm_prtn_config.c > @@ -64,7 +64,7 @@ extern osm_prtn_t 
*osm_prtn_make_new(osm_log_t * p_log, osm_subn_t * p_subn, > const char *name, uint16_t pkey); > extern ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, > osm_subn_t * p_subn, > - osm_prtn_t * p, boolean_t full); > + osm_prtn_t * p, uint8_t type, boolean_t full); > extern ib_api_status_t osm_prtn_add_port(osm_log_t * p_log, > osm_subn_t * p_subn, osm_prtn_t * p, > ib_net64_t guid, boolean_t full); > @@ -212,7 +212,13 @@ static int partition_add_port(unsigned lineno, struct part_conf *conf, > > if (!strncmp(name, "ALL", strlen(name))) { > return osm_prtn_add_all(conf->p_log, conf->p_subn, p, > - full) == IB_SUCCESS ? 0 : -1; > + 0xff, full) == IB_SUCCESS ? 0 : -1; > + } else if (!strncmp(name, "ALL_SWITCH", strlen(name))) { > + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, > + IB_NODE_TYPE_SWITCH, full) == IB_SUCCESS ? 0 : -1; > + } else if (!strncmp(name, "ALL_CA", strlen(name))) { > + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, > + IB_NODE_TYPE_CA, full) == IB_SUCCESS ? 0 : -1; > } else if (!strncmp(name, "SELF", strlen(name))) { > guid = cl_ntoh64(conf->p_subn->sm_port_guid); > } else { > -- > 1.5.5 > From sashak at voltaire.com Wed Jan 7 08:43:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 7 Jan 2009 18:43:20 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_prtn.c set switch end ports to full member in default partition configuration In-Reply-To: <495CEE80.1090006@gmail.com> References: <495CEE80.1090006@gmail.com> Message-ID: <20090107164320.GI11759@sashak.voltaire.com> On 18:25 Thu 01 Jan , Eli Dorfman (Voltaire) wrote: > set switch end ports to full member in default partition configuration. Regardless to partition config file existence? Why? This changes OpenSM PM default behavior as described in the man page. 
Sasha > > Signed-off-by: Eli Dorfman > --- > opensm/opensm/osm_prtn.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/opensm/opensm/osm_prtn.c b/opensm/opensm/osm_prtn.c > index 8b9301e..21c7add 100644 > --- a/opensm/opensm/osm_prtn.c > +++ b/opensm/opensm/osm_prtn.c > @@ -331,6 +331,9 @@ static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, > status = osm_prtn_add_all(p_log, p_subn, p, 0xff, no_config); > if (status != IB_SUCCESS) > goto _err; > + status = osm_prtn_add_all(p_log, p_subn, p, IB_NODE_TYPE_SWITCH, TRUE); > + if (status != IB_SUCCESS) > + goto _err; > cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); > status = > osm_prtn_add_port(p_log, p_subn, p, p_subn->sm_port_guid, TRUE); > -- > 1.5.5 > From robert.j.woodruff at intel.com Wed Jan 7 09:35:39 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 7 Jan 2009 09:35:39 -0800 Subject: [ofa-general] RE: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <1231341065.32405.876.camel@firewall.xsintricity.com> References: <9FA59C95FFCBB34EA5E42C1A8573784F0183F28A@mtiexch01> <1231341065.32405.876.camel@firewall.xsintricity.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F417EE2676@orsmsx507.amr.corp.intel.com> Doug wrote, >I'm not so much concerned over IBTA standards. I'm concerned over what >makes it into the upstream linux kernels. How much OFED's kernel >differs from the upstream kernel directly impacts supportability of the >OFED stack in our products. The more it diverges, the higher the >support load. We actively control that divergence as a result. In general, we discussed and decided at the last developer's workshop in Sonoma to try to make sure that any new features that were going into OFED be first accepted for inclusion in the upstream kernel, or at least queued in Roland's tree for upstream. 
I think we did a pretty good job in OFED 1.4 of adhering to that process, or at least we made significant progress towards that goal. We did this specifically to try to prevent major divergence between the upstream kernel and the OFED kernel. So for a major new feature like IBoE, I think it makes sense to first discuss the patches on ofa-general and perhaps even a RFC on kernel.org before we include it into an OFED release. my 2 cents, woody From weiny2 at llnl.gov Wed Jan 7 10:15:45 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 7 Jan 2009 10:15:45 -0800 Subject: [ofa-general] Re: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F417EE2676@orsmsx507.amr.corp.intel.com> References: <9FA59C95FFCBB34EA5E42C1A8573784F0183F28A@mtiexch01> <1231341065.32405.876.camel@firewall.xsintricity.com> <382A478CAD40FA4FB46605CF81FE39F417EE2676@orsmsx507.amr.corp.intel.com> Message-ID: <20090107101545.210e7f4c.weiny2@llnl.gov> On Wed, 7 Jan 2009 09:35:39 -0800 "Woodruff, Robert J" wrote: > Doug wrote, > > >I'm not so much concerned over IBTA standards. I'm concerned over what > >makes it into the upstream linux kernels. How much OFED's kernel > >differs from the upstream kernel directly impacts supportability of the > >OFED stack in our products. The more it diverges, the higher the > >support load. We actively control that divergence as a result. > > In general, we discussed and decided at the last developer's workshop > in Sonoma to try to make sure that any new features that were going > into OFED be first accepted for inclusion in the upstream kernel, or > at least queued in Roland's tree for upstream. > I think we did a pretty good job in OFED 1.4 of adhering to that > process, or at least we made significant progress towards that goal. > > We did this specifically to try to prevent major divergence between the > upstream kernel and the OFED kernel. 
So for a major new feature like > IBoE, I think it makes sense to first discuss the patches on ofa-general > and perhaps even a RFC on kernel.org before we include it into an OFED > release. > > my 2 cents, > > woody I agree. OFED should be downstream of kernel.org for as much as possible. New features should be introduced there first. Ira From rdreier at cisco.com Wed Jan 7 11:24:50 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 11:24:50 -0800 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: spin_lock_irqsave takes an unsigned long In-Reply-To: <20081231141257.9bafac41.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Wed, 31 Dec 2008 14:12:57 +1100") References: <20081231141257.9bafac41.sfr@canb.auug.org.au> Message-ID: thanks, applied. From rdreier at cisco.com Wed Jan 7 11:27:52 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 11:27:52 -0800 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: <49469E92.3060001@Voltaire.COM> (Yossi Etigin's message of "Mon, 15 Dec 2008 20:14:42 +0200") References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> Message-ID: > also > initiallize neigh->dgid.raw to have value to compare with. I don't see this anywhere in the patch you sent? 
Here's the whole thing: (btw only one "L" in "initialize") > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-12-15 19:53:16.000000000 +0200 > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-12-15 19:53:37.000000000 +0200 > @@ -687,26 +687,26 @@ static int ipoib_start_xmit(struct sk_bu > > neigh = *to_ipoib_neigh(skb->dst->neighbour); > > - if (neigh->ah) > - if (unlikely((memcmp(&neigh->dgid.raw, > - skb->dst->neighbour->ha + 4, > - sizeof(union ib_gid))) || > - (neigh->dev != dev))) { > - spin_lock_irqsave(&priv->lock, flags); > - /* > - * It's safe to call ipoib_put_ah() inside > - * priv->lock here, because we know that > - * path->ah will always hold one more reference, > - * so ipoib_put_ah() will never do more than > - * decrement the ref count. > - */ > + if (unlikely((memcmp(&neigh->dgid.raw, > + skb->dst->neighbour->ha + 4, > + sizeof(union ib_gid))) || > + (neigh->dev != dev))) { > + spin_lock_irqsave(&priv->lock, flags); > + /* > + * It's safe to call ipoib_put_ah() inside > + * priv->lock here, because we know that > + * path->ah will always hold one more reference, > + * so ipoib_put_ah() will never do more than > + * decrement the ref count. 
> + */ > + if (neigh->ah) > ipoib_put_ah(neigh->ah); > - list_del(&neigh->list); > - ipoib_neigh_free(dev, neigh); > - spin_unlock_irqrestore(&priv->lock, flags); > - ipoib_path_lookup(skb, dev); > - return NETDEV_TX_OK; > - } > + list_del(&neigh->list); > + ipoib_neigh_free(dev, neigh); > + spin_unlock_irqrestore(&priv->lock, flags); > + ipoib_path_lookup(skb, dev); > + return NETDEV_TX_OK; > + } > > if (ipoib_cm_get(neigh)) { > if (ipoib_cm_up(neigh)) { From rdreier at cisco.com Wed Jan 7 11:32:13 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 11:32:13 -0800 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: use consistent type In-Reply-To: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Wed, 31 Dec 2008 14:14:53 +1100") References: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> Message-ID: If we're going to clean this code up, does it make sense to take it further? More precisely, your patch does: @@ -226,7 +226,7 @@ u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, u32 *eq_ist) { u64 ret; - u64 outs[PLPAR_HCALL9_BUFSIZE]; + unsigned long outs[PLPAR_HCALL9_BUFSIZE]; u64 allocate_controls; but every parameter of ehca_plpar_hcall9() is unsigned long, and the return value is a signed long. So should we change ret to long and all these other declarations to unsigned long while we're touching the code here? - R. From yosefe at Voltaire.COM Wed Jan 7 11:42:22 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Wed, 07 Jan 2009 21:42:22 +0200 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> Message-ID: <4965059E.1010505@Voltaire.COM> Roland Dreier wrote: > > also > > initialize neigh->dgid.raw to have value to compare with. > > I don't see this anywhere in the patch you sent? 
Here's the whole thing: > (btw only one "L" in "initialize") > That was removed from the patch because Moni Shoua found it had increased the traffic renewal time in case of SM failover. I forgot to remove it from the changelog as well. -- Fix bonding failover in the case both peers have failover and gratuitous arp is lost. In that case, ipoib sender side will create ipoib_neigh and issue a path request with the old gid first. When skb->dst->neighbour->ha changes due to arp refresh, ipoib_neigh will not be added to the path->list of the path of the new mgid, because ipoib_neigh already exists. It will not have an ah either, because of sender-side failover. Therefore, it will not get an ah when the path is resolved. The solution here is to compare gids even if neigh->ah is invalid. Signed-off-by: Moni Shoua Signed-off-by: Yossi Etigin --- Fix bugzilla 1286. drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 +++++++++++++++--------------- 1 file changed, 19 insertions(+), 19 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-12-15 19:53:16.000000000 +0200 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-12-15 19:53:37.000000000 +0200 @@ -687,26 +687,26 @@ static int ipoib_start_xmit(struct sk_bu neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (neigh->ah) - if (unlikely((memcmp(&neigh->dgid.raw, - skb->dst->neighbour->ha + 4, - sizeof(union ib_gid))) || - (neigh->dev != dev))) { - spin_lock_irqsave(&priv->lock, flags); - /* - * It's safe to call ipoib_put_ah() inside - * priv->lock here, because we know that - * path->ah will always hold one more reference, - * so ipoib_put_ah() will never do more than - * decrement the ref count. 
- */ + if (unlikely((memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid))) || + (neigh->dev != dev))) { + spin_lock_irqsave(&priv->lock, flags); + /* + * It's safe to call ipoib_put_ah() inside + * priv->lock here, because we know that + * path->ah will always hold one more reference, + * so ipoib_put_ah() will never do more than + * decrement the ref count. + */ + if (neigh->ah) ipoib_put_ah(neigh->ah); - list_del(&neigh->list); - ipoib_neigh_free(dev, neigh); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_path_lookup(skb, dev); - return NETDEV_TX_OK; - } + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_path_lookup(skb, dev); + return NETDEV_TX_OK; + } if (ipoib_cm_get(neigh)) { if (ipoib_cm_up(neigh)) { -- --Yossi From rdreier at cisco.com Wed Jan 7 11:44:19 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 11:44:19 -0800 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: <4965059E.1010505@Voltaire.COM> (Yossi Etigin's message of "Wed, 07 Jan 2009 21:42:22 +0200") References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> <4965059E.1010505@Voltaire.COM> Message-ID: > > also > > initialize neigh->dgid.raw to have value to compare with. > That was removed from the patch because Moni Shoua found it had > increased the traffic renewal time in case of SM failover. I forgot to > remove it from the > changelog as well. So does that mean the patch compares against a possibly uninitialized gid value? Is that always safe? - R. 
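For anyone following the thread, the behavioural change in the patch can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: `model_neigh`, `model_mismatch`, `stale_before_patch` and `stale_after_patch` are made-up names, not driver symbols. The only details taken from the patch are the 16-byte GID compare against `ha + 4` and the `neigh->ah` / `neigh->dev` tests.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_GID_LEN    16  /* sizeof(union ib_gid) in the patch */
#define MODEL_GID_OFFSET 4   /* the patch compares against ha + 4 */

/* Hypothetical stand-in for struct ipoib_neigh; not the driver's type. */
struct model_neigh {
	unsigned char dgid[MODEL_GID_LEN];
	const void *ah;   /* NULL until a path has been resolved */
	const void *dev;
};

/* Non-zero if the cached GID or device no longer matches the neighbour. */
static int model_mismatch(const struct model_neigh *n,
			  const unsigned char *ha, const void *dev)
{
	return memcmp(n->dgid, ha + MODEL_GID_OFFSET, MODEL_GID_LEN) != 0 ||
	       n->dev != dev;
}

/* Old behaviour: the compare only runs once neigh->ah is set. */
static int stale_before_patch(const struct model_neigh *n,
			      const unsigned char *ha, const void *dev)
{
	return n->ah != NULL && model_mismatch(n, ha, dev);
}

/* Patched behaviour: the compare runs whether or not neigh->ah is set. */
static int stale_after_patch(const struct model_neigh *n,
			     const unsigned char *ha, const void *dev)
{
	return model_mismatch(n, ha, dev);
}
```

With `ah == NULL` and a changed hardware address, the old check reports nothing while the new one triggers the neighbour refresh, which is the case the changelog describes. The model also makes the concern above visible: if `dgid` was never filled in, the result depends on whatever bytes happen to be there.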
From yosefe at Voltaire.COM Wed Jan 7 12:16:33 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Wed, 07 Jan 2009 22:16:33 +0200 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> <4965059E.1010505@Voltaire.COM> Message-ID: <49650DA1.3020407@Voltaire.COM> I think it is safe in this case. The only interesting case is if the correct value just randomly appears there, before path record query completed successfully. This means that the test can "pass" when it shouldn't only when neigh->ah is NULL. This reverts to the situation before the patch, when neigh->ah != NULL was also needed to perform the "neighbour refresh" stuff. On the other hand, the patch intends to fix only a situation when neigh->dgid is already initialized by a successful path query. Roland Dreier wrote: > > > also > > > initialize neigh->dgid.raw to have value to compare with. > > > That was removed from the patch because Moni Shoua found it had > > increased the traffic renewal time in case of SM failover. I forgot to > > remove it from the > > changelog as well. > > So does that mean the patch compares against a possibly uninitialized > gid value? Is that always safe? > > - R. -- --Yossi From yosefe at Voltaire.COM Wed Jan 7 12:25:38 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Wed, 07 Jan 2009 22:25:38 +0200 Subject: [ofa-general] Re: [PATCH v3] ipoib: do not join broadcast group if interface is brought down In-Reply-To: References: <495B2C60.6020008@Voltaire.COM> Message-ID: <49650FC2.4070509@Voltaire.COM> Actually, priv->broadcast was never checked for being non-NULL in mcast_join_task before this patch. The only change in the patch is that priv->broadcast might stay NULL if it was NULL when the function started. So, if anyone could set priv->broadcast to NULL while join_task is running, wouldn't it happen without the patch as well? 
Roland Dreier wrote: > So what protects priv->broadcast? It seems that the only lock taken > when setting broadcast to NULL is priv->lock. But eg here: > > - priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > + if (priv->broadcast) > + priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > > what prevents broadcast from becoming NULL right after it's tested? > > Also > > + spin_lock_irq(&priv->lock); > + if (priv->broadcast && > + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) > ipoib_mcast_join(dev, priv->broadcast, 0); > > doesn't ipoib_mcast_join() do GFP_KERNEL stuff, which would be a problem > inside a spinlock? (Have you tested this with lockdep turned on?) > > - R. -- --Yossi From dorfman.eli at gmail.com Wed Jan 7 12:30:22 2009 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Wed, 7 Jan 2009 22:30:22 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <49630C47.4000302@gmail.com> Message-ID: <694d48600901071230x31ca8bb0l55a8245f092d633c@mail.gmail.com> On Tue, Jan 6, 2009 at 5:25 PM, Hal Rosenstock wrote: > On Tue, Jan 6, 2009 at 2:46 AM, Eli Dorfman (Voltaire) > wrote: >>>> reset is not supported by the firmware (at the moment). >>> >>> What is the firmware response to a reset ? >> >> i didn't try this since mellanox say they don't support this. > > Which Mellanox device (and fw version) ? > I am working with ConnectX HCA and latest fw version 2.6.0. The IS4 switch chip does not support portXmitWait reset. 
From rdreier at cisco.com Wed Jan 7 13:25:39 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 13:25:39 -0800 Subject: [ofa-general] Re: [PATCH v3] ipoib: do not join broadcast group if interface is brought down In-Reply-To: <49650FC2.4070509@Voltaire.COM> (Yossi Etigin's message of "Wed, 07 Jan 2009 22:25:38 +0200") References: <495B2C60.6020008@Voltaire.COM> <49650FC2.4070509@Voltaire.COM> Message-ID: > Roland Dreier wrote: > > So what protects priv->broadcast? It seems that the only lock taken > > when setting broadcast to NULL is priv->lock. But eg here: > > > > - priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > > + if (priv->broadcast) > > + priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > > > > what prevents broadcast from becoming NULL right after it's tested? > > > > Also > > > > + spin_lock_irq(&priv->lock); > > + if (priv->broadcast && > > + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > > if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) > > ipoib_mcast_join(dev, priv->broadcast, 0); > > > > doesn't ipoib_mcast_join() do GFP_KERNEL stuff, which would be a problem > > inside a spinlock? (Have you tested this with lockdep turned on?) > Actually, priv->broadcast was never checked for being non-NULL > in mcast_join_task before this patch. > The only change in the patch is that priv->broadcast might stay NULL > if it was NULL when the function started. So, if anyone could set > priv->broadcast to NULL while join_task is running, wouldn't it happen > without the patch as well? OK, so the race exists in the current code, but no one has hit it enough to track it down. Maybe we can fix it later. But you ignored my second question about lock nesting problems? - R. 
From rdreier at cisco.com Wed Jan 7 13:45:56 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 13:45:56 -0800 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: <49469E92.3060001@Voltaire.COM> (Yossi Etigin's message of "Mon, 15 Dec 2008 20:14:42 +0200") References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> Message-ID: So I'm finally understanding this patch. And I finally see that it is adding a 16-byte memcpy to the data path for every packet we send. Is the overhead of this really negligible? Can we think of a better way to handle this rare failure (double failover that causes an ARP to be lost) in a way that doesn't penalize the common datapath? - R. From rdreier at cisco.com Wed Jan 7 13:48:17 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 13:48:17 -0800 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: (Roland Dreier's message of "Wed, 07 Jan 2009 13:45:56 -0800") References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> Message-ID: > So I'm finally understanding this patch. And I finally see that it is > adding a 16-byte memcpy to the data path for every packet we send. Is > the overhead of this really negligible? Can we think of a better way to > handle this rare failure (double failover that causes an ARP to be lost) > in a way that doesn't penalize the common datapath? Never mind, I see we do the memcmp now also. And I remember that I hated adding it originally. So can anyone think of a way to avoid it in general? (But it's not a blocker for this patch) - R. 
From rdreier at cisco.com Wed Jan 7 13:49:36 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 07 Jan 2009 13:49:36 -0800 Subject: [ofa-general] [PATCH 2/4] ipoib: fix loss of connectivity after bonding failover on both sides In-Reply-To: <49469E92.3060001@Voltaire.COM> (Yossi Etigin's message of "Mon, 15 Dec 2008 20:14:42 +0200") References: <49469C1E.8010307@Voltaire.COM> <49469E92.3060001@Voltaire.COM> Message-ID: ERROR: code indent should use tabs where possible #108: FILE: drivers/infiniband/ulp/ipoib/ipoib_main.c:691: +^I^I skb->dst->neighbour->ha + 4,$ ERROR: code indent should use tabs where possible #109: FILE: drivers/infiniband/ulp/ipoib/ipoib_main.c:692: +^I^I sizeof(union ib_gid))) ||$ ERROR: code indent should use tabs where possible #110: FILE: drivers/infiniband/ulp/ipoib/ipoib_main.c:693: +^I^I (neigh->dev != dev))) {$ total: 3 errors, 0 warnings, 45 lines checked please fix this, and resend with the fixed changelog and I will apply it. From vladsk at gmail.com Wed Jan 7 13:51:43 2009 From: vladsk at gmail.com (Vladimir Sokolovsky) Date: Wed, 07 Jan 2009 23:51:43 +0200 Subject: [ofa-general] Re: [PATCH] libibumad: Add sysfs_*() functions to libibumad.map In-Reply-To: <20090107161644.GF11759@sashak.voltaire.com> References: <20081229113805.GA25616@mellanox.co.il> <20090107161644.GF11759@sashak.voltaire.com> Message-ID: <496523EF.2050206@gmail.com> Sasha Khapyorsky wrote: > Hi Vladimir, > > On 13:38 Mon 29 Dec , Vladimir Sokolovsky wrote: > >> Signed-off-by: Vladimir Sokolovsky >> --- >> libibumad/src/libibumad.map | 5 +++++ >> 1 files changed, 5 insertions(+), 0 deletions(-) >> >> diff --git a/libibumad/src/libibumad.map b/libibumad/src/libibumad.map >> index 0154b7f..ea8999e 100644 >> --- a/libibumad/src/libibumad.map >> +++ b/libibumad/src/libibumad.map >> @@ -30,5 +30,10 @@ IBUMAD_1.0 { >> umad_debug; >> umad_addr_dump; >> umad_dump; >> + sys_read_gid; >> + sys_read_guid; >> + sys_read_string; >> + sys_read_uint; >> + sys_read_uint64; >> 
local: *; >> }; >> > > I don't think we should expose those functions in libibumad (btw there > are no those prototypes in umad.h). > > It would be better to reimplement related stuff in srptools - simplest > workaround could be just copying needed sys_*() functions, but better is > to use libibverbs calls (as Sean suggested) and to drop libibcommon > dependency (note that I'm planning to remove this library completely > soon, the patch was on the list already). > > Sasha > Hi Sasha, Ishai (srptools maintainer) has reimplemented these functions in srptools. Thanks, Vladimir From vladsk at gmail.com Wed Jan 7 13:57:56 2009 From: vladsk at gmail.com (Vladimir Sokolovsky) Date: Wed, 07 Jan 2009 23:57:56 +0200 Subject: ***SPAM*** Re: [ofa-general] building just kernel-ib{-devel} and not being root In-Reply-To: <1231271241.6441.52.camel@pc.interlinx.bc.ca> References: <1231271241.6441.52.camel@pc.interlinx.bc.ca> Message-ID: <49652564.3090904@gmail.com> Brian J. Murrell wrote: > I am wondering if there is a generally supported (or otherwise) way of > taking a pristine source tarball (i.e. OFED-1.4.tgz) and unpacking and > building just the kernel-ib{-devel} packages with a set of option > selections. > > The idea here is to drop something into an automated build system that > builds these packages. Previously I have just called rpmbuild on the > ofa_kernel.spec file directly with a boatload of options. I'm feeling > like this is not quite future-proof and looking for a more supported > method of achieving this. > > One of the caveats is that the normal build and install process of the > install.pl won't work for me. I cannot be root on the build system (and > therefore cannot install packages). > > Any ideas? > > Thanx, > b. 
> > > Hi Brian, You can download ofa_1_4_kernel.tgz (daily builds) from: http://www.openfabrics.org/downloads/ofa_1_4_kernel/ To compile, run: ./configure [OPTIONS] make Vladimir Regards From jlentini at netapp.com Wed Jan 7 14:11:48 2009 From: jlentini at netapp.com (James Lentini) Date: Wed, 7 Jan 2009 17:11:48 -0500 (EST) Subject: [ofa-general] Configuring a 4 KB InfniBand link MTU Message-ID: I have a Mellanox ConnectX HCA connected to a Cisco SFS 7000D switch. There are no other devices connected to the switch. According to their specifications, both the ConnectX chip (MT25418) in my HCA and the InfiniScale III chip (MT47396) in my switch support a 4 KB InfiniBand link MTU. In practice, I'm getting a 2 KB MTU. Is there some configuration setting I need to change to enable a 4 KB MTU? I'd like to see how a 4 KB MTU affects protocol performance. My topology looks like this: [HCA lid 8, port 1] <------> [SWITCH lid 2, port 1] smpquery shows a MtuCap of 4 KB for the HCA: # smpquery portinfo 8 1 # Port info: Lid 8 port 1 Mkey:............................0x0000000000000000 GidPrefix:.......................0xfe80000000000000 Lid:.............................0x0008 SMLid:...........................0x0002 CapMask:.........................0x2500868 IsTrapSupported IsAutomaticMigrationSupported IsSLMappingSupported IsSystemImageGUIDsupported IsVendorClassSupported IsCapabilityMaskNoticeSupported IsClientRegistrationSupported DiagCode:........................0x0000 MkeyLeasePeriod:.................15 LocalPort:.......................1 LinkWidthEnabled:................1X or 4X LinkWidthSupported:..............undefined (0) (IBA extension) LinkWidthActive:.................4X LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps LinkState:.......................Active PhysLinkState:...................LinkUp LinkDownDefState:................Polling ProtectBits:.....................0 LMC:.............................0 LinkSpeedActive:.................5.0 Gbps 
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps NeighborMTU:.....................2048 SMSL:............................0 VLCap:...........................VL0-7 InitType:........................0x00 VLHighLimit:.....................4 VLArbHighCap:....................8 VLArbLowCap:.....................8 InitReply:.......................0x00 MtuCap:..........................4096 VLStallCount:....................0 HoqLife:.........................31 OperVLs:.........................VL0-7 PartEnforceInb:..................0 PartEnforceOutb:.................0 FilterRawInb:....................0 FilterRawOutb:...................0 MkeyViolations:..................0 PkeyViolations:..................0 QkeyViolations:..................0 GuidCap:.........................32 ClientReregister:................0 SubnetTimeout:...................8 RespTimeVal:.....................16 LocalPhysErr:....................0 OverrunErr:......................5 MaxCreditHint:...................0 RoundTrip:.......................0 but smpquery reports only a 2 KB MTU capability for the switch # smpquery portinfo 2 1 # Port info: Lid 2 port 1 Mkey:............................0x0000000000000000 GidPrefix:.......................0x0000000000000000 Lid:.............................0x0000 SMLid:...........................0x0000 CapMask:.........................0x0 DiagCode:........................0x0000 MkeyLeasePeriod:.................0 LocalPort:.......................1 LinkWidthEnabled:................4X LinkWidthSupported:..............1X or 4X LinkWidthActive:.................4X LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps LinkState:.......................Active PhysLinkState:...................LinkUp LinkDownDefState:................Polling ProtectBits:.....................0 LMC:.............................0 LinkSpeedActive:.................5.0 Gbps LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps NeighborMTU:.....................2048 SMSL:............................0 
VLCap:...........................VL0-7 InitType:........................0x00 VLHighLimit:.....................0 VLArbHighCap:....................8 VLArbLowCap:.....................8 InitReply:.......................0x00 MtuCap:..........................2048 VLStallCount:....................7 HoqLife:.........................18 OperVLs:.........................VL0-7 PartEnforceInb:..................0 PartEnforceOutb:.................0 FilterRawInb:....................0 FilterRawOutb:...................0 MkeyViolations:..................0 PkeyViolations:..................0 QkeyViolations:..................0 GuidCap:.........................0 ClientReregister:................0 SubnetTimeout:...................0 RespTimeVal:.....................0 LocalPhysErr:....................15 OverrunErr:......................15 MaxCreditHint:...................0 RoundTrip:.......................0 and I confirmed that the switch had a MT47396: # smpquery nodeinfo 2 # Node info: Lid 2 BaseVers:........................1 ClassVers:.......................1 NodeType:........................Switch NumPorts:........................24 SystemGuid:......................0x0005ad03011df356 Guid:............................0x0005ad00001df356 PortGuid:........................0x0005ad00001df356 PartCap:.........................8 DevId:...........................0xb924 Revision:........................0x000001a1 LocalPort:.......................1 VendorId:........................0x0005ad From hal.rosenstock at gmail.com Wed Jan 7 14:28:48 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 7 Jan 2009 17:28:48 -0500 Subject: [ofa-general] Configuring a 4 KB InfniBand link MTU In-Reply-To: References: Message-ID: James, On Wed, Jan 7, 2009 at 5:11 PM, James Lentini wrote: > > I have a Mellanox ConnectX HCA connected to a Cisco SFS 7000D switch. > There are no other devices connected to the switch. 
> > According to their specifications, both the ConnectX chip (MT25418) in > my HCA and the InfiniScale III chip (MT47396) in my switch support a 4 > KB InfiniBand link MTU. > > In practice, I'm getting a 2 KB MTU. Is there some configuration > setting I need to change to enable a 4 KB MTU? I'd like to see how a 4 > KB MTU affects protocol performance. > > My topology looks like this: > > [HCA lid 8, port 1] <------> [SWITCH lid 2, port 1] > > smpquery shows a MtuCap of 4 KB for the HCA: > > # smpquery portinfo 8 1 > # Port info: Lid 8 port 1 > Mkey:............................0x0000000000000000 > GidPrefix:.......................0xfe80000000000000 > Lid:.............................0x0008 > SMLid:...........................0x0002 > CapMask:.........................0x2500868 > IsTrapSupported > IsAutomaticMigrationSupported > IsSLMappingSupported > IsSystemImageGUIDsupported > IsVendorClassSupported > IsCapabilityMaskNoticeSupported > IsClientRegistrationSupported > DiagCode:........................0x0000 > MkeyLeasePeriod:.................15 > LocalPort:.......................1 > LinkWidthEnabled:................1X or 4X > LinkWidthSupported:..............undefined (0) (IBA extension) > LinkWidthActive:.................4X > LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps > LinkState:.......................Active > PhysLinkState:...................LinkUp > LinkDownDefState:................Polling > ProtectBits:.....................0 > LMC:.............................0 > LinkSpeedActive:.................5.0 Gbps > LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps > NeighborMTU:.....................2048 > SMSL:............................0 > VLCap:...........................VL0-7 > InitType:........................0x00 > VLHighLimit:.....................4 > VLArbHighCap:....................8 > VLArbLowCap:.....................8 > InitReply:.......................0x00 > MtuCap:..........................4096 > VLStallCount:....................0 > 
HoqLife:.........................31 > OperVLs:.........................VL0-7 > PartEnforceInb:..................0 > PartEnforceOutb:.................0 > FilterRawInb:....................0 > FilterRawOutb:...................0 > MkeyViolations:..................0 > PkeyViolations:..................0 > QkeyViolations:..................0 > GuidCap:.........................32 > ClientReregister:................0 > SubnetTimeout:...................8 > RespTimeVal:.....................16 > LocalPhysErr:....................0 > OverrunErr:......................5 > MaxCreditHint:...................0 > RoundTrip:.......................0 > > but smpquery reports only a 2 KB MTU capability for the switch If the switch ports support a 4K MTU cap, it needs to be advertised (via the PortInfo response). That's what needs fixing if possible rather than some SM policy. Without an MTUCap of 4K on both sides of the link, an SM cannot set NeighborMTU to 4K. -- Hal > # smpquery portinfo 2 1 > # Port info: Lid 2 port 1 > Mkey:............................0x0000000000000000 > GidPrefix:.......................0x0000000000000000 > Lid:.............................0x0000 > SMLid:...........................0x0000 > CapMask:.........................0x0 > DiagCode:........................0x0000 > MkeyLeasePeriod:.................0 > LocalPort:.......................1 > LinkWidthEnabled:................4X > LinkWidthSupported:..............1X or 4X > LinkWidthActive:.................4X > LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps > LinkState:.......................Active > PhysLinkState:...................LinkUp > LinkDownDefState:................Polling > ProtectBits:.....................0 > LMC:.............................0 > LinkSpeedActive:.................5.0 Gbps > LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps > NeighborMTU:.....................2048 > SMSL:............................0 > VLCap:...........................VL0-7 > InitType:........................0x00 > 
VLHighLimit:.....................0 > VLArbHighCap:....................8 > VLArbLowCap:.....................8 > InitReply:.......................0x00 > MtuCap:..........................2048 > VLStallCount:....................7 > HoqLife:.........................18 > OperVLs:.........................VL0-7 > PartEnforceInb:..................0 > PartEnforceOutb:.................0 > FilterRawInb:....................0 > FilterRawOutb:...................0 > MkeyViolations:..................0 > PkeyViolations:..................0 > QkeyViolations:..................0 > GuidCap:.........................0 > ClientReregister:................0 > SubnetTimeout:...................0 > RespTimeVal:.....................0 > LocalPhysErr:....................15 > OverrunErr:......................15 > MaxCreditHint:...................0 > RoundTrip:.......................0 > > and I confirmed that the switch had a MT47396: > > # smpquery nodeinfo 2 > # Node info: Lid 2 > BaseVers:........................1 > ClassVers:.......................1 > NodeType:........................Switch > NumPorts:........................24 > SystemGuid:......................0x0005ad03011df356 > Guid:............................0x0005ad00001df356 > PortGuid:........................0x0005ad00001df356 > PartCap:.........................8 > DevId:...........................0xb924 > Revision:........................0x000001a1 > LocalPort:.......................1 > VendorId:........................0x0005ad > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sfr at canb.auug.org.au Wed Jan 7 15:47:31 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Thu, 8 Jan 2009 10:47:31 +1100 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: use consistent type In-Reply-To: References: 
<20081231141453.45d7f2c1.sfr@canb.auug.org.au> Message-ID: <20090108104731.6135319c.sfr@canb.auug.org.au> Hi Roland, On Wed, 07 Jan 2009 11:32:13 -0800 Roland Dreier wrote: > > If we're going to clean this code up, does it make sense to take it > further? More precisely, your patch does: > > @@ -226,7 +226,7 @@ u64 hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, > u32 *eq_ist) > { > u64 ret; > - u64 outs[PLPAR_HCALL9_BUFSIZE]; > + unsigned long outs[PLPAR_HCALL9_BUFSIZE]; > u64 allocate_controls; > > but every parameter of ehca_plpar_hcall9() is unsigned long, and the > return value is a signed long. So should we change ret to long and all > these other declarations to unsigned long while we're touching the code > here? At least all the others are passed by value and the normal arithmetic promotion rules apply so no warnings are issued. I will see how much of a pain it is to change the others. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ From yosefe at voltaire.com Wed Jan 7 15:55:26 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Thu, 8 Jan 2009 01:55:26 +0200 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH v3] ipoib: do not join broadcast group if interface is brought down In-Reply-To: References: <495B2C60.6020008@Voltaire.COM> <49650FC2.4070509@Voltaire.COM> Message-ID: <32cb786f0901071555k6e5a593agb9a2ff46c604f1d6@mail.gmail.com> > > OK, so the race exists in the current code, but no one has hit it enough > to track it down. Maybe we can fix it later. > > But you ignored my second question about lock nesting problems? > > - R. You're right, ipoib_mcast_join() better not be called with spinlocks. If we assume that we don't fix this race now then that lock can be dropped. 
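The point about not calling ipoib_mcast_join() with a spinlock held can be modelled in userspace along the lines below. This is a sketch only, and it shows one common shape (test the flags under the lock, join after releasing it) rather than what the final patch does; the suggestion in the thread is simply to drop the lock. Every `model_*` name is invented for illustration, the "lock" is a plain flag rather than a real spinlock, and nothing here addresses the priv->broadcast NULL race discussed earlier.

```c
/* Hypothetical model of the relevant ipoib state; not driver code. */
struct model_priv {
	int lock_held;          /* stands in for priv->lock */
	int have_broadcast;     /* priv->broadcast != NULL */
	int attached;           /* IPOIB_MCAST_FLAG_ATTACHED */
	int busy;               /* IPOIB_MCAST_FLAG_BUSY */
	int joins;              /* number of join attempts made */
	int joins_under_lock;   /* joins attempted with the lock held (bad) */
};

static void model_lock(struct model_priv *p)   { p->lock_held = 1; }
static void model_unlock(struct model_priv *p) { p->lock_held = 0; }

/* Stands in for ipoib_mcast_join(), which may sleep. */
static void model_join(struct model_priv *p)
{
	if (p->lock_held)
		p->joins_under_lock++;
	p->joins++;
}

/* Test the flags under the lock, then join with the lock dropped. */
static void model_join_task(struct model_priv *p)
{
	int need_join;

	model_lock(p);
	need_join = p->have_broadcast && !p->attached && !p->busy;
	model_unlock(p);

	if (need_join)
		model_join(p);
}
```

In this model `joins_under_lock` stays at zero, which is the property lockdep would be checking for in the real driver.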
From watters at acm.org Wed Jan 7 20:09:03 2009 From: watters at acm.org (Samuel Watters) Date: Wed, 7 Jan 2009 22:09:03 -0600 Subject: [ofa-general] OFED 1.4 on 32-bit PPC? Message-ID: <48AFAECB-E60E-49FA-86B0-7E982C66C598@acm.org> Dear Members: From the release notes for the OFED 1.4 linux sources, support is indicated for the x86, x86_64, PPC64 and ia64 architectures. Since the software supports both 32 & 64-bit x86 is there any known reason it couldn't be built to run on a 32-bit PPC system in addition to the 64-bit PPC? Thank you, Sam Watters From sashak at voltaire.com Wed Jan 7 20:53:29 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 8 Jan 2009 06:53:29 +0200 Subject: [ofa-general] [PATCH] srptools: eliminate libibcommon dependencies In-Reply-To: <496523EF.2050206@gmail.com> References: <20081229113805.GA25616@mellanox.co.il> <20090107161644.GF11759@sashak.voltaire.com> <496523EF.2050206@gmail.com> Message-ID: <20090108045320.GA13222@sashak.voltaire.com> Eliminate libibcommon dependencies - this library will be removed soon. Signed-off-by: Sasha Khapyorsky --- On 23:51 Wed 07 Jan , Vladimir Sokolovsky wrote: > > Ishai (srptools maintainer) has reimplemented these functions in srptools, In order to eliminate libibcommon stuff something like this is needed too (compilation only was tested). 
Sasha Makefile.am | 2 +- configure.in | 2 -- srp_daemon/srp_daemon.c | 1 - srp_daemon/srp_daemon.h | 18 ++++++++++++++++++ srp_daemon/srp_handle_traps.c | 4 ++-- 5 files changed, 21 insertions(+), 6 deletions(-) diff --git a/Makefile.am b/Makefile.am index dcca848..be1e07a 100644 --- a/Makefile.am +++ b/Makefile.am @@ -6,7 +6,7 @@ man_MANS = man/ibsrpdm.1 man/srp_daemon.1 src_ibsrpdm_CFLAGS = -Wall src_ibsrpdm_SOURCES = src/srp-dm.c -srp_daemon_srp_daemon_LDADD = -libumad -libcommon -libverbs -lpthread +srp_daemon_srp_daemon_LDADD = -libumad -libverbs -lpthread srp_daemon_srp_daemon_CFLAGS = -Wall -I $(DESTDIR)$(includedir) -fno-strict-aliasing srp_daemon_srp_daemon_SOURCES = srp_daemon/srp_daemon.c srp_daemon/srp_handle_traps.c srp_daemon/srp_sync.c diff --git a/configure.in b/configure.in index e06feaf..053133e 100644 --- a/configure.in +++ b/configure.in @@ -20,8 +20,6 @@ AC_PROG_CC # Checks for libraries. if test "$disable_libcheck" != "yes" then -AC_CHECK_LIB([ibcommon], [stack_dump], [], - AC_MSG_ERROR([srptools require libibcommon.])) AC_CHECK_LIB([ibumad], [umad_init], [], AC_MSG_ERROR([srptools require libibumad.])) AC_CHECK_LIB([ibverbs], [ibv_get_device_list], [], diff --git a/srp_daemon/srp_daemon.c b/srp_daemon/srp_daemon.c index 936bcbd..5e1e198 100644 --- a/srp_daemon/srp_daemon.c +++ b/srp_daemon/srp_daemon.c @@ -58,7 +58,6 @@ #include #include #include -#include #include "srp_ib_types.h" #include "srp_daemon.h" diff --git a/srp_daemon/srp_daemon.h b/srp_daemon/srp_daemon.h index eb05d9f..77dcc0d 100644 --- a/srp_daemon/srp_daemon.h +++ b/srp_daemon/srp_daemon.h @@ -37,6 +37,8 @@ #define SRP_DM_H #include +#include +#include #include #include "srp_ib_types.h" @@ -369,6 +371,22 @@ struct resources { static const int node_table_response_size = 1 << 18; +#if __BYTE_ORDER == __LITTLE_ENDIAN +#ifndef ntohll +#define ntohll(x) bswap_64(x) +#endif +#ifndef htonll +#define htonll(x) bswap_64(x) +#endif +#elif __BYTE_ORDER == __BIG_ENDIAN +#ifndef ntohll 
+#define ntohll(x) (x) +#endif +#ifndef htonll +#define htonll(x) (x) +#endif +#endif /* __BYTE_ORDER == __BIG_ENDIAN */ + #define pr_human(arg...) \ do { \ if (!config->cmd && !config->execute) \ diff --git a/srp_daemon/srp_handle_traps.c b/srp_daemon/srp_handle_traps.c index d973c21..b5f9f83 100644 --- a/srp_daemon/srp_handle_traps.c +++ b/srp_daemon/srp_handle_traps.c @@ -42,7 +42,6 @@ #include #include #include -#include #include #include "srp_ib_types.h" @@ -558,7 +557,8 @@ static int register_to_trap(struct ud_resources *res, int dest_lid, int trap_num pthread_mutex_lock(res->mad_buffer_mutex); res->mad_buffer->base_ver = 0; // flag that the buffer is empty pthread_mutex_unlock(res->mad_buffer_mutex); - mad_hdr->trans_id = htonll(trans_id++); + trans_id++; + mad_hdr->trans_id = htonll(trans_id); ret = ibv_post_send(res->qp, &sr, bad_wr); if (ret) { -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Wed Jan 7 21:01:39 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 8 Jan 2009 07:01:39 +0200 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <694d48600901071230x31ca8bb0l55a8245f092d633c@mail.gmail.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <49630C47.4000302@gmail.com> <694d48600901071230x31ca8bb0l55a8245f092d633c@mail.gmail.com> Message-ID: <20090108050139.GB13222@sashak.voltaire.com> On 22:30 Wed 07 Jan , Eli Dorfman wrote: > > > I am working with ConnectX HCA and latest fw versiong 2.6.0. > The IS4 switch chip does not support portXmitWait reset. But does it report XmitWait counter support in ClassPortInfo? Anyway it would be better to implement XmitWait reset in accordance to ClassPortInfo. Also would be helpful to test it with "non-working" fw - in hope that reset attempt is just ignored there. 
Sasha From ishai at mellanox.co.il Thu Jan 8 01:30:15 2009 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Thu, 8 Jan 2009 11:30:15 +0200 Subject: [ofa-general] RE: [PATCH] srptools: eliminate libibcommon dependencies In-Reply-To: <20090108045320.GA13222@sashak.voltaire.com> References: <20081229113805.GA25616@mellanox.co.il> <20090107161644.GF11759@sashak.voltaire.com> <496523EF.2050206@gmail.com> <20090108045320.GA13222@sashak.voltaire.com> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01691B98@mtlexch01.mtl.com> Sasha, Thanks, I applied all but the last change (trans_id). Why did you make it? Ishai > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, January 08, 2009 6:53 AM > To: Vladimir Sokolovsky; Ishai Rabinovitz > Cc: OpenFabrics General > Subject: [PATCH] srptools: eliminate libibcommon dependencies > > > Eliminate libibcommon dependencies - this library will be removed soon. > > Signed-off-by: Sasha Khapyorsky > --- > > On 23:51 Wed 07 Jan , Vladimir Sokolovsky wrote: > > > > Ishai (srptools maintainer) has reimplemented these functions in > srptools, > > In order to eliminate libibcommon stuff something like this is needed > too (compilation only was tested).
> > Sasha > > Makefile.am | 2 +- > configure.in | 2 -- > srp_daemon/srp_daemon.c | 1 - > srp_daemon/srp_daemon.h | 18 ++++++++++++++++++ > srp_daemon/srp_handle_traps.c | 4 ++-- > 5 files changed, 21 insertions(+), 6 deletions(-) > > diff --git a/Makefile.am b/Makefile.am > index dcca848..be1e07a 100644 > --- a/Makefile.am > +++ b/Makefile.am > @@ -6,7 +6,7 @@ man_MANS = man/ibsrpdm.1 man/srp_daemon.1 > src_ibsrpdm_CFLAGS = -Wall > src_ibsrpdm_SOURCES = src/srp-dm.c > > -srp_daemon_srp_daemon_LDADD = -libumad -libcommon -libverbs -lpthread > +srp_daemon_srp_daemon_LDADD = -libumad -libverbs -lpthread > srp_daemon_srp_daemon_CFLAGS = -Wall -I $(DESTDIR)$(includedir) -fno- > strict-aliasing > srp_daemon_srp_daemon_SOURCES = srp_daemon/srp_daemon.c > srp_daemon/srp_handle_traps.c srp_daemon/srp_sync.c > > diff --git a/configure.in b/configure.in > index e06feaf..053133e 100644 > --- a/configure.in > +++ b/configure.in > @@ -20,8 +20,6 @@ AC_PROG_CC > # Checks for libraries. > if test "$disable_libcheck" != "yes" > then > -AC_CHECK_LIB([ibcommon], [stack_dump], [], > - AC_MSG_ERROR([srptools require libibcommon.])) > AC_CHECK_LIB([ibumad], [umad_init], [], > AC_MSG_ERROR([srptools require libibumad.])) > AC_CHECK_LIB([ibverbs], [ibv_get_device_list], [], > diff --git a/srp_daemon/srp_daemon.c b/srp_daemon/srp_daemon.c > index 936bcbd..5e1e198 100644 > --- a/srp_daemon/srp_daemon.c > +++ b/srp_daemon/srp_daemon.c > @@ -58,7 +58,6 @@ > #include > #include > #include > -#include > #include "srp_ib_types.h" > > #include "srp_daemon.h" > diff --git a/srp_daemon/srp_daemon.h b/srp_daemon/srp_daemon.h > index eb05d9f..77dcc0d 100644 > --- a/srp_daemon/srp_daemon.h > +++ b/srp_daemon/srp_daemon.h > @@ -37,6 +37,8 @@ > #define SRP_DM_H > > #include > +#include > +#include > #include > > #include "srp_ib_types.h" > @@ -369,6 +371,22 @@ struct resources { > > static const int node_table_response_size = 1 << 18; > > +#if __BYTE_ORDER == __LITTLE_ENDIAN > +#ifndef ntohll > +#define 
ntohll(x) bswap_64(x) > +#endif > +#ifndef htonll > +#define htonll(x) bswap_64(x) > +#endif > +#elif __BYTE_ORDER == __BIG_ENDIAN > +#ifndef ntohll > +#define ntohll(x) (x) > +#endif > +#ifndef htonll > +#define htonll(x) (x) > +#endif > +#endif /* __BYTE_ORDER == __BIG_ENDIAN */ > + > #define pr_human(arg...) \ > do { \ > if (!config->cmd && !config->execute) \ > diff --git a/srp_daemon/srp_handle_traps.c > b/srp_daemon/srp_handle_traps.c > index d973c21..b5f9f83 100644 > --- a/srp_daemon/srp_handle_traps.c > +++ b/srp_daemon/srp_handle_traps.c > @@ -42,7 +42,6 @@ > #include > #include > #include > -#include > #include > > #include "srp_ib_types.h" > @@ -558,7 +557,8 @@ static int register_to_trap(struct ud_resources > *res, int dest_lid, int trap_num > pthread_mutex_lock(res->mad_buffer_mutex); > res->mad_buffer->base_ver = 0; // flag that the buffer is > empty > pthread_mutex_unlock(res->mad_buffer_mutex); > - mad_hdr->trans_id = htonll(trans_id++); > + trans_id++; > + mad_hdr->trans_id = htonll(trans_id); > > ret = ibv_post_send(res->qp, &sr, bad_wr); > if (ret) { > -- > 1.6.0.4.766.g6fc4a From vlad at lists.openfabrics.org Thu Jan 8 03:12:33 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 8 Jan 2009 03:12:33 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090108-0200 daily build status Message-ID: <20090108111233.F12E0E60F0D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with 
linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From tziporet at dev.mellanox.co.il Thu Jan 8 03:18:14 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 08 Jan 2009 13:18:14 +0200 Subject: [ofa-general] OFED 1.4 on 32-bit PPC? In-Reply-To: <48AFAECB-E60E-49FA-86B0-7E982C66C598@acm.org> References: <48AFAECB-E60E-49FA-86B0-7E982C66C598@acm.org> Message-ID: <4965E0F6.3050201@mellanox.co.il> Samuel Watters wrote: > Dear Members: > > From the release notes for the OFED 1.4 linux sources, support is > indicated for the x86, x86_64, PPC64 and ia64 architectures. 
Since > the software supports both 32 & 64-bit x86 is there any known reason > it couldn't be built to run on a 32-bit PPC system in addition to the > 64-bit PPC? > > We do support compiling 32-bit user space libraries on PPC64. The main reason that PPC32 is not in the list of supported architectures is that we do not have such a system with HCAs for testing. So you can try to build and see if you encounter any problems. If you have any fixes, please provide them. Tziporet From tziporet at dev.mellanox.co.il Thu Jan 8 04:02:11 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 08 Jan 2009 14:02:11 +0200 Subject: [ofa-general] Re: [ewg] RE: OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <20090107101545.210e7f4c.weiny2@llnl.gov> References: <9FA59C95FFCBB34EA5E42C1A8573784F0183F28A@mtiexch01> <1231341065.32405.876.camel@firewall.xsintricity.com> <382A478CAD40FA4FB46605CF81FE39F417EE2676@orsmsx507.amr.corp.intel.com> <20090107101545.210e7f4c.weiny2@llnl.gov> Message-ID: <4965EB43.1030204@mellanox.co.il> Ira Weiny wrote: > > I agree. OFED should be downstream of kernel.org for as much as possible. New > features should be introduced there first. > > Ira > > I totally agree and we are going to send all patches to kernel.org soon. Note that the changes affect not only kernel space but user space too.
The reason I brought it to OFA is that like iWARP at the past we also need OFA decision Tziporet From olga.shern at gmail.com Thu Jan 8 04:05:10 2009 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Thu, 8 Jan 2009 14:05:10 +0200 Subject: [ofa-general] ***SPAM*** Re: [ewg] OFED Jan 5, 2009 meeting minutes on OFED plans In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0164F1F0@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD015FFC3E@mtlexch01.mtl.com> <5D49E7A8952DC44FB38C38FA0D758EAD0164F1F0@mtlexch01.mtl.com> Message-ID: > - Kernel base will be 2.6.29 Hi, Kernel 2.6.29 window will be closed very soon, so it means that we cannot have any new features in this kernel. Therefore no new features in OFED 1.5. I think we should be based on 2.6.30. And I agree with Tziporet regarding the OFED 1.5 schedule, no need to rush, OFED is mature enough, therefore no need to have releases every 1/2 year. Olga From dorfman.eli at gmail.com Thu Jan 8 04:17:30 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 08 Jan 2009 14:17:30 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <20090108050139.GB13222@sashak.voltaire.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <49630C47.4000302@gmail.com> <694d48600901071230x31ca8bb0l55a8245f092d633c@mail.gmail.com> <20090108050139.GB13222@sashak.voltaire.com> Message-ID: <4965EEDA.6030504@gmail.com> Sasha Khapyorsky wrote: > On 22:30 Wed 07 Jan , Eli Dorfman wrote: >> I am working with ConnectX HCA and latest fw versiong 2.6.0. >> The IS4 switch chip does not support portXmitWait reset. > > But does it report XmitWait counter support in ClassPortInfo? Anyway it > would be better to implement XmitWait reset in accordance to > ClassPortInfo. Also would be helpful to test it with "non-working" fw - > in hope that reset attempt is just ignored there. 
I understand that ClassPortInfo specifies whether PortXmitWait is supported. It does not distinguish between get and set operation. We have seen that ClassPortInfo to IS4 returns garbage. I don't see what's the point of testing the utility vs. a non working fw. Eli From sashak at voltaire.com Thu Jan 8 05:02:58 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 8 Jan 2009 15:02:58 +0200 Subject: [ofa-general] Re: [PATCH] srptools: eliminate libibcommon dependencies In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD01691B98@mtlexch01.mtl.com> References: <20081229113805.GA25616@mellanox.co.il> <20090107161644.GF11759@sashak.voltaire.com> <496523EF.2050206@gmail.com> <20090108045320.GA13222@sashak.voltaire.com> <5D49E7A8952DC44FB38C38FA0D758EAD01691B98@mtlexch01.mtl.com> Message-ID: <20090108130258.GC13222@sashak.voltaire.com> Hi Ishai, On 11:30 Thu 08 Jan , Ishai Rabinovitz wrote: > > Thanks, I applied all but the lest change (trans_id). > Why did you make it? ntohll() is macro (defined via bswap_64() macro), using x++ as macro parameter is not safe - '++' can be executed more than once. 
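A minimal sketch of the hazard described above, under stated assumptions: SWAP16_MACRO and next_id() are hypothetical names made up for this illustration, and the macro is not the real bswap_64() expansion. It only shows that a macro which names its argument more than once also repeats any side effect in that argument, and that doing the side effect first and passing a plain variable avoids it.

```c
#include <stdint.h>

/* Hypothetical byte-swap macro, for illustration only.
 * Note that the argument x appears twice in the expansion. */
#define SWAP16_MACRO(x) \
	((uint16_t)((((x) & 0x00ffu) << 8) | (((x) >> 8) & 0x00ffu)))

static unsigned int calls;	/* how many times next_id() has run */
static uint16_t last_id;	/* last identifier handed out */

/* Stand-in for an argument with a side effect, such as trans_id++. */
static uint16_t next_id(void)
{
	calls++;
	return ++last_id;
}

/* Unsafe pattern: the side effect runs once per appearance of x. */
static unsigned int calls_when_unsafe(void)
{
	calls = 0;
	(void)SWAP16_MACRO(next_id());
	return calls;
}

/* Safe pattern: perform the side effect first, then pass a variable. */
static unsigned int calls_when_safe(void)
{
	uint16_t id;

	calls = 0;
	id = next_id();
	(void)SWAP16_MACRO(id);
	return calls;
}
```

In this sketch calls_when_unsafe() reports two evaluations of the argument and calls_when_safe() reports one, which mirrors the change made to register_to_trap() in the patch.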
Finally I'm getting such code after srp_daemon/srp_handle_traps.c preprocessing (gcc-4.3.2 32-bit x86): mad_hdr->trans_id = (__extension__ ({ union { __extension__ unsigned long long int __ll; unsigned long int __l[2]; } __w, __r; if (__builtin_constant_p (trans_id++)) __r.__ll = ((((trans_id++) & 0xff00000000000000ull) >> 56) | (((trans_id++) & 0x00ff000000000000ull) >> 40) | (((trans_id++) & 0x0000ff0000000000ull) >> 24) | (((trans_id++) & 0x000000ff00000000ull) >> 8) | (((trans_id++) & 0x00000000ff000000ull) << 8) | (((trans_id++) & 0x0000000000ff0000ull) << 24) | (((trans_id++) & 0x000000000000ff00ull) << 40) | (((trans_id++) & 0x00000000000000ffull) << 56)); else { __w.__ll = (trans_id++); __r.__l[0] = (__extension__ ({ register unsigned int __v, __x = (__w.__l[1]); if (__builtin_constant_p (__x)) __v = ((((__x) & 0xff000000) >> 24) | (((__x) & 0x00ff0000) >> 8) | (((__x) & 0x0000ff00) << 8) | (((__x) & 0x000000ff) << 24)); else __asm__ ("bswap %0" : "=r" (__v) : "0" (__x)); __v; })); __r.__l[1] = (__extension__ ({ register unsigned int __v, __x = (__w.__l[0]); if (__builtin_constant_p (__x)) __v = ((((__x) & 0xff000000) >> 24) | (((__x) & 0x00ff0000) >> 8) | (((__x) & 0x0000ff00) << 8) | (((__x) & 0x000000ff) << 24)); else __asm__ ("bswap %0" : "=r" (__v) : "0" (__x)); __v; })); } __r.__ll; })); Sasha From ishai at mellanox.co.il Thu Jan 8 05:29:50 2009 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Thu, 8 Jan 2009 15:29:50 +0200 Subject: [ofa-general] RE: [PATCH] srptools: eliminate libibcommon dependencies In-Reply-To: <20090108130258.GC13222@sashak.voltaire.com> References: <20081229113805.GA25616@mellanox.co.il> <20090107161644.GF11759@sashak.voltaire.com> <496523EF.2050206@gmail.com> <20090108045320.GA13222@sashak.voltaire.com> <5D49E7A8952DC44FB38C38FA0D758EAD01691B98@mtlexch01.mtl.com> <20090108130258.GC13222@sashak.voltaire.com> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD01691E58@mtlexch01.mtl.com> OK, Thanks. 
(It is a stupid mistake, still, it is not a real bug, I only need trans_id to be unique) I missed your intention because you also changed the behavior from trans_id++ to ++trans_id. Your change was applied. Ishai > -----Original Message----- > From: Sasha Khapyorsky [mailto:sashak at voltaire.com] > Sent: Thursday, January 08, 2009 3:03 PM > To: Ishai Rabinovitz > Cc: Vladimir Sokolovsky; OpenFabrics General > Subject: Re: [PATCH] srptools: eliminate libibcommon dependencies > > Hi Ishai, > > On 11:30 Thu 08 Jan , Ishai Rabinovitz wrote: > > > > Thanks, I applied all but the lest change (trans_id). > > Why did you make it? > > ntohll() is macro (defined via bswap_64() macro), using x++ as macro > parameter is not safe - '++' can be executed more than once. Finally > I'm > getting such code after srp_daemon/srp_handle_traps.c preprocessing > (gcc-4.3.2 32-bit x86): > > mad_hdr->trans_id = (__extension__ ({ union { __extension__ unsigned > long long int __ll; unsigned long int __l[2]; } __w, __r; if > (__builtin_constant_p (trans_id++)) __r.__ll = ((((trans_id++) & > 0xff00000000000000ull) >> 56) | (((trans_id++) & 0x00ff000000000000ull) > >> 40) | (((trans_id++) & 0x0000ff0000000000ull) >> 24) | > (((trans_id++) & 0x000000ff00000000ull) >> 8) | (((trans_id++) & > 0x00000000ff000000ull) << 8) | (((trans_id++) & 0x0000000000ff0000ull) > << 24) | (((trans_id++) & 0x000000000000ff00ull) << 40) | > (((trans_id++) & 0x00000000000000ffull) << 56)); else { __w.__ll = > (trans_id++); __r.__l[0] = (__extension__ ({ register unsigned int __v, > __x = (__w.__l[1]); if (__builtin_constant_p (__x)) __v = ((((__x) & > 0xff000000) >> 24) | (((__x) & 0x00ff0000) >> 8) | (((__x) & > 0x0000ff00) << 8) | (((__x) & 0x000000ff) << 24)); else __asm__ ("bswap > %0" : "=r" (__v) : "0" (__x)); __v; })); __r.__l[1] = (__extension__ ({ > register unsigned int __v, __x = (__w.__l[0]); if (__builtin_constant_p > (__x)) __v = ((((__x) & 0xff000000) >> 24) | (((__x) & 0x00ff0000) >> > 8) | (((__x) 
& 0x0000ff00) << 8) | (((__x) & 0x000000ff) << 24)); else > __asm__ ("bswap %0" : "=r" (__v) : "0" (__x)); __v; })); } __r.__ll; > })); > > Sasha From cbuchibabu at gmail.com Thu Jan 8 05:33:16 2009 From: cbuchibabu at gmail.com (Buchibabu Chennupati) Date: Thu, 8 Jan 2009 19:03:16 +0530 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Query regarding smpdump and smpquery tools In-Reply-To: <20090107160345.GE11759@sashak.voltaire.com> References: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> <20090107151241.GD11759@sashak.voltaire.com> <63b790b10901070725x3d9bf961q5642bf2c4b9b2bb9@mail.gmail.com> <20090107160345.GE11759@sashak.voltaire.com> Message-ID: <63b790b10901080533g493c00aex584b39be5a72cdc@mail.gmail.com> I have an application which emulates smpquery and smpdump. It uses madrpc_init for initialization and then uses umad_send and umad_recv to send and receive packets. It works for smpquery but doesn't work for smpdump. I have verified the contents of the packets through my application and through infiniband-diags tools smpdump and smpquery. Everything looks correct but not sure why smpdump was failing in my application. Regards, Buchibabu. On Wed, Jan 7, 2009 at 9:33 PM, Sasha Khapyorsky wrote: > On 20:55 Wed 07 Jan , Buchibabu Chennupati wrote: > > Current smpquery uses madrpc_init and calls smpquery with necessary > > attribute. > > It just adds another layer, finally madrpc* uses umad*() calls. > > > Basically I want use madrpc_init in the smpdump implementation as well. > > What are you trying to achieve this way (technically it should be > possible)? > > Sasha > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Thu Jan 8 06:04:57 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 8 Jan 2009 09:04:57 -0500 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <4965EEDA.6030504@gmail.com> References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <49630C47.4000302@gmail.com> <694d48600901071230x31ca8bb0l55a8245f092d633c@mail.gmail.com> <20090108050139.GB13222@sashak.voltaire.com> <4965EEDA.6030504@gmail.com> Message-ID: Eli. On Thu, Jan 8, 2009 at 7:17 AM, Eli Dorfman (Voltaire) wrote: > I understand that ClassPortInfo specifies whether PortXmitWait is supported. > It does not distinguish between get and set operation. > We have seen that ClassPortInfo to IS4 returns garbage. I don't have such a device. By garbage, do you mean that it responds to Get CPI but the contents of the CPI response specifically CapabilityMask field is garbage ? If that is the case, I don't see a generalized fix for this and only a specific workaround being possible (like device IS4 and firmware version checks which is pretty ugly). Any other ideas ? -- Hal From michael.heinz at qlogic.com Thu Jan 8 06:17:09 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Thu, 8 Jan 2009 08:17:09 -0600 Subject: [ofa-general] Usage of checkpatch In-Reply-To: References: <1228698773-26528-1-git-send-email-ddiss@sgi.com> Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E746248BE@MNEXMB1.qlogic.org> Roland, I've noticed that you've mentioned checkpatch.pl a few times; I'd like to think of myself as a good programmer, so I'd like to use it myself - but I'm running into a problem: checkpatch.pl doesn't seem to be included in any of the distros I have installed (RedHat, CentOS, SuSE). Do I need to be running a kernel.org kernel in order to get this script? 
Thanks in advance, Mike Heinz From ronli at voltaire.com Thu Jan 8 06:39:01 2009 From: ronli at voltaire.com (Ron Livne) Date: Thu, 8 Jan 2009 16:39:01 +0200 Subject: [ofa-general] About create_qp_flags merging Message-ID: Roland, You've told me that you're not going to merge the create_qp_flags patches because they are stuck behind XRC. However, I reposted new patches in December that don't rely on the XRC patches. Can you please comment on them? Here are links to them: Patch 0/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4392994 Patch 1/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393054 Patch 2/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393004 Patch 3/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393014 Patch 4/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393024 Thanks, Ron From brian at sun.com Thu Jan 8 06:58:39 2009 From: brian at sun.com (Brian J. Murrell) Date: Thu, 08 Jan 2009 09:58:39 -0500 Subject: [ofa-general] building just kernel-ib{-devel} and not being root In-Reply-To: <49652564.3090904@gmail.com> References: <1231271241.6441.52.camel@pc.interlinx.bc.ca> <49652564.3090904@gmail.com> Message-ID: <1231426719.14860.74.camel@pc.interlinx.bc.ca> On Wed, 2009-01-07 at 23:57 +0200, Vladimir Sokolovsky wrote: > > Hi Brian, Hi Vladimir, > You can download ofa_1_4_kernel.tgz (daily builds) from: > > http://www.openfabrics.org/downloads/ofa_1_4_kernel/ > > To compile, run: > ./configure [OPTIONS] > make Well, in fact my goal is to build kernel-ib rpms. Not sure if configure; make in that download will do that, but certainly, I can apply rpmbuild --rebuild to the SRPM that comes out of the OFED GA package and it seems to work. But I was looking for a more "sanctioned" avenue.
It would sure be nice if one of the modes of install.pl was to simply build RPMs without installing them and without requiring root. b. From jlentini at netapp.com Thu Jan 8 07:25:00 2009 From: jlentini at netapp.com (James Lentini) Date: Thu, 8 Jan 2009 10:25:00 -0500 (EST) Subject: [ofa-general] Configuring a 4 KB InfniBand link MTU In-Reply-To: References: Message-ID: On Wed, 7 Jan 2009, Hal Rosenstock wrote: > If the switch ports support a 4K MTU cap, it needs to be advertised > (via the PortInfo response). That's what needs fixing if possible > rather than some SM policy. Without an MTUCap of 4K on both sides of > the link, an SM cannot set NeighborMTU to 4K. Thanks Hal. I'll do some more research on my switch to see if it is indeed capable of a 4 KB MTU. This datasheet on the MT47396 lists a 4 KB MTU as one of its features: http://www.mellanox.com/related-docs/prod_silicon/InfiniScaleIII.pdf Perhaps the version of the chip I have doesn't support a 4 KB MTU or the switch hardware/firmware doesn't support it. 
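Hal's point about MTUCap can be sketched in a few lines of C. This is an illustration only, not OpenSM code: mtu_code_to_bytes() and max_neighbor_mtu() are hypothetical helper names, and the sketch assumes the standard InfiniBand MTU encoding (code 1 = 256 bytes up to code 5 = 4096 bytes), which matches tool output of the form "max_mtu: 2048 (4)". The largest NeighborMTU that can be set on a link is bounded by the smaller MtuCap of the two ports.

```c
#include <stdint.h>

/* Convert an IB MTU code to bytes: 1 -> 256, 2 -> 512, 3 -> 1024,
 * 4 -> 2048, 5 -> 4096. Returns -1 for a code outside that range. */
static int mtu_code_to_bytes(uint8_t code)
{
	if (code < 1 || code > 5)
		return -1;
	return 128 << code;
}

/* Hypothetical helper: the largest NeighborMTU code that could be set
 * on a link, given the MtuCap code advertised by each side. */
static uint8_t max_neighbor_mtu(uint8_t local_mtu_cap, uint8_t remote_mtu_cap)
{
	return local_mtu_cap < remote_mtu_cap ? local_mtu_cap : remote_mtu_cap;
}
```

With an HCA port advertising code 5 (4096) and a switch port advertising code 4 (2048), as in the PortInfo output quoted in this message, the helper yields code 4, so the link stays at 2048 bytes regardless of SM policy.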
> > # smpquery portinfo 2 1 > > # Port info: Lid 2 port 1 > > Mkey:............................0x0000000000000000 > > GidPrefix:.......................0x0000000000000000 > > Lid:.............................0x0000 > > SMLid:...........................0x0000 > > CapMask:.........................0x0 > > DiagCode:........................0x0000 > > MkeyLeasePeriod:.................0 > > LocalPort:.......................1 > > LinkWidthEnabled:................4X > > LinkWidthSupported:..............1X or 4X > > LinkWidthActive:.................4X > > LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps > > LinkState:.......................Active > > PhysLinkState:...................LinkUp > > LinkDownDefState:................Polling > > ProtectBits:.....................0 > > LMC:.............................0 > > LinkSpeedActive:.................5.0 Gbps > > LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps > > NeighborMTU:.....................2048 > > SMSL:............................0 > > VLCap:...........................VL0-7 > > InitType:........................0x00 > > VLHighLimit:.....................0 > > VLArbHighCap:....................8 > > VLArbLowCap:.....................8 > > InitReply:.......................0x00 > > MtuCap:..........................2048 > > VLStallCount:....................7 > > HoqLife:.........................18 > > OperVLs:.........................VL0-7 > > PartEnforceInb:..................0 > > PartEnforceOutb:.................0 > > FilterRawInb:....................0 > > FilterRawOutb:...................0 > > MkeyViolations:..................0 > > PkeyViolations:..................0 > > QkeyViolations:..................0 > > GuidCap:.........................0 > > ClientReregister:................0 > > SubnetTimeout:...................0 > > RespTimeVal:.....................0 > > LocalPhysErr:....................15 > > OverrunErr:......................15 > > MaxCreditHint:...................0 > > RoundTrip:.......................0 > > 
> > and I confirmed that the switch had a MT47396: > > > > # smpquery nodeinfo 2 > > # Node info: Lid 2 > > BaseVers:........................1 > > ClassVers:.......................1 > > NodeType:........................Switch > > NumPorts:........................24 > > SystemGuid:......................0x0005ad03011df356 > > Guid:............................0x0005ad00001df356 > > PortGuid:........................0x0005ad00001df356 > > PartCap:.........................8 > > DevId:...........................0xb924 > > Revision:........................0x000001a1 > > LocalPort:.......................1 > > VendorId:........................0x0005ad > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > > > From celine.bourde at ext.bull.net Thu Jan 8 08:30:31 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Thu, 08 Jan 2009 17:30:31 +0100 Subject: [ofa-general] Ethernet emulation on ConnectX card In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0146618E@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD0146618E@mtlexch01.mtl.com> Message-ID: <49662A27.5020107@ext.bull.net> I've a ConnectX card and I would like to test Ethernet emulation on both IB ports. Is there any documentation on it ? Should I modify the firmware .ini file to activate the functionality ? Any idea to proceed ? Thanks. 
Current configuration : # ibv_devinfo hca_id: mlx4_0 fw_ver: 2.6.000 node_guid: 0002:c903:0000:2070 sys_image_guid: 0002:c903:0000:2073 vendor_id: 0x02c9 vendor_part_id: 25418 hw_ver: 0xA0 board_id: MT_04A0110002 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 1 port_lmc: 0x00 port: 2 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 0 port_lid: 0 port_lmc: 0x00 Tziporet Koren wrote: > People requested that I will notify on the new ConnectX FW release > (2.6.0_ > > URL of firmware downloads homepage: > http://www.mellanox.com/content/pages.php?pg=firmware_download > Main changes and new features in this release include: > - Support at GA-level for VPI > - Support for QDR interoperability with InfiniScale IV switch platforms > For the full list of features and other details, please see the Release > Notes on the firmware page. > > Note: The following OFED 1.4 features can be activated only with FW > 2.6.0: > - Use the same device as one port IB and one port Eth. > - Fast register MR send queue work requests. > - Local DMA L_Key. > - Raw Ethertype QP support (one QP per port) -- receive only. > > Tziporet From yosefe at Voltaire.COM Thu Jan 8 09:00:07 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Thu, 08 Jan 2009 19:00:07 +0200 Subject: [ofa-general] [PATCH v2] ipoib: fix loss of connectivity after bonding failover on both sides Message-ID: <49663117.2060102@Voltaire.COM> Fix bonding failover in the case where both peers fail over and the gratuitous ARP is lost. In that case, the ipoib sender side will create an ipoib_neigh and issue a path request with the old gid first.
When skb->dst->neighbour->ha changes due to arp refresh, ipoib_neigh will not be added to the path->list of the path of the new mgid, because ipoib_neigh already exists. It will not have an ah either, because of sender-side failover. Therefore, it will not get an ah when the path is resolved. The solution here is to compare gids even if neigh->ah is invalid. Comparing with an uninitialized value of neigh->dgid is not worse than the situation before the patch. Signed-off-by: Moni Shoua Signed-off-by: Yossi Etigin --- Changes from v1: Remove unneeded changelog clause and style fix. Fix bugzilla 1286. drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 +++++++++++++++--------------- 1 file changed, 19 insertions(+), 19 deletions(-) Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2009-01-07 21:47:23.000000000 +0200 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2009-01-08 18:54:33.000000000 +0200 @@ -687,26 +687,26 @@ static int ipoib_start_xmit(struct sk_bu neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (neigh->ah) - if (unlikely((memcmp(&neigh->dgid.raw, - skb->dst->neighbour->ha + 4, - sizeof(union ib_gid))) || - (neigh->dev != dev))) { - spin_lock_irqsave(&priv->lock, flags); - /* - * It's safe to call ipoib_put_ah() inside - * priv->lock here, because we know that - * path->ah will always hold one more reference, - * so ipoib_put_ah() will never do more than - * decrement the ref count. - */ + if (unlikely((memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid))) || + (neigh->dev != dev))) { + spin_lock_irqsave(&priv->lock, flags); + /* + * It's safe to call ipoib_put_ah() inside + * priv->lock here, because we know that + * path->ah will always hold one more reference, + * so ipoib_put_ah() will never do more than + * decrement the ref count. 
+ */ + if (neigh->ah) ipoib_put_ah(neigh->ah); - list_del(&neigh->list); - ipoib_neigh_free(dev, neigh); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_path_lookup(skb, dev); - return NETDEV_TX_OK; - } + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_path_lookup(skb, dev); + return NETDEV_TX_OK; + } if (ipoib_cm_get(neigh)) { if (ipoib_cm_up(neigh)) { -- --Yossi From sashak at voltaire.com Thu Jan 8 09:20:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 8 Jan 2009 19:20:24 +0200 Subject: [ofa-general] ***SPAM*** Query regarding smpdump and smpquery tools In-Reply-To: <63b790b10901080533g493c00aex584b39be5a72cdc@mail.gmail.com> References: <63b790b10901070654u1b75e59je9e9d488dd9328cb@mail.gmail.com> <20090107151241.GD11759@sashak.voltaire.com> <63b790b10901070725x3d9bf961q5642bf2c4b9b2bb9@mail.gmail.com> <20090107160345.GE11759@sashak.voltaire.com> <63b790b10901080533g493c00aex584b39be5a72cdc@mail.gmail.com> Message-ID: <20090108172024.GC1441@sashak.voltaire.com> On 19:03 Thu 08 Jan , Buchibabu Chennupati wrote: > I have an application which emulates smpquery and smpdump. > It uses madrpc_init for initialization and then uses umad_send and umad_recv > to send and receive packets. It works for smpquery but doesn't work for > smpdump. I have verified the contents of the packets through my application > and through infiniband-diags tools smpdump and smpquery. Everything looks > correct but not sure why smpdump was failing in my application. It is hard to diagnose the problem without seeing the code. You can see example of using madrpc_init() with umad_send(). 
umad_recv() at:

http://www.openfabrics.org/git/?p=~sashak/ibsim.git;a=blob;f=tests/mcast_storm.c

Sasha

From jeff at splitrockpr.com Thu Jan 8 10:16:31 2009
From: jeff at splitrockpr.com (Jeffrey Scott)
Date: Thu, 08 Jan 2009 10:16:31 -0800
Subject: [ofa-general] Speaking Opportunities -- OpenFabrics Sonoma Workshop
Message-ID:

Speaking opportunities still available for Sonoma Workshop

The OpenFabrics Alliance is hosting the 5th Annual International Sonoma Workshop from March 22-25, 2009. The agenda is filling up fast, but there are still a few speaking opportunities available. If you have a topic that you'd like to cover at Sonoma, please let us know by Monday, January 19.

Opportunities to present at Sonoma are open to OFA members and non-members alike. Presentations can be made by end users, developers, interconnect and storage vendors, and anyone else in the OpenFabrics community.

If you would like to speak at the Workshop, please submit a ONE-PAGE proposal for a 30-minute presentation. It should include:

- Title (5-6 words)
- Abstract (one paragraph)
- The presenter's job title, address, telephone, email
- Brief description of the presenter's organization (1-2 sentences)

Proposals should be submitted to the OpenFabrics Marketing Working Group via Jeff Scott at jeff at splitrockpr.com.

To view presentations given at the last Sonoma Workshop, visit http://www.openfabrics.org/archives/april2008sonoma.htm.

From alex.estrin at qlogic.com Thu Jan 8 11:05:23 2009
From: alex.estrin at qlogic.com (Alex Estrin)
Date: Thu, 8 Jan 2009 13:05:23 -0600
Subject: [ofa-general] [PATCH] ipoib: failure during startup wiith non-default pkey set.
In-Reply-To: References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624757@MNEXMB1.qlogic.org> Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624907@MNEXMB1.qlogic.org> Hello, > b) I understand the issue you're trying to fix, but thinking > about this, > it seems that rather than picking the first entry in the p_key table > happens to be for the main IPoIB interface, it would be simpler to > understand if we just always used P_Key 0xffff for the main interface > and let the user create whatever other interfaces desired for other > P_Keys. That scenario requires SM configuration as well as operator intervention to the host in order to create secure subnet environment. Maintaining secure fabric from one spot(from Subnet Manager) is much simpler, more reliable and host doesn't need to be reconfigured manually. Main IPoIB interface will pick up a new p_key automatically if follow "first entry of p_key table" convention. Compare it to ethernet vlans configured from the switch, when ethernet host keeps using it's standard eth0 without any change. >Then there wouldn't be any race, and the situation would be > easy to understand and manage. > > - R. Thanks, Alex. From sashak at voltaire.com Thu Jan 8 11:15:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 8 Jan 2009 21:15:11 +0200 Subject: [ofa-general] [PATCH] ibutils: remove -libcommon linkage flag Message-ID: <20090108191510.GD1441@sashak.voltaire.com> Remove -libcommon linkage flag - libibumad doesn't depend from libibcommon anymore and libibcommon will be removed from management tree soon. Signed-off-by: Sasha Khapyorsky --- BTW, all osm.m4 files are equivalent under ibutils tree, IMHO it could be useful to merge them. 
config/osm.m4 | 2 +- ibis/config/osm.m4 | 2 +- ibmgtsim/config/osm.m4 | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/config/osm.m4 b/config/osm.m4 index da9ae81..f8d92d7 100644 --- a/config/osm.m4 +++ b/config/osm.m4 @@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then elif test -L $with_osm_libs/libopensm.so; then OSM_VENDOR=openib osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB " - OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon" + OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad" else AC_MSG_ERROR([OSM: Fail to recognize vendor type]) fi diff --git a/ibis/config/osm.m4 b/ibis/config/osm.m4 index da9ae81..f8d92d7 100644 --- a/ibis/config/osm.m4 +++ b/ibis/config/osm.m4 @@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then elif test -L $with_osm_libs/libopensm.so; then OSM_VENDOR=openib osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB " - OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon" + OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad" else AC_MSG_ERROR([OSM: Fail to recognize vendor type]) fi diff --git a/ibmgtsim/config/osm.m4 b/ibmgtsim/config/osm.m4 index da9ae81..f8d92d7 100644 --- a/ibmgtsim/config/osm.m4 +++ b/ibmgtsim/config/osm.m4 @@ -137,7 +137,7 @@ if test "x$libcheck" = "xtrue"; then elif test -L $with_osm_libs/libopensm.so; then OSM_VENDOR=openib osm_vendor_sel="-DOSM_VENDOR_INTF_OPENIB " - OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad -libcommon" + OSM_LDFLAGS="$OSM_LDFLAGS -lopensm -losmvendor -losmcomp -libumad" else AC_MSG_ERROR([OSM: Fail to recognize vendor type]) fi -- 1.6.1.rc1.45.g123ed From markus.uhlmann at ifh.uka.de Thu Jan 8 13:01:25 2009 From: markus.uhlmann at ifh.uka.de (Markus Uhlmann) Date: Thu, 8 Jan 2009 22:01:25 +0100 Subject: [ofa-general] installing ofed-1.4 Message-ID: <18790.27045.398492.560860@ifh-tuww01.ifh.uni-karlsruhe.de> hello, i am installing ofed-1.4 (as of 8. jan. 
2009) on a 64 bit debian system.

after installing a number of rpms without problems, the installation discontinues with the following error (from the log file):

--------------------------------------------
Processing files: kernel-ib-1.4-2.6.18_6_amd64
error: File not found: /var/tmp/OFED/lib/modules/2.6.18-6-amd64/updates/kernel/drivers/net/cxgb3
error: File not found: /var/tmp/OFED/lib/modules/2.6.18-6-amd64/updates/kernel/drivers/net/mlx4
--------------------------------------------

these directories were apparently not created during preceding installation steps. however, the modules have indeed been built correctly, here:

/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/2.6.18-6-amd64/extra/drivers/net/cxgb3/
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/2.6.18/extra/drivers/net/mlx4/

(containing kernel modules "cxgb3.ko" and "mlx4_core.ko", "mlx4_en.ko")

there must be some confusion in the install script, or wherever those paths come from. is there a way to adjust that?

thanks,
mu

ps: this is near the top of the log-file:
Installing /root/non-free/infiniband/OFED-1.4-20090108-0600/SRPMS/ofa_kernel-1.4-ofed20090108.src.rpm

From rdreier at cisco.com Thu Jan 8 14:13:16 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 Jan 2009 14:13:16 -0800
Subject: [ofa-general] Re: Usage of checkpatch
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB3E746248BE@MNEXMB1.qlogic.org> (Mike Heinz's message of "Thu, 8 Jan 2009 08:17:09 -0600")
References: <1228698773-26528-1-git-send-email-ddiss@sgi.com> <4C2744E8AD2982428C5BFE523DF8CDCB3E746248BE@MNEXMB1.qlogic.org>
Message-ID:

> I've noticed that you've mentioned checkpatch.pl a few times; I'd like to think of myself as a good programmer, so I'd like to use it myself - but I'm running into a problem: checkpatch.pl doesn't seem to be included in any of the distros I have installed (RedHat, CentOS, SuSE).
>
> Do I need to be running a kernel.org kernel in order to get this script?
checkpatch.pl is in the kernel source in the scripts directory. You don't have to run any particular kernel to run it (it's just a perl script) but the easiest way to get it is to use an up-to-date kernel tree (which you need anyway if you're doing kernel development).

 - R.

From rdreier at cisco.com Thu Jan 8 14:16:23 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 08 Jan 2009 14:16:23 -0800
Subject: [ofa-general] Re: About create_qp_flags merging
In-Reply-To: (Ron Livne's message of "Thu, 8 Jan 2009 16:39:01 +0200")
References:
Message-ID:

> You've told me that you're not going to merge the create_qp_flags
> patches because they are stuck behind XRC.
>
> However, I reposted new patches in December that don't rely on the XRC
> patches.

a) your patches look pointless, because they add a new userspace interface and then don't actually hook anything in libibverbs up to the new interface.

b) is there any reason why your patches are so important that I should skip merging the XRC work that was submitted before yours and jump to apply yours first?

 - R.

From raghuarur at gmail.com Thu Jan 8 17:26:23 2009
From: raghuarur at gmail.com (Raghu Arur)
Date: Thu, 8 Jan 2009 17:26:23 -0800
Subject: [ofa-general] Qp timeouts
Message-ID: <90a961640901081726v57bfd7c6pfbadacff5fe2efb6@mail.gmail.com>

We have a setup where two nodes establish send and receive connections to talk to each other, and these two are managed separately. But the problem is when one of the nodes gets power-cycled, the receive qp never gets notified about the other end dying and it just hangs around forever. Is there a keep-alive that is maintained between the connection managers of the two nodes? Is there a timeout that can be set on the qp so that these kinds of events get notified fast?

Thanks,
Raghu.
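
One way to get the kind of notification asked about above is a liveness check kept by the application itself: treat any received message as proof that the peer is alive, send a small keep-alive when the connection has been idle, and declare the peer dead after a fixed period of silence. The sketch below shows only that bookkeeping in plain C. The struct, the function names, and the millisecond timestamps are all hypothetical and are not part of libibverbs or the CM; actually posting the keep-alive send and tearing down the QP are left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection liveness state (not a libibverbs type). */
struct peer_liveness {
	uint64_t last_heard_ms; /* when we last received anything from the peer */
	uint64_t last_sent_ms;  /* when we last sent anything to the peer */
	uint64_t interval_ms;   /* idle time after which a keep-alive is due */
	uint64_t timeout_ms;    /* silence after which the peer is declared dead */
};

/* Start tracking a freshly established connection at time now_ms. */
static inline void liveness_init(struct peer_liveness *pl, uint64_t now_ms,
				 uint64_t interval_ms, uint64_t timeout_ms)
{
	/* The timeout must leave room for at least one keep-alive. */
	assert(interval_ms > 0 && timeout_ms > interval_ms);
	pl->last_heard_ms = now_ms;
	pl->last_sent_ms = now_ms;
	pl->interval_ms = interval_ms;
	pl->timeout_ms = timeout_ms;
}

/* Call whenever any message (data or keep-alive) arrives from the peer. */
static inline void liveness_on_recv(struct peer_liveness *pl, uint64_t now_ms)
{
	pl->last_heard_ms = now_ms;
}

/* Call whenever any message (data or keep-alive) is sent to the peer. */
static inline void liveness_on_send(struct peer_liveness *pl, uint64_t now_ms)
{
	pl->last_sent_ms = now_ms;
}

/* True if we have been quiet long enough that a keep-alive should go out. */
static inline bool liveness_should_send(const struct peer_liveness *pl,
					uint64_t now_ms)
{
	return now_ms - pl->last_sent_ms >= pl->interval_ms;
}

/* True if the peer has been silent for the whole timeout period. */
static inline bool liveness_peer_dead(const struct peer_liveness *pl,
				      uint64_t now_ms)
{
	return now_ms - pl->last_heard_ms >= pl->timeout_ms;
}
```

A caller would invoke liveness_on_recv() from its receive-completion path, check liveness_should_send() and liveness_peer_dead() from a periodic timer, and pick a timeout several intervals long so that a single lost keep-alive does not tear the connection down.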
From vlad at lists.openfabrics.org Fri Jan 9 03:13:50 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 9 Jan 2009 03:13:50 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090109-0200 daily build status Message-ID: <20090109111350.35DD2E60DEA@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From yosefe at Voltaire.COM Fri Jan 9 07:36:52 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Fri, 09 Jan 2009 17:36:52 +0200 Subject: [ofa-general] Re: [PATCH v3] ipoib: do not join broadcast group if interface is brought down In-Reply-To: References: <495B2C60.6020008@Voltaire.COM> <49650FC2.4070509@Voltaire.COM> Message-ID: <49676F14.8050005@Voltaire.COM> Roland Dreier wrote: > > OK, so the race exists in the current code, but no one has hit it enough > to track it down. Maybe we can fix it later. > > But you ignored my second question about lock nesting problems? > > - R. How about just this one? We stop the task instead of re-joining the broadcast if the interface is brought down. I tested it and it also solves the bug. -- Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2009-01-09 14:02:23.000000000 +0200 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2009-01-09 16:57:50.000000000 +0200 @@ -570,6 +570,9 @@ void ipoib_mcast_join_task(struct work_s if (!priv->broadcast) { struct ipoib_mcast *broadcast; + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + broadcast = ipoib_mcast_alloc(dev, 1); if (!broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); --Yossi From boris at mellanox.com Fri Jan 9 09:49:35 2009 From: boris at mellanox.com (Boris Shpolyansky) Date: Fri, 9 Jan 2009 09:49:35 -0800 Subject: [ofa-general] Configuring a 4 KB InfniBand link MTU References: Message-ID: <1E3DCD1C63492545881FACB6063A57C103416435@mtiexch01> James, Mellanox InfiniScale III switch chip does support 4K MTU as stated in the product brief. 
However it requires special FW settings that might or might not be available from/supported by particular switch system vendor. Boris Shpolyansky Sr. Member of Technical Staff, Applications Mellanox Technologies Inc. 350 Oakmead Parkway Sunnyvale, CA 94085 Tel.: (408) 916 0014 Fax: (408) 970 3403 Cell: (408) 834 9365 www.mellanox.com -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of James Lentini Sent: Thursday, January 08, 2009 7:25 AM To: Hal Rosenstock Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Configuring a 4 KB InfniBand link MTU On Wed, 7 Jan 2009, Hal Rosenstock wrote: > If the switch ports support a 4K MTU cap, it needs to be advertised > (via the PortInfo response). That's what needs fixing if possible > rather than some SM policy. Without an MTUCap of 4K on both sides of > the link, an SM cannot set NeighborMTU to 4K. Thanks Hal. I'll do some more research on my switch to see if it is indeed capable of a 4 KB MTU. This datasheet on the MT47396 lists a 4 KB MTU as one of its features: http://www.mellanox.com/related-docs/prod_silicon/InfiniScaleIII.pdf Perhaps the version of the chip I have doesn't support a 4 KB MTU or the switch hardware/firmware doesn't support it. 
> > # smpquery portinfo 2 1
> > # Port info: Lid 2 port 1
> > Mkey:............................0x0000000000000000
> > GidPrefix:.......................0x0000000000000000
> > Lid:.............................0x0000
> > SMLid:...........................0x0000
> > CapMask:.........................0x0
> > DiagCode:........................0x0000
> > MkeyLeasePeriod:.................0
> > LocalPort:.......................1
> > LinkWidthEnabled:................4X
> > LinkWidthSupported:..............1X or 4X
> > LinkWidthActive:.................4X
> > LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
> > LinkState:.......................Active
> > PhysLinkState:...................LinkUp
> > LinkDownDefState:................Polling
> > ProtectBits:.....................0
> > LMC:.............................0
> > LinkSpeedActive:.................5.0 Gbps
> > LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
> > NeighborMTU:.....................2048
> > SMSL:............................0
> > VLCap:...........................VL0-7
> > InitType:........................0x00
> > VLHighLimit:.....................0
> > VLArbHighCap:....................8
> > VLArbLowCap:.....................8
> > InitReply:.......................0x00
> > MtuCap:..........................2048
> > VLStallCount:....................7
> > HoqLife:.........................18
> > OperVLs:.........................VL0-7
> > PartEnforceInb:..................0
> > PartEnforceOutb:.................0
> > FilterRawInb:....................0
> > FilterRawOutb:...................0
> > MkeyViolations:..................0
> > PkeyViolations:..................0
> > QkeyViolations:..................0
> > GuidCap:.........................0
> > ClientReregister:................0
> > SubnetTimeout:...................0
> > RespTimeVal:.....................0
> > LocalPhysErr:....................15
> > OverrunErr:......................15
> > MaxCreditHint:...................0
> > RoundTrip:.......................0
> >
> > and I confirmed that the switch had a MT47396:
> >
> > # smpquery nodeinfo 2
> > # Node info: Lid 2
> > BaseVers:........................1
> > ClassVers:.......................1
> > NodeType:........................Switch
> > NumPorts:........................24
> > SystemGuid:......................0x0005ad03011df356
> > Guid:............................0x0005ad00001df356
> > PortGuid:........................0x0005ad00001df356
> > PartCap:.........................8
> > DevId:...........................0xb924
> > Revision:........................0x000001a1
> > LocalPort:.......................1
> > VendorId:........................0x0005ad

From rdreier at cisco.com Fri Jan 9 10:37:35 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 09 Jan 2009 10:37:35 -0800
Subject: [ofa-general] Qp timeouts
In-Reply-To: <90a961640901081726v57bfd7c6pfbadacff5fe2efb6@mail.gmail.com> (Raghu Arur's message of "Thu, 8 Jan 2009 17:26:23 -0800")
References: <90a961640901081726v57bfd7c6pfbadacff5fe2efb6@mail.gmail.com>
Message-ID:

> But the problem is when one of the nodes gets power-cycled, the receive
> qp never gets notified about the other end dying and it just hangs
> around forever. Is there a keep-alive that is maintained between the
> connection managers of the two nodes? Is there a timeout that can be
> set on the qp so that these kinds of events get notified fast?

No, if you don't explicitly send anything then nothing will be sent automatically.
So if you want keep-alives, you will need to implement them at the application level. - R. From rdreier at cisco.com Fri Jan 9 13:14:45 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Jan 2009 13:14:45 -0800 Subject: [ofa-general] [PATCH] mlx4_core: Fix warning from min() Message-ID: From: Roland Dreier Recent cpumask changes changed num_possible_cpus() from returning an int to returning an unsigned int. This means that doing min(num_possible_cpus(), ) now produces a warning like drivers/net/mlx4/main.c: In function 'mlx4_enable_msi_x': drivers/net/mlx4/main.c:915: warning: comparison of distinct pointer types lacks a cast Fix this by using min_t(int, ...). Signed-off-by: Roland Dreier --- I'll merge this in my next batch... drivers/net/mlx4/main.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 710c79e..6ef2490 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -912,8 +912,8 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) int i; if (msi_x) { - nreq = min(dev->caps.num_eqs - dev->caps.reserved_eqs, - num_possible_cpus() + 1); + nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs, + num_possible_cpus() + 1); entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL); if (!entries) goto no_msi; -- 1.6.0.4 From rdreier at cisco.com Fri Jan 9 13:24:08 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Jan 2009 13:24:08 -0800 Subject: [ofa-general] [PATCH] IB/mlx4: Don't register IB device for adapters with no IB ports Message-ID: If the mlx4_ib driver finds an adapter that has only ethernet ports, the current code will register an IB device with 0 ports. Nothing useful or sensible can be done with such a device, so just skip registering it. Signed-off-by: Roland Dreier --- I'll merge this too unless someone objects strongly. 
Otherwise if you have a system with a ConnectX NIC and a ConnectX IB HCA, you get strange results (two mlx4 IB devices, only one of which has any ports) drivers/infiniband/hw/mlx4/main.c | 13 +++++++++---- 1 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index dcefe1f..61588bd 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -543,14 +543,21 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) { static int mlx4_ib_version_printed; struct mlx4_ib_dev *ibdev; + int num_ports = 0; int i; - if (!mlx4_ib_version_printed) { printk(KERN_INFO "%s", mlx4_ib_version); ++mlx4_ib_version_printed; } + mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) + num_ports++; + + /* No point in registering a device with no ports... */ + if (num_ports == 0) + return NULL; + ibdev = (struct mlx4_ib_dev *) ib_alloc_device(sizeof *ibdev); if (!ibdev) { dev_err(&dev->pdev->dev, "Device struct alloc failed\n"); @@ -574,9 +581,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.owner = THIS_MODULE; ibdev->ib_dev.node_type = RDMA_NODE_IB_CA; ibdev->ib_dev.local_dma_lkey = dev->caps.reserved_lkey; - ibdev->num_ports = 0; - mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) - ibdev->num_ports++; + ibdev->num_ports = num_ports; ibdev->ib_dev.phys_port_cnt = ibdev->num_ports; ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; -- 1.6.0.4 From rdreier at cisco.com Fri Jan 9 13:45:09 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 09 Jan 2009 13:45:09 -0800 Subject: [ofa-general] [PATCH] IB/mlx4: Don't register IB device for adapters with no IB ports In-Reply-To: (Roland Dreier's message of "Fri, 09 Jan 2009 13:24:08 -0800") References: Message-ID: By the way Yevgeny, it seems mlx4_en probably wants something similar? 
Otherwise it seems mlx4_en will allocate a bunch of resources for IB HCAs but not actually create a netdevice... > If the mlx4_ib driver finds an adapter that has only ethernet ports, the > current code will register an IB device with 0 ports. Nothing useful or > sensible can be done with such a device, so just skip registering it. > > Signed-off-by: Roland Dreier > --- > I'll merge this too unless someone objects strongly. Otherwise if you > have a system with a ConnectX NIC and a ConnectX IB HCA, you get strange > results (two mlx4 IB devices, only one of which has any ports) > > drivers/infiniband/hw/mlx4/main.c | 13 +++++++++---- > 1 files changed, 9 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c > index dcefe1f..61588bd 100644 > --- a/drivers/infiniband/hw/mlx4/main.c > +++ b/drivers/infiniband/hw/mlx4/main.c > @@ -543,14 +543,21 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) > { > static int mlx4_ib_version_printed; > struct mlx4_ib_dev *ibdev; > + int num_ports = 0; > int i; > > - > if (!mlx4_ib_version_printed) { > printk(KERN_INFO "%s", mlx4_ib_version); > ++mlx4_ib_version_printed; > } > > + mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) > + num_ports++; > + > + /* No point in registering a device with no ports... 
*/
> + if (num_ports == 0)
> + return NULL;
> +
> ibdev = (struct mlx4_ib_dev *) ib_alloc_device(sizeof *ibdev);
> if (!ibdev) {
> dev_err(&dev->pdev->dev, "Device struct alloc failed\n");
> @@ -574,9 +581,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
> ibdev->ib_dev.owner = THIS_MODULE;
> ibdev->ib_dev.node_type = RDMA_NODE_IB_CA;
> ibdev->ib_dev.local_dma_lkey = dev->caps.reserved_lkey;
> - ibdev->num_ports = 0;
> - mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB)
> - ibdev->num_ports++;
> + ibdev->num_ports = num_ports;
> ibdev->ib_dev.phys_port_cnt = ibdev->num_ports;
> ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors;
> ibdev->ib_dev.dma_device = &dev->pdev->dev;

From rdreier at cisco.com Fri Jan 9 14:05:26 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 09 Jan 2009 14:05:26 -0800
Subject: [ofa-general] [PATCH v2] ipoib: fix loss of connectivity after bonding failover on both sides
In-Reply-To: <49663117.2060102@Voltaire.COM> (Yossi Etigin's message of "Thu, 08 Jan 2009 19:00:07 +0200")
References: <49663117.2060102@Voltaire.COM>
Message-ID:

applied, thanks a lot

From weiny2 at llnl.gov Fri Jan 9 15:47:42 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 9 Jan 2009 15:47:42 -0800
Subject: [ofa-general] [PATCH 0/3 - no ibcommon] Resubmit libibnetdiscover patches without libibcommon dependancy
Message-ID: <20090109154742.07016fa3.weiny2@llnl.gov>

I wanted to ping you about the status of these patches but then realized that they would have to be regenerated without libibcommon.

Here is a new patch series for libibnetdisc without the libibcommon dependency. These apply to the latest master.
Thanks, Ira From weiny2 at llnl.gov Fri Jan 9 15:47:49 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 9 Jan 2009 15:47:49 -0800 Subject: [ofa-general] ***SPAM*** [PATCH 1/3 - no ibcommon] Create a new library libibnetdisc Message-ID: <20090109154749.4b19c8bf.weiny2@llnl.gov> >From 677ca6d7ead4b720b2ba260cb35aca429190b6e8 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 26 Nov 2008 12:54:47 -0800 Subject: [PATCH] Create a new library libibnetdisc This encompasses the functionality of ibnetdiscover in a C library. It returns a single "ibnd_fabric_t" object which represents the data found during the scan. The NodeInfo, PortInfo, and SwitchInfo are preserved from the queries made on the fabric to be used by the calling function as they see fit. This greatly benefits some diags like iblinkinfo.pl. This diag in particular was re-written using this library in C and has shown an 85% speed up on a ~1000 node cluster. Previous iblinkinfo.pl real 3m35.876s user 0m13.210s sys 1m1.046s New iblinkinfotest real 0m32.869s user 0m0.067s sys 0m0.140s Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/Makefile.am | 1 + infiniband-diags/configure.in | 24 +- infiniband-diags/libibnetdisc/Makefile.am | 62 ++ .../libibnetdisc/include/infiniband/ibnetdisc.h | 282 +++++++ infiniband-diags/libibnetdisc/libibnetdisc.ver | 9 + infiniband-diags/libibnetdisc/man/ibnd_debug.3 | 2 + .../libibnetdisc/man/ibnd_destroy_fabric.3 | 2 + .../libibnetdisc/man/ibnd_discover_fabric.3 | 49 ++ .../libibnetdisc/man/ibnd_find_node_dr.3 | 2 + .../libibnetdisc/man/ibnd_find_node_guid.3 | 25 + .../libibnetdisc/man/ibnd_iter_nodes.3 | 24 + .../libibnetdisc/man/ibnd_iter_nodes_type.3 | 2 + .../libibnetdisc/man/ibnd_linkspeed_str.3 | 2 + .../libibnetdisc/man/ibnd_linkstate_str.3 | 2 + .../libibnetdisc/man/ibnd_linkwidth_str.3 | 26 + .../libibnetdisc/man/ibnd_node_type_str.3 | 2 + .../libibnetdisc/man/ibnd_node_type_str_short.3 | 2 + .../libibnetdisc/man/ibnd_physstate_str.3 | 2 + 
.../libibnetdisc/man/ibnd_show_progress.3 | 2 + .../libibnetdisc/man/ibnd_update_node.3 | 21 + infiniband-diags/libibnetdisc/src/chassis.c | 817 ++++++++++++++++++ infiniband-diags/libibnetdisc/src/chassis.h | 85 ++ infiniband-diags/libibnetdisc/src/ibnetdisc.c | 871 ++++++++++++++++++++ infiniband-diags/libibnetdisc/src/internal.h | 82 ++ infiniband-diags/libibnetdisc/src/libibnetdisc.map | 27 + .../libibnetdisc/test/iblinkinfotest.c | 395 +++++++++ infiniband-diags/libibnetdisc/test/ibnetdisctest.c | 680 +++++++++++++++ infiniband-diags/libibnetdisc/test/testleaks.c | 179 ++++ 28 files changed, 3676 insertions(+), 3 deletions(-) create mode 100644 infiniband-diags/libibnetdisc/Makefile.am create mode 100644 infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h create mode 100644 infiniband-diags/libibnetdisc/libibnetdisc.ver create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_debug.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_linkspeed_str.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_linkstate_str.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_linkwidth_str.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_node_type_str.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_node_type_str_short.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_physstate_str.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_update_node.3 create mode 100644 
infiniband-diags/libibnetdisc/src/chassis.c create mode 100644 infiniband-diags/libibnetdisc/src/chassis.h create mode 100644 infiniband-diags/libibnetdisc/src/ibnetdisc.c create mode 100644 infiniband-diags/libibnetdisc/src/internal.h create mode 100644 infiniband-diags/libibnetdisc/src/libibnetdisc.map create mode 100644 infiniband-diags/libibnetdisc/test/iblinkinfotest.c create mode 100644 infiniband-diags/libibnetdisc/test/ibnetdisctest.c create mode 100644 infiniband-diags/libibnetdisc/test/testleaks.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index c22ba5e..8e8c3c1 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,3 +1,4 @@ +SUBDIRS = libibnetdisc INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 58eea0a..c5b437d 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -48,7 +48,7 @@ fi dnl Checks for header files. 
AC_HEADER_STDC -AC_CHECK_HEADERS([stdlib.h string.h unistd.h fcntl.h inttypes.h netinet/in.h sys/ioctl.h syslog.h]) +AC_CHECK_HEADERS([stdlib.h string.h unistd.h fcntl.h inttypes.h netinet/in.h sys/ioctl.h]) if test "$disable_libcheck" != "yes" then AC_CHECK_HEADER(infiniband/umad.h, [], @@ -70,7 +70,7 @@ AC_C_CONST dnl Check if we should include test utilities AC_MSG_CHECKING(for --enable-test-utils) AC_ARG_ENABLE(test-utils, -[ --enable-test-utils build additional test utilities], +[ --enable-test-utils build additional test utilities (default=no)], [case "${enableval}" in yes) tutils=yes ;; no) tutils=no ;; @@ -140,6 +140,23 @@ IBSCRIPTPATH_TMP2="`echo $IBSCRIPTPATH_TMP1 | sed 's/^NONE/$ac_default_prefix/'` IBSCRIPTPATH="`eval echo $IBSCRIPTPATH_TMP2`" AC_SUBST(IBSCRIPTPATH) +dnl Begin libibnetdisc stuff +AC_CHECK_HEADERS([stdint.h]) +AC_CHECK_FUNCS([strtoull]) + +ibnetdisc_api_version=`grep LIBVERSION $srcdir/libibnetdisc/libibnetdisc.ver | sed 's/LIBVERSION=//'` +if test -z $ibnetdisc_api_version; then + echo "FAILED to find $srcdir/libibnetdisc/libibnetdisc.ver" + exit 1 +fi +AC_SUBST(ibnetdisc_api_version) +AC_DEFINE_UNQUOTED(API_VERSION, + ["$ibnetdisc_api_version"], + [The API version of this library]) + +dnl End libibnetdisc stuff + + AC_CONFIG_FILES([\ Makefile \ infiniband-diags.spec \ @@ -160,6 +177,7 @@ AC_CONFIG_FILES([\ scripts/ibhosts \ scripts/ibnodes \ scripts/ibswitches \ - scripts/ibrouters + scripts/ibrouters \ + libibnetdisc/Makefile ]) AC_OUTPUT diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am new file mode 100644 index 0000000..8e0e16b --- /dev/null +++ b/infiniband-diags/libibnetdisc/Makefile.am @@ -0,0 +1,62 @@ + +#SUBDIRS = . 
+ +INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband + +lib_LTLIBRARIES = libibnetdisc.la +sbin_PROGRAMS = + +if ENABLE_TEST_UTILS +sbin_PROGRAMS += test/ibnetdisctest \ + test/iblinkinfotest \ + test/testleaks +endif + +DBGFLAGS = -g + +if HAVE_LD_VERSION_SCRIPT +libibnetdisc_version_script = -Wl,--version-script=$(srcdir)/src/libibnetdisc.map +else +libibnetdisc_version_script = +endif + +libibnetdisc_la_SOURCES = src/ibnetdisc.c src/chassis.c src/chassis.h +libibnetdisc_la_CFLAGS = -Wall $(DBGFLAGS) +libibnetdisc_la_LDFLAGS = -version-info $(ibnetdisc_api_version) \ + -export-dynamic $(libibnetdisc_version_script) \ + -losmcomp -libmad +libibnetdisc_la_DEPENDENCIES = $(srcdir)/src/libibnetdisc.map + +libibnetdiscincludedir = $(includedir)/infiniband + +test_ibnetdisctest_SOURCES = test/ibnetdisctest.c +test_ibnetdisctest_CFLAGS = -Wall $(DBGFLAGS) +test_ibnetdisctest_LDFLAGS = -libnetdisc + +test_iblinkinfotest_SOURCES = test/iblinkinfotest.c +test_iblinkinfotest_CFLAGS = -Wall $(DBGFLAGS) +test_iblinkinfotest_LDFLAGS = -libnetdisc + +test_testleaks_SOURCES = test/testleaks.c +test_testleaks_CFLAGS = -Wall $(DBGFLAGS) +test_testleaks_LDFLAGS = -libnetdisc + +libibnetdiscinclude_HEADERS = $(srcdir)/include/infiniband/ibnetdisc.h + +man_MANS = man/ibnd_debug.3 \ + man/ibnd_destroy_fabric.3 \ + man/ibnd_discover_fabric.3 \ + man/ibnd_find_node_dr.3 \ + man/ibnd_find_node_guid.3 \ + man/ibnd_iter_nodes.3 \ + man/ibnd_iter_nodes_type.3 \ + man/ibnd_linkspeed_str.3 \ + man/ibnd_linkstate_str.3 \ + man/ibnd_linkwidth_str.3 \ + man/ibnd_node_type_str.3 \ + man/ibnd_physstate_str.3 \ + man/ibnd_update_node.3 \ + man/ibnd_show_progress.3 + +EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver autogen.sh + diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h new file mode 100644 index 0000000..773c64b --- /dev/null +++ 
b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -0,0 +1,282 @@ +/* + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _IBNETDISC_H_ +#define _IBNETDISC_H_ + +#include +#include + +#define MAXHOPS 63 + +/* HASH table defines */ +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) +#define HTSZ 137 + +#define IBND_DEBUG(...) \ + if (ibdebug) { \ + printf("%s:%d; ", __FILE__, __LINE__); \ + printf(__VA_ARGS__); \ + } +#define IBND_ERROR(...) 
\ + { \ + fprintf(stderr, "%s:%d; ", __FILE__, __LINE__); \ + fprintf(stderr, __VA_ARGS__); \ + } + +/** ========================================================================= + * ENUM definitions + */ +typedef enum { + IBND_CA_NODE = 1, + IBND_SWITCH_NODE = 2, + IBND_ROUTER_NODE = 3 +} ibnd_node_type_t; + +typedef enum { + IBND_LINK_DOWN = 1, + IBND_LINK_INIT = 2, + IBND_LINK_ARMED = 3, + IBND_LINK_ACTIVE = 4 +} ibnd_link_state_t; + +/** ========================================================================= + * Node + */ +typedef struct switch_info { + int smaenhsp0; +} ibnd_switch_info_t; + +typedef struct node_info { + int base_ver; + int class_ver; + int type; + int numports; + uint64_t sysimgguid; + uint64_t nodeguid; + uint64_t nodeportguid; + uint16_t partition_cap; + uint32_t devid; + uint32_t revision; + int localport; + uint32_t vendid; +} ibnd_node_info_t; + +struct ib_fabric; /* forward declare */ +struct chassis; /* forward declare */ +struct port; /* forward declare */ + +typedef struct node { + struct node *next; /* all node list in fabric */ + struct ib_fabric *fabric; /* the fabric node belongs to */ + + ib_portid_t path_portid; /* path from "from_node" */ + int dist; /* num of hops from "from_node" */ + int smalid; + int smalmc; + ibnd_switch_info_t sw_info; + ibnd_node_info_t info; + char nodedesc[64]; + struct port **ports; /* in order array of port pointers */ + /* the size of this array is info.numports + 1 */ + /* items MAY BE NULL! 
(ie 0 == switches only) */ + + /* chassis info */ + struct node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ + struct chassis *chassis; /* if != NULL the chassis this node belongs to */ + unsigned char ch_type; + unsigned char ch_anafanum; + unsigned char ch_slotnum; + unsigned char ch_slot; +} ibnd_node_t; + +/** ========================================================================= + * Port + */ +typedef struct port_info { + int base_lid; + int smlid; + int link_speed_supported; + int link_speed_enabled; + int link_speed_active; + int port_state; + int phys_state; + int link_down_def_state; + int mkey_prot_bits; + int lmc; + int neighbor_mtu; + int smsl; + int init_type; + int vl_capability; + int vl_high_limit; + int vl_arb_high_cap; + int vl_arb_low_cap; + int init_reply; + int mtu_cap; + int vl_stall_count; + int hoq_lifetime; + int oper_vls; + int partition_enforce_in; + int partition_enforce_out; + int filter_raw_in; + int filter_raw_out; + int mkey_violations; + int pkey_violations; + int qkey_violations; + int guid_capabilities; + int client_rereg; + int subnet_timeout; + int response_time_val; + int local_phys_error; + int overrun_error; + int max_credit_hint; + uint32_t link_round_trip; + int local_port; + int link_width_supported; + int link_width_enabled; + int link_width_active; + int diag_code; + int mkey_lease; + uint32_t capability_mask; + uint64_t mkey; + uint64_t gid_prefix; +} ibnd_port_info_t; + +typedef struct port { + uint64_t guid; + int portnum; + int ext_portnum; /* optional if != 0 external port num */ + ibnd_node_t *node; /* node this port belongs to */ + ibnd_port_info_t info; + struct port *remoteport; /* null if SMA, or does not exist */ +} ibnd_port_t; + + +/** ========================================================================= + * Chassis data + */ +typedef struct chassis { + struct chassis *next; + uint64_t chassisguid; + int chassisnum; + + /* generic grouping by SystemImageGUID */ + int nodecount; + 
ibnd_node_t *nodes; + + /* specific to voltaire type nodes */ +#define SPINES_MAX_NUM 12 +#define LINES_MAX_NUM 36 + ibnd_node_t *spinenode[SPINES_MAX_NUM + 1]; + ibnd_node_t *linenode[LINES_MAX_NUM + 1]; +} ibnd_chassis_t; + +/** ========================================================================= + * Fabric + * Main fabric object which is returned and represents the data discovered + */ +typedef struct ib_fabric { + /* the node the discover was initiated from + * "from" parameter in ibnd_discover_fabric + * or by default the node you are running on + */ + ibnd_node_t *from_node; + /* NULL terminated list of all nodes in the fabric */ + ibnd_node_t *nodes; + /* NULL terminated list of all chassis found in the fabric */ + ibnd_chassis_t *chassis; + int maxhops_discovered; +} ibnd_fabric_t; + + +/** ========================================================================= + * Initialization (fabric operations) + */ +void ibnd_debug(int i); +void ibnd_show_progress(int i); + +ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, + int timeout_ms, ib_portid_t *from, int hops); + /** + * dev_name: (required) local device name to use to access the fabric + * dev_port: (required) local device port to use to access the fabric + * timeout_ms: (required) gives the timeout for a _SINGLE_ query on + * the fabric. So if there are multiple nodes not + * responding this may result in a lengthy delay. + * from: (optional) specify the node to start scanning from. + * If NULL start from the node we are running on. + * hops: (optional) Specify how much of the fabric to traverse. 
+ * negative value == scan entire fabric + */ +void ibnd_destroy_fabric(ibnd_fabric_t *fabric); + +/** ========================================================================= + * Node operations + */ +ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid); +ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str); +ibnd_node_t *ibnd_update_node(ibnd_node_t *node); + +typedef void (*ibnd_iter_node_func_t)(ibnd_node_t *node, void *user_data); +void ibnd_iter_nodes(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + void *user_data); +void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + ibnd_node_type_t node_type, + void *user_data); + +/** ========================================================================= + * Str convert functions + */ +char *ibnd_linkwidth_str(int link_width); +char *ibnd_linkstate_str(int link_state); +char *ibnd_physstate_str(int phys_state); +const char *ibnd_node_type_str(ibnd_node_t *node); +const char *ibnd_node_type_str_short(ibnd_node_t *node); +char *ibnd_linkspeed_str(int link_speed, int data_rate); + /* if data_rate == 0 use "SDR", "DDR", etc. */ + /* if data_rate == 1 use "2.5 Gbps", "5.0 Gbps", etc. 
*/ + +/** ========================================================================= + * Chassis queries + */ +uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum); +char *ibnd_get_chassis_type(ibnd_node_t *node); +char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size); + +int ibnd_is_xsigo_guid(uint64_t guid); +int ibnd_is_xsigo_tca(uint64_t guid); +int ibnd_is_xsigo_hca(uint64_t guid); + +#endif /* _IBNETDISC_H_ */ diff --git a/infiniband-diags/libibnetdisc/libibnetdisc.ver b/infiniband-diags/libibnetdisc/libibnetdisc.ver new file mode 100644 index 0000000..a0a5f3c --- /dev/null +++ b/infiniband-diags/libibnetdisc/libibnetdisc.ver @@ -0,0 +1,9 @@ +# In this file we track the current API version +# of the IB net discover interface (and libraries) +# The version is built of the following +# three numbers: +# API_REV:RUNNING_REV:AGE +# API_REV - advance on any added API +# RUNNING_REV - advance on any change to the vendor files +# AGE - number of backward versions the API still supports +LIBVERSION=1:0:0 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_debug.3 b/infiniband-diags/libibnetdisc/man/ibnd_debug.3 new file mode 100644 index 0000000..a4076fc --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_debug.3 @@ -0,0 +1,2 @@ +.\".TH IBND_DEBUG 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 new file mode 100644 index 0000000..8fe20ae --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 @@ -0,0 +1,2 @@ +.\".TH IBND_DESTROY_FABRIC 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 new file mode 100644 index 0000000..44d8c65 --- /dev/null +++ 
b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -0,0 +1,49 @@ +.TH IBND_DISCOVER_FABRIC 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug, ibnd_show_progress \- initialize ibnetdiscover library. +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" +.BI "void ibnd_debug(int i)" +.BI "void ibnd_show_progress(int i)" + + +.SH "DESCRIPTION" +.B ibnd_discover_fabric() +Discover the fabric connected to the port specified by dev_name and dev_port, using the specified timeout. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they choose. + +.B ibnd_destroy_fabric() +Free all memory and resources associated with the fabric. + +.B ibnd_debug() +Set the debug level to be printed as library operations take place. + +.B ibnd_show_progress() +Indicate that the library should print debug output which shows its progress +through the fabric. + +.SH "RETURN VALUE" +.B ibnd_discover_fabric() +Return NULL on failure, otherwise a valid ibnd_fabric_t object. + +.B ibnd_destroy_fabric(), ibnd_debug(), ibnd_show_progress() +NONE + +.SH "EXAMPLES" + +.B Discover the entire fabric connected to device "mthca0", port 1. + + ibnd_discover_fabric("mthca0", 1, 100, NULL, 0); + +.B Discover only a single node and those nodes connected to it. 
+ + str2drpath(&(port_id.drpath), from, 0, 0); + + ibnd_discover_fabric("mthca0", 1, 100, &port_id, 1); + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 b/infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 new file mode 100644 index 0000000..612e501 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 @@ -0,0 +1,2 @@ +.\".TH IBND_FIND_NODE_DR 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_find_node_guid.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 b/infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 new file mode 100644 index 0000000..676b528 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 @@ -0,0 +1,25 @@ +.TH IBND_FIND_NODE_GUID 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_find_node_guid, ibnd_find_node_dr \- given a fabric object find the node object within it which matches the guid or directed route specified. + +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid)" +.BI "ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str)" + +.SH "DESCRIPTION" +.B ibnd_find_node_guid() +Given a fabric object and a guid, return the ibnd_node_t object with that node guid. +.B ibnd_find_node_dr() +Given a fabric object and a directed route, return the ibnd_node_t object with +that directed route. + +.SH "RETURN VALUE" +.B ibnd_find_node_guid(), ibnd_find_node_dr() +return NULL on failure, otherwise a valid ibnd_node_t object. 
+ +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 new file mode 100644 index 0000000..7199dfb --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 @@ -0,0 +1,24 @@ +.TH IBND_ITER_NODES 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_iter_nodes, ibnd_iter_nodes_type \- given a fabric object and a function, iterate over the nodes in the fabric. + +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "void ibnd_iter_nodes(ibnd_fabric_t *fabric, ibnd_iter_node_func_t func, void *user_data)" +.BI "void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, ibnd_iter_node_func_t func, ibnd_node_type_t node_type, void *user_data)" + +.SH "DESCRIPTION" +.B ibnd_iter_nodes() +Iterate through all the nodes in the fabric and call "func" on each of them. +.B ibnd_iter_nodes_type() +The same as ibnd_iter_nodes, except that the iteration is limited to the nodes of the specified type. + +.SH "RETURN VALUE" +.B ibnd_iter_nodes(), ibnd_iter_nodes_type() +NONE + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 new file mode 100644 index 0000000..878547c --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 @@ -0,0 +1,2 @@ +.\".TH IBND_ITER_NODES_TYPE 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_iter_nodes.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_linkspeed_str.3 b/infiniband-diags/libibnetdisc/man/ibnd_linkspeed_str.3 new file mode 100644 index 0000000..128cd3e --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_linkspeed_str.3 @@ -0,0 +1,2 @@ +.\".TH IBND_LINKSPEED_STR 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_linkwidth_str.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_linkstate_str.3 b/infiniband-diags/libibnetdisc/man/ibnd_linkstate_str.3 new file mode 100644 index 
0000000..2fa9189 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_linkstate_str.3 @@ -0,0 +1,2 @@ +.\".TH IBND_LINKSTATE_STR 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_linkwidth_str.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_linkwidth_str.3 b/infiniband-diags/libibnetdisc/man/ibnd_linkwidth_str.3 new file mode 100644 index 0000000..2cd4f0a --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_linkwidth_str.3 @@ -0,0 +1,26 @@ +.TH IBND_LINKWIDTH_STR 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_linkwidth_str, ibnd_linkspeed_str, ibnd_linkstate_str, ibnd_physstate_str, ibnd_node_type_str \- pretty string functions. + +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI +.BI "char *ibnd_linkwidth_str(int link_width)" +.BI "char *ibnd_linkspeed_str(int link_speed, int data_rate)" +.BI "char *ibnd_linkstate_str(int link_state)" +.BI "char *ibnd_physstate_str(int phys_state)" +.BI "const char *ibnd_node_type_str(ibnd_node_t *node)" +.BI "const char *ibnd_node_type_str_short(ibnd_node_t *node)" + +.SH "DESCRIPTION" +Return user readable strings for the values given. + +.BI "const char *ibnd_node_type_str_short(ibnd_node_t *node)" +Returns a shorter abbreviated version of the string. 
+ + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_node_type_str.3 b/infiniband-diags/libibnetdisc/man/ibnd_node_type_str.3 new file mode 100644 index 0000000..77dbf07 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_node_type_str.3 @@ -0,0 +1,2 @@ +.\".TH IBND_NODE_TYPE_STR 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_linkwidth_str.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_node_type_str_short.3 b/infiniband-diags/libibnetdisc/man/ibnd_node_type_str_short.3 new file mode 100644 index 0000000..62feb6e --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_node_type_str_short.3 @@ -0,0 +1,2 @@ +.\".TH IBND_NODE_TYPE_STR_SHORT 3 "Aug 05, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_linkwidth_str.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_physstate_str.3 b/infiniband-diags/libibnetdisc/man/ibnd_physstate_str.3 new file mode 100644 index 0000000..aeeaeb7 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_physstate_str.3 @@ -0,0 +1,2 @@ +.\".TH IBND_PHYSSTATE_STR 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_linkwidth_str.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 b/infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 new file mode 100644 index 0000000..280af31 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 @@ -0,0 +1,2 @@ +.\".TH IBND_SHOW_PROGRESS 3 "Nov 26, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_update_node.3 b/infiniband-diags/libibnetdisc/man/ibnd_update_node.3 new file mode 100644 index 0000000..d3aa206 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_update_node.3 @@ -0,0 +1,21 @@ +.TH IBND_UPDATE_NODE 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_update_node \- Update the node specified with new data from the fabric. 
+ +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "ibnd_node_t *ibnd_update_node(ibnd_node_t *node)" + +.SH "DESCRIPTION" +.B ibnd_update_node() +Update the node info, port info, and node description of the node specified. + +.SH "RETURN VALUE" +.B ibnd_update_node() +Return NULL on failure, otherwise a valid ibnd_node_t object which is part of the fabric object. + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c new file mode 100644 index 0000000..c43e57e --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -0,0 +1,817 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/*========================================================*/ +/* FABRIC SCANNER SPECIFIC DATA */ +/*========================================================*/ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include +#include + +#include + +#include "internal.h" +#include "chassis.h" + +static char *ChassisTypeStr[5] = { "", "ISR9288", "ISR9096", "ISR2012", "ISR2004" }; +static char *ChassisSlotTypeStr[4] = { "", "Line", "Spine", "SRBD" }; + +char *ibnd_get_chassis_type(ibnd_node_t *node) +{ + /* Currently, only if Voltaire chassis */ + if (node->info.vendid != VTR_VENDOR_ID) + return (NULL); + if (!node->chassis) + return (NULL); + if (node->ch_type == UNRESOLVED_CT + || node->ch_type > ISR2004_CT) + return (NULL); + return ChassisTypeStr[node->ch_type]; +} + +char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size) +{ + /* Currently, only if Voltaire chassis */ + if (node->info.vendid != VTR_VENDOR_ID) + return (NULL); + if (!node->chassis) + return (NULL); + if (node->ch_slot == UNRESOLVED_CS + || node->ch_slot > SRBD_CS) + return (NULL); + if (!str) + return (NULL); + snprintf(str, size, "%s %d Chip %d", + ChassisSlotTypeStr[node->ch_slot], + node->ch_slotnum, + node->ch_anafanum); + return (str); +} + +static ibnd_chassis_t *find_chassisnum(struct ibnd_fabric *fabric, unsigned char chassisnum) +{ + ibnd_chassis_t *current; + + for (current = fabric->first_chassis; current; current = current->next) { + if (current->chassisnum == chassisnum) + return current; + } + + return NULL; +} + +static uint64_t topspin_chassisguid(uint64_t guid) +{ + /* Byte 3 in system image GUID is chassis type, and */ + /* Byte 4 is location ID (slot) so just 
mask off byte 4 */ + return guid & 0xffffffff00ffffffULL; +} + +int ibnd_is_xsigo_guid(uint64_t guid) +{ + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) + return 1; + else + return 0; +} + +static int is_xsigo_leafone(uint64_t guid) +{ + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) + return 1; + else + return 0; +} + +int ibnd_is_xsigo_hca(uint64_t guid) +{ + /* NodeType 2 is HCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) + return 1; + else + return 0; +} + +int ibnd_is_xsigo_tca(uint64_t guid) +{ + /* NodeType 3 is TCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) + return 1; + else + return 0; +} + +static int is_xsigo_ca(uint64_t guid) +{ + if (ibnd_is_xsigo_hca(guid) || ibnd_is_xsigo_tca(guid)) + return 1; + else + return 0; +} + +static int is_xsigo_switch(uint64_t guid) +{ + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) + return 1; + else + return 0; +} + +static uint64_t xsigo_chassisguid(ibnd_node_t *node) +{ + if (!is_xsigo_ca(node->info.sysimgguid)) { + /* Byte 3 is NodeType and byte 4 is PortType */ + /* If NodeType is 1 (switch), PortType is masked */ + if (is_xsigo_switch(node->info.sysimgguid)) + return node->info.sysimgguid & 0xffffffff00ffffffULL; + else + return node->info.sysimgguid; + } else { + if (!node->ports || !node->ports[1]) + return (0); + + /* Is there a peer port ? 
*/ + if (!node->ports[1]->remoteport) + return node->info.sysimgguid; + + /* If peer port is Leaf 1, use its chassis GUID */ + if (is_xsigo_leafone(node->ports[1]->remoteport->node->info.sysimgguid)) + return node->ports[1]->remoteport->node->info.sysimgguid & + 0xffffffff00ffffffULL; + else + return node->info.sysimgguid; + } +} + +static uint64_t get_chassisguid(ibnd_node_t *node) +{ + if (node->info.vendid == TS_VENDOR_ID || node->info.vendid == SS_VENDOR_ID) + return topspin_chassisguid(node->info.sysimgguid); + else if (node->info.vendid == XS_VENDOR_ID || ibnd_is_xsigo_guid(node->info.sysimgguid)) + return xsigo_chassisguid(node); + else + return node->info.sysimgguid; +} + +static ibnd_chassis_t *find_chassisguid(ibnd_node_t *node) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); + ibnd_chassis_t *current; + uint64_t chguid; + + chguid = get_chassisguid(node); + for (current = f->first_chassis; current; current = current->next) { + if (current->chassisguid == chguid) + return current; + } + + return NULL; +} + +uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + ibnd_chassis_t *chassis; + + chassis = find_chassisnum(f, chassisnum); + if (chassis) + return chassis->chassisguid; + else + return 0; +} + +static int is_router(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_IB_FC_ROUTER || + n->node.info.devid == VTR_DEVID_IB_IP_ROUTER); +} + +static int is_spine_9096(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_SFB4 || + n->node.info.devid == VTR_DEVID_SFB4_DDR); +} + +static int is_spine_9288(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_SFB12 || + n->node.info.devid == VTR_DEVID_SFB12_DDR); +} + +static int is_spine_2004(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_SFB2004); +} + +static int is_spine_2012(struct ibnd_node *n) +{ + return (n->node.info.devid == 
VTR_DEVID_SFB2012); +} + +static int is_spine(struct ibnd_node *n) +{ + return (is_spine_9096(n) || is_spine_9288(n) || + is_spine_2004(n) || is_spine_2012(n)); +} + +static int is_line_24(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_SLB24 || + n->node.info.devid == VTR_DEVID_SLB24_DDR || + n->node.info.devid == VTR_DEVID_SRB2004); +} + +static int is_line_8(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_SLB8); +} + +static int is_line_2024(struct ibnd_node *n) +{ + return (n->node.info.devid == VTR_DEVID_SLB2024); +} + +static int is_line(struct ibnd_node *n) +{ + return (is_line_24(n) || is_line_8(n) || is_line_2024(n)); +} + +int is_chassis_switch(struct ibnd_node *n) +{ + return (is_spine(n) || is_line(n)); +} + +/* these structs help find Line (Anafa) slot number while using spine portnum */ +int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; +int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; +int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; +int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; + +/* IPR FCR modules connectivity while using sFB4 port as reference */ +int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; + +/* these structs help find Spine (Anafa) slot number while using spine portnum */ +int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0 }; +/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ + +static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport) +{ + ibnd_node_t *n = (ibnd_node_t *)node; + + n->ch_slot = SPINE_CS; + if (is_spine_9096(node)) { + n->ch_type = ISR9096_CT; + n->ch_slotnum = spine4_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; + } else if (is_spine_9288(node)) { + n->ch_type = ISR9288_CT; + n->ch_slotnum = spine12_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; + } else if (is_spine_2012(node)) { + n->ch_type = ISR2012_CT; + n->ch_slotnum = spine12_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; + } else if (is_spine_2004(node)) { + n->ch_type = ISR2004_CT; + n->ch_slotnum = spine4_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; + } else { + IBPANIC("Unexpected node found: guid 0x%016" PRIx64, + node->node.info.nodeguid); + } +} + +static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport) +{ + ibnd_node_t *n = (ibnd_node_t *)node; + int guessnum = 0; + + node->ch_found = 1; + + n->ch_slot = SRBD_CS; + if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9096_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; + } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9288_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + /* this is a smart guess based on nodeguids order on sFB-12 module */ + guessnum = spineport->node->info.nodeguid % 4; + /* module 1 <--> remote anafa 3 */ + /* module 2 <--> remote anafa 2 */ + /* module 3 <--> remote anafa 1 */ + n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 
3 : 2)); + } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2012_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + /* this is a smart guess based on nodeguids order on sFB-12 module */ + guessnum = spineport->node->info.nodeguid % 4; + // module 1 <--> remote anafa 3 + // module 2 <--> remote anafa 2 + // module 3 <--> remote anafa 1 + n->ch_anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 3 : 2)); + } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2004_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; + } else { + IBPANIC("Unexpected node found: guid 0x%016" PRIx64, + spineport->node->info.nodeguid); + } +} + +static void get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport) +{ + n->ch_slot = LINE_CS; + if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9096_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; + } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9288_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; + } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2012_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; + } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2004_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; + } else { + IBPANIC("Unexpected node found: guid 0x%016" PRIx64, + spineport->node->info.nodeguid); + } +} + +/* forward declare this */ +static void voltaire_portmap(ibnd_port_t *port); +/* + This function called for every Voltaire node in fabric + It could be optimized so, but time 
overhead is very small + and it's only a diag util +*/ +static void fill_voltaire_chassis_record(struct ibnd_node *node) +{ + ibnd_node_t *n = (ibnd_node_t *)node; + int p = 0; + ibnd_port_t *port; + struct ibnd_node *remnode = 0; + + if (node->ch_found) /* somehow this node has already been passed */ + return; + node->ch_found = 1; + + /* node is router only in case of using unique lid */ + /* (which is lid of chassis router port) */ + /* in such case node->ports is actually a requested port... */ + if (is_router(node)) { + /* find the remote node; skip ports that have no link partner */ + for (p = 1; p <= node->node.info.numports; p++) { + port = node->node.ports[p]; + if (port && port->remoteport && is_spine(CONV_NODE_INTERNAL(port->remoteport->node))) + get_router_slot(node, port->remoteport); + } + } else if (is_spine(node)) { + for (p = 1; p <= node->node.info.numports; p++) { + port = node->node.ports[p]; + if (!port || !port->remoteport) + continue; + remnode = CONV_NODE_INTERNAL(port->remoteport->node); + if (remnode->node.info.type != IBND_SWITCH_NODE) { + if (!remnode->ch_found) + get_router_slot(remnode, port); + continue; + } + if (!n->ch_type) + /* we assume here that remoteport belongs to line */ + get_sfb_slot(node, port->remoteport); + + /* we could break here, but need to find if more routers connected */ + } + + } else if (is_line(node)) { + for (p = 1; p <= node->node.info.numports; p++) { + port = node->node.ports[p]; + if (!port || port->portnum > 12 || !port->remoteport) + continue; + /* we assume here that remoteport belongs to spine */ + get_slb_slot(n, port->remoteport); + break; + } + } + + /* for each port of this node, map external ports */ + for (p = 1; p <= node->node.info.numports; p++) { + port = node->node.ports[p]; + if (!port) + continue; + voltaire_portmap(port); + } + + return; +} + +static int get_line_index(ibnd_node_t *node) +{ + int retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum; + + if (retval > LINES_MAX_NUM || retval < 1) + IBPANIC("Internal error"); + return retval; +} + 
+static int get_spine_index(ibnd_node_t *node) +{ + int retval; + + if (is_spine_9288(CONV_NODE_INTERNAL(node)) || is_spine_2012(CONV_NODE_INTERNAL(node))) + retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum; + else + retval = node->ch_slotnum; + + if (retval > SPINES_MAX_NUM || retval < 1) + IBPANIC("Internal error"); + return retval; +} + +static void insert_line_router(ibnd_node_t *node, ibnd_chassis_t *chassis) +{ + int i = get_line_index(node); + + if (chassis->linenode[i]) + return; /* already filled slot */ + + chassis->linenode[i] = node; + node->chassis = chassis; +} + +static void insert_spine(ibnd_node_t *node, ibnd_chassis_t *chassis) +{ + int i = get_spine_index(node); + + if (chassis->spinenode[i]) + return; /* already filled slot */ + + chassis->spinenode[i] = node; + node->chassis = chassis; +} + +static void pass_on_lines_catch_spines(ibnd_chassis_t *chassis) +{ + ibnd_node_t *node, *remnode; + ibnd_port_t *port; + int i, p; + + for (i = 1; i <= LINES_MAX_NUM; i++) { + node = chassis->linenode[i]; + + if (!(node && is_line(CONV_NODE_INTERNAL(node)))) + continue; /* empty slot or router */ + + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (!port || port->portnum > 12 || !port->remoteport) + continue; + + remnode = port->remoteport->node; + + if (!CONV_NODE_INTERNAL(remnode)->ch_found) + continue; /* some error - spine not initialized ? FIXME */ + insert_spine(remnode, chassis); + } + } +} + +static void pass_on_spines_catch_lines(ibnd_chassis_t *chassis) +{ + ibnd_node_t *node, *remnode; + ibnd_port_t *port; + int i, p; + + for (i = 1; i <= SPINES_MAX_NUM; i++) { + node = chassis->spinenode[i]; + if (!node) + continue; /* empty slot */ + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (!port || !port->remoteport) + continue; + remnode = port->remoteport->node; + + if (!CONV_NODE_INTERNAL(remnode)->ch_found) + continue; /* some error - line/router not initialized ? 
FIXME */ + insert_line_router(remnode, chassis); + } + } +} + +/* + Stupid interpolation algorithm... + But nothing to do - have to be compliant with VoltaireSM/NMS +*/ +static void pass_on_spines_interpolate_chguid(ibnd_chassis_t *chassis) +{ + ibnd_node_t *node; + int i; + + for (i = 1; i <= SPINES_MAX_NUM; i++) { + node = chassis->spinenode[i]; + if (!node) + continue; /* skip the empty slots */ + + /* take first guid minus one to be consistent with SM */ + chassis->chassisguid = node->info.nodeguid - 1; + break; + } +} + +/* + This function fills chassis structure with all nodes + in that chassis + chassis structure = structure of one standalone chassis +*/ +static void build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis) +{ + int p = 0; + struct ibnd_node *remnode = 0; + ibnd_port_t *port = 0; + + /* we get here with node = chassis_spine */ + insert_spine((ibnd_node_t *)node, chassis); + + /* loop: pass on all ports of node */ + for (p = 1; p <= node->node.info.numports; p++ ) { + port = node->node.ports[p]; + if (!port || !port->remoteport) + continue; + remnode = CONV_NODE_INTERNAL(port->remoteport->node); + + if (!remnode->ch_found) + continue; /* some error - line or router not initialized ? 
FIXME */ + + insert_line_router(&(remnode->node), chassis); + } + + pass_on_lines_catch_spines(chassis); + /* this pass is needed to catch routers, since routers are connected only */ + /* to spines in slot 1 or 4 and we could miss them the first time */ + pass_on_spines_catch_lines(chassis); + + /* 2 additional passes are needed to overcome the problem of pure "in-chassis" */ + /* connectivity - the extra pass ensures that all related chips/modules */ + /* are inserted into the chassis */ + pass_on_lines_catch_spines(chassis); + pass_on_spines_catch_lines(chassis); + pass_on_spines_interpolate_chguid(chassis); +} + +/*========================================================*/ +/* INTERNAL TO EXTERNAL PORT MAPPING */ +/*========================================================*/ + +/* +Description : On ISR9288/9096 the external port indexing + does not match the internal (anafa) port + indexes. Use this MAP to translate the data you get from + the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.) + + +Module : sLB-24 + anafa 1 anafa 2 +ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 +int port | 22 23 24 18 17 16 | 22 23 24 18 17 16 +ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 +int port | 19 20 21 15 14 13 | 19 20 21 15 14 13 +------------------------------------------------ + +Module : sLB-8 + anafa 1 anafa 2 +ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 +int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 +ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 +int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 + +-----------> + anafa 1 anafa 2 +ext port | - - 5 - - 6 | - - 7 - - 8 +int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 +ext port | - - 1 - - 2 | - - 3 - - 4 +int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 +------------------------------------------------ + +Module : sLB-2024 + +ext port | 13 14 15 16 17 18 19 20 21 22 23 24 +A1 int port| 13 14 15 16 17 18 19 20 21 22 23 24 +ext port | 1 2 3 4 5 6 7 8 9 10 11 12 +A2 int port| 13 14 15 16 17 18 19 20 21 22 23 24 
+--------------------------------------------------- + +*/ + +int int2ext_map_slb24[2][25] = { + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 }, + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 } + }; +int int2ext_map_slb8[2][25] = { + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 6, 6, 6, 1, 1, 1, 5, 5, 5 }, + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 8, 8, 8, 3, 3, 3, 7, 7, 7 } + }; +int int2ext_map_slb2024[2][25] = { + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }, + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } + }; +/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ + +/* map internal ports to external ports if appropriate */ +static void +voltaire_portmap(ibnd_port_t *port) +{ + struct ibnd_node *n = CONV_NODE_INTERNAL(port->node); + int portnum = port->portnum; + int chipnum = 0; + ibnd_node_t *node = port->node; + + if (!n->ch_found || !is_line(CONV_NODE_INTERNAL(node)) || (portnum < 13 || portnum > 24)) { + port->ext_portnum = 0; + return; + } + + if (port->node->ch_anafanum < 1 || port->node->ch_anafanum > 2) { + port->ext_portnum = 0; + return; + } + + chipnum = port->node->ch_anafanum - 1; + + if (is_line_24(CONV_NODE_INTERNAL(node))) + port->ext_portnum = int2ext_map_slb24[chipnum][portnum]; + else if (is_line_2024(CONV_NODE_INTERNAL(node))) + port->ext_portnum = int2ext_map_slb2024[chipnum][portnum]; + else + port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; +} + +static void add_chassis(struct ibnd_fabric *fabric) +{ + if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) + IBPANIC("out of mem"); + + if (fabric->first_chassis == NULL) { + fabric->first_chassis = fabric->current_chassis; + fabric->last_chassis = fabric->current_chassis; + } else { + fabric->last_chassis->next = fabric->current_chassis; + 
fabric->last_chassis = fabric->current_chassis; + } +} + +static void +add_node_to_chassis(ibnd_chassis_t *chassis, ibnd_node_t *node) +{ + node->chassis = chassis; + node->next_chassis_node = chassis->nodes; + chassis->nodes = node; +} + +/* + Main grouping function + Algorithm: + 1. pass on every Voltaire node + 2. catch spine chip for every Voltaire node + 2.1 build/interpolate chassis around this chip + 2.2 go to 1. + 3. pass on non Voltaire nodes (SystemImageGUID based grouping) + 4. now group non Voltaire nodes by SystemImageGUID + Returns: + Pointer to the first chassis in a NULL terminated list of chassis in + the fabric specified. +*/ +ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric) +{ + struct ibnd_node *node; + int dist; + int chassisnum = 0; + ibnd_chassis_t *chassis; + + fabric->first_chassis = NULL; + fabric->current_chassis = NULL; + + /* first pass on switches and build for every Voltaire node */ + /* an appropriate chassis record (slotnum and position) */ + /* according to internal connectivity */ + /* not very efficient but clear code so... 
*/ + for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (node->node.info.vendid == VTR_VENDOR_ID) + fill_voltaire_chassis_record(node); + } + } + + /* separate every Voltaire chassis from each other and build linked list of them */ + /* algorithm: catch spine and find all surrounding nodes */ + for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (node->node.info.vendid != VTR_VENDOR_ID) + continue; + //if (!node->node.chrecord || node->node.chrecord->chassisnum || !is_spine(node)) + if (!node->ch_found + || (node->node.chassis && node->node.chassis->chassisnum) + || !is_spine(node)) + continue; + add_chassis(fabric); + fabric->current_chassis->chassisnum = ++chassisnum; + build_chassis(node, fabric->current_chassis); + } + } + + /* now make pass on nodes for chassis which are not Voltaire */ + /* grouped by common SystemImageGUID */ + for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (node->node.info.vendid == VTR_VENDOR_ID) + continue; + if (node->node.info.sysimgguid) { + chassis = find_chassisguid((ibnd_node_t *)node); + if (chassis) + chassis->nodecount++; + else { + /* Possible new chassis */ + add_chassis(fabric); + fabric->current_chassis->chassisguid = + get_chassisguid((ibnd_node_t *)node); + fabric->current_chassis->nodecount = 1; + } + } + } + } + + /* now, make another pass to see which nodes are part of chassis */ + /* (defined as chassis->nodecount > 1) */ + for (dist = 0; dist <= MAXHOPS; ) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (node->node.info.vendid == VTR_VENDOR_ID) + continue; + if (node->node.info.sysimgguid) { + chassis = find_chassisguid((ibnd_node_t *)node); + if (chassis && chassis->nodecount > 1) { + if (!chassis->chassisnum) + 
chassis->chassisnum = ++chassisnum; + if (!node->ch_found) { + node->ch_found = 1; + add_node_to_chassis(chassis, (ibnd_node_t *)node); + } + } + } + } + if (dist == fabric->fabric.maxhops_discovered) + dist = MAXHOPS; /* skip to CAs */ + else + dist++; + } + + return (fabric->first_chassis); +} diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h new file mode 100644 index 0000000..16dad49 --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/chassis.h @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#ifndef _CHASSIS_H_ +#define _CHASSIS_H_ + +#include + +#include "internal.h" + +/*========================================================*/ +/* CHASSIS RECOGNITION SPECIFIC DATA */ +/*========================================================*/ + +/* Device IDs */ +#define VTR_DEVID_IB_FC_ROUTER 0x5a00 +#define VTR_DEVID_IB_IP_ROUTER 0x5a01 +#define VTR_DEVID_ISR9600_SPINE 0x5a02 +#define VTR_DEVID_ISR9600_LEAF 0x5a03 +#define VTR_DEVID_HCA1 0x5a04 +#define VTR_DEVID_HCA2 0x5a44 +#define VTR_DEVID_HCA3 0x6278 +#define VTR_DEVID_SW_6IB4 0x5a05 +#define VTR_DEVID_ISR9024 0x5a06 +#define VTR_DEVID_ISR9288 0x5a07 +#define VTR_DEVID_SLB24 0x5a09 +#define VTR_DEVID_SFB12 0x5a08 +#define VTR_DEVID_SFB4 0x5a0b +#define VTR_DEVID_ISR9024_12 0x5a0c +#define VTR_DEVID_SLB8 0x5a0d +#define VTR_DEVID_RLX_SWITCH_BLADE 0x5a20 +#define VTR_DEVID_ISR9024_DDR 0x5a31 +#define VTR_DEVID_SFB12_DDR 0x5a32 +#define VTR_DEVID_SFB4_DDR 0x5a33 +#define VTR_DEVID_SLB24_DDR 0x5a34 +#define VTR_DEVID_SFB2012 0x5a37 +#define VTR_DEVID_SLB2024 0x5a38 +#define VTR_DEVID_ISR2012 0x5a39 +#define VTR_DEVID_SFB2004 0x5a40 +#define VTR_DEVID_ISR2004 0x5a41 +#define VTR_DEVID_SRB2004 0x5a42 + +/* Vendor IDs (for chassis based systems) */ +#define VTR_VENDOR_ID 0x8f1 /* Voltaire */ +#define TS_VENDOR_ID 0x5ad /* Cisco */ +#define SS_VENDOR_ID 0x66a /* InfiniCon */ +#define XS_VENDOR_ID 0x1397 /* Xsigo */ + +enum ibnd_chassis_type { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT }; +enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; + +ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric); + +#endif /* _CHASSIS_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c new file mode 100644 index 0000000..29f691c --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -0,0 +1,871 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. 
+ * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Laboratory + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include + +#include "internal.h" +#include "chassis.h" + +static int timeout_ms = 2000; +static int show_progress = 0; + +static char *linkwidth_str[] = { + "??", + "1x", + "4x", + "??", + "8x", + "??", + "??", + "??", + "12x" +}; + +static char *linkspeed_str[] = { + "???", + "SDR", + "DDR", + "???", + "QDR" +}; + +static char *linkspeed_datarate_str[] = { + "???", + "2.5 Gbps", + "5.0 Gbps", + "???", + "10.0 Gbps" +}; + +static char *linkstate_str[] = { + "No State", + "Down", + "Init", + "Armed", + "Active" +}; + +static char *physstate_str[] = { + "No State", + "Sleep", + "Polling", + "Disabled", + "PortConfigTraining", + "LinkUp", + "LinkErrorRecovery", + "Phy Test" +}; + +char * +ibnd_linkwidth_str(int link_width) +{ + if (link_width > 8) + return linkwidth_str[0]; + else + return linkwidth_str[link_width]; +} + +char * +ibnd_linkspeed_str(int link_speed, int data_rate) +{ + if (link_speed > 4) + return linkspeed_str[0]; + else if (data_rate) + return linkspeed_datarate_str[link_speed]; + else + return linkspeed_str[link_speed]; +} +char * +ibnd_linkstate_str(int link_state) +{ + if (link_state > 4) + return linkstate_str[0]; + else + return linkstate_str[link_state]; +} + +char * +ibnd_physstate_str(int phys_state) +{ + if (phys_state > 7) + return physstate_str[0]; + else + return physstate_str[phys_state]; +} + +void +decode_port_info(void * rcv_buf, ibnd_port_info_t *pi) +{ + mad_decode_field(rcv_buf, IB_PORT_LID_F, &pi->base_lid); + mad_decode_field(rcv_buf, IB_PORT_SMLID_F, &pi->smlid); + + mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_SUPPORTED_F, &pi->link_speed_supported); + mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_ENABLED_F, &pi->link_speed_enabled); + mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_ACTIVE_F, &pi->link_speed_active); + 
+ mad_decode_field(rcv_buf, IB_PORT_LOCAL_PORT_F, &pi->local_port); + mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_SUPPORTED_F, &pi->link_width_supported); + mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_ENABLED_F, &pi->link_width_enabled); + + mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_ACTIVE_F, &pi->link_width_active); + + mad_decode_field(rcv_buf, IB_PORT_DIAG_F, &pi->diag_code); + mad_decode_field(rcv_buf, IB_PORT_MKEY_LEASE_F, &pi->mkey_lease); + mad_decode_field(rcv_buf, IB_PORT_CAPMASK_F, &pi->capability_mask); + mad_decode_field(rcv_buf, IB_PORT_MKEY_F, &pi->mkey); + mad_decode_field(rcv_buf, IB_PORT_GID_PREFIX_F, &pi->gid_prefix); + + mad_decode_field(rcv_buf, IB_PORT_STATE_F, &pi->port_state); + mad_decode_field(rcv_buf, IB_PORT_PHYS_STATE_F, &pi->phys_state); + + mad_decode_field(rcv_buf, IB_PORT_LINK_DOWN_DEF_F, &pi->link_down_def_state); + mad_decode_field(rcv_buf, IB_PORT_MKEY_PROT_BITS_F, &pi->mkey_prot_bits); + + mad_decode_field(rcv_buf, IB_PORT_LMC_F, &pi->lmc); + mad_decode_field(rcv_buf, IB_PORT_NEIGHBOR_MTU_F, &pi->neighbor_mtu); + mad_decode_field(rcv_buf, IB_PORT_SMSL_F, &pi->smsl); + mad_decode_field(rcv_buf, IB_PORT_INIT_TYPE_F, &pi->init_type); + + mad_decode_field(rcv_buf, IB_PORT_VL_CAP_F, &pi->vl_capability); + mad_decode_field(rcv_buf, IB_PORT_VL_HIGH_LIMIT_F, &pi->vl_high_limit); + mad_decode_field(rcv_buf, IB_PORT_VL_ARBITRATION_HIGH_CAP_F, &pi->vl_arb_high_cap); + mad_decode_field(rcv_buf, IB_PORT_VL_ARBITRATION_LOW_CAP_F, &pi->vl_arb_low_cap); + + mad_decode_field(rcv_buf, IB_PORT_INIT_TYPE_REPLY_F, &pi->init_reply); + mad_decode_field(rcv_buf, IB_PORT_MTU_CAP_F, &pi->mtu_cap); + mad_decode_field(rcv_buf, IB_PORT_VL_STALL_COUNT_F, &pi->vl_stall_count); + mad_decode_field(rcv_buf, IB_PORT_HOQ_LIFE_F, &pi->hoq_lifetime); + mad_decode_field(rcv_buf, IB_PORT_OPER_VLS_F, &pi->oper_vls); + mad_decode_field(rcv_buf, IB_PORT_PART_EN_INB_F, &pi->partition_enforce_in); + mad_decode_field(rcv_buf, IB_PORT_PART_EN_OUTB_F, 
&pi->partition_enforce_out); + mad_decode_field(rcv_buf, IB_PORT_FILTER_RAW_INB_F, &pi->filter_raw_in); + mad_decode_field(rcv_buf, IB_PORT_FILTER_RAW_OUTB_F, &pi->filter_raw_out); + mad_decode_field(rcv_buf, IB_PORT_MKEY_VIOL_F, &pi->mkey_violations); + mad_decode_field(rcv_buf, IB_PORT_PKEY_VIOL_F, &pi->pkey_violations); + mad_decode_field(rcv_buf, IB_PORT_QKEY_VIOL_F, &pi->qkey_violations); + + mad_decode_field(rcv_buf, IB_PORT_GUID_CAP_F, &pi->guid_capabilities); + + mad_decode_field(rcv_buf, IB_PORT_CLIENT_REREG_F, &pi->client_rereg); + mad_decode_field(rcv_buf, IB_PORT_SUBN_TIMEOUT_F, &pi->subnet_timeout); + mad_decode_field(rcv_buf, IB_PORT_RESP_TIME_VAL_F, &pi->response_time_val); + mad_decode_field(rcv_buf, IB_PORT_LOCAL_PHYS_ERR_F, &pi->local_phys_error); + mad_decode_field(rcv_buf, IB_PORT_OVERRUN_ERR_F, &pi->overrun_error); + mad_decode_field(rcv_buf, IB_PORT_MAX_CREDIT_HINT_F, &pi->max_credit_hint); + mad_decode_field(rcv_buf, IB_PORT_LINK_ROUND_TRIP_F, &pi->link_round_trip); +} + +static int +get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port, + int portnum, ib_portid_t *portid) +{ + char portinfo[64]; + void *pi = portinfo; + + port->port.portnum = portnum; + + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms, + fabric->ibmad_port)) + return -1; + + decode_port_info(pi, &port->port.info); + + IBND_DEBUG("portid %s portnum %d: base lid %d state %d physstate %d %s %s\n", + portid2str(portid), portnum, port->port.info.base_lid, port->port.info.port_state, + port->port.info.phys_state, ibnd_linkwidth_str(port->port.info.link_width_active), + ibnd_linkspeed_str(port->port.info.link_speed_active, 0)); + return 1; +} + +static void +decode_node_info(void * rcv_buf, ibnd_node_info_t *ni) +{ + mad_decode_field(rcv_buf, IB_NODE_BASE_VERS_F, &ni->base_ver); + mad_decode_field(rcv_buf, IB_NODE_CLASS_VERS_F, &ni->class_ver); + mad_decode_field(rcv_buf, IB_NODE_TYPE_F, &ni->type); + mad_decode_field(rcv_buf, IB_NODE_NPORTS_F, 
&ni->numports); + mad_decode_field(rcv_buf, IB_NODE_SYSTEM_GUID_F, &ni->sysimgguid); + mad_decode_field(rcv_buf, IB_NODE_GUID_F, &ni->nodeguid); + mad_decode_field(rcv_buf, IB_NODE_PORT_GUID_F, &ni->nodeportguid); + mad_decode_field(rcv_buf, IB_NODE_PARTITION_CAP_F, &ni->partition_cap); + mad_decode_field(rcv_buf, IB_NODE_DEVID_F, &ni->devid); + mad_decode_field(rcv_buf, IB_NODE_REVISION_F, &ni->revision); + mad_decode_field(rcv_buf, IB_NODE_LOCAL_PORT_F, &ni->localport); + mad_decode_field(rcv_buf, IB_NODE_VENDORID_F, &ni->vendid); +} + +/* + * Returns -1 if error. + */ +static int +query_node_info(struct ibnd_fabric *fabric, struct ibnd_node *node, ib_portid_t *portid) +{ + char nodeinfo[64]; + void *ni = nodeinfo; + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout_ms, + fabric->ibmad_port)) + return -1; + decode_node_info(ni, &(node->node.info)); + return (0); +} + +/* + * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. + */ +static int +query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, + struct ibnd_port *iport, ib_portid_t *portid) +{ + char portinfo[64]; + void *pi = portinfo; + char switchinfo[64]; + void *si = switchinfo; + ibnd_node_t *node = &(inode->node); + ibnd_port_t *port = &(iport->port); + void *nd = inode->node.nodedesc; + + if (query_node_info(fabric, inode, portid)) + return -1; + + port->portnum = node->info.localport; + port->guid = node->info.nodeportguid; + + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms, + fabric->ibmad_port)) + return -1; + + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout_ms, + fabric->ibmad_port)) + return -1; + decode_port_info(pi, &port->info); + + if (node->info.type != IBND_SWITCH_NODE) + return 0; + + node->smalid = port->info.base_lid; + node->smalmc = port->info.lmc; + + /* after we have the sma information find out the real PortInfo for this port */ + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->info.localport, 
timeout_ms, + fabric->ibmad_port)) + return -1; + decode_port_info(pi, &port->info); + + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, + fabric->ibmad_port)) + node->sw_info.smaenhsp0 = 0; /* assume base SP0 */ + else + mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->sw_info.smaenhsp0); + + IBND_DEBUG("portid %s: got switch node %" PRIx64 " '%s'\n", + portid2str(portid), node->info.nodeguid, node->nodedesc); + return 1; +} + +static int +add_port_to_dpath(ib_dr_path_t *path, int nextport) +{ + if (path->cnt+2 >= sizeof(path->p)) + return -1; + ++path->cnt; + path->p[path->cnt] = nextport; + return path->cnt; +} + +static int +extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) +{ + int rc = add_port_to_dpath(path, nextport); + if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) + f->fabric.maxhops_discovered = path->cnt; + return (rc); +} + +static void +dump_endnode(ib_portid_t *path, char *prompt, + struct ibnd_node *node, struct ibnd_port *port) +{ + if (!show_progress) + return; + + printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", + portid2str(path), prompt, + ibnd_node_type_str((ibnd_node_t *)node), + node->node.info.nodeguid, + node->node.info.type == IBND_SWITCH_NODE ? 
0 : port->port.portnum, + port->port.info.base_lid, port->port.info.base_lid + (1 << port->port.info.lmc) - 1, + node->node.nodedesc); +} + +static struct ibnd_node * +find_existing_node(struct ibnd_fabric *fabric, struct ibnd_node *new) +{ + int hash = HASHGUID(new->node.info.nodeguid) % HTSZ; + struct ibnd_node *node; + + for (node = fabric->nodestbl[hash]; node; node = node->htnext) + if (node->node.info.nodeguid == new->node.info.nodeguid) + return node; + + return NULL; +} + +ibnd_node_t * +ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + int hash = HASHGUID(guid) % HTSZ; + struct ibnd_node *node; + + for (node = f->nodestbl[hash]; node; node = node->htnext) + if (node->node.info.nodeguid == guid) + return (ibnd_node_t *)node; + + return NULL; +} + +ibnd_node_t * +ibnd_update_node(ibnd_node_t *node) +{ + char portinfo[64]; + void *pi = portinfo; + ibnd_port_info_t port0_info; + char switchinfo[64]; + void *si = switchinfo; + void *nd = node->nodedesc; + int p = 0; + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); + struct ibnd_node *n = CONV_NODE_INTERNAL(node); + + if (query_node_info(f, n, &(n->node.path_portid))) + return (NULL); + + if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms, + f->ibmad_port)) + return (NULL); + + /* update the port info of every known port (ports are numbered 1..numports) */ + for (p = 1; p <= n->node.info.numports; p++) { + if (!n->node.ports[p]) + continue; /* port was never discovered */ + get_port_info(f, CONV_PORT_INTERNAL(n->node.ports[p]), p, &(n->node.path_portid)); + } + + if (n->node.info.type != IBND_SWITCH_NODE) + goto done; + + if (!smp_query_via(pi, &(n->node.path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms, + f->ibmad_port)) + return (NULL); + decode_port_info(pi, &port0_info); + + n->node.smalid = port0_info.base_lid; + n->node.smalmc = port0_info.lmc; + + if (!smp_query_via(si, &(n->node.path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms, + f->ibmad_port)) + node->sw_info.smaenhsp0 = 0; /* assume base SP0 */ + else + 
mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &n->node.sw_info.smaenhsp0); + +done: + return (node); +} + +ibnd_node_t * +ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + int i = 0; + ibnd_node_t *rc = f->fabric.from_node; + ib_dr_path_t path; + + if (str2drpath(&path, dr_str, 0, 0) == -1) { + return (NULL); + } + + for (i = 0; i <= path.cnt; i++) { + ibnd_port_t *remote_port = NULL; + if (path.p[i] == 0) + continue; + if (!rc->ports) + return (NULL); + + remote_port = rc->ports[path.p[i]]->remoteport; + if (!remote_port) + return (NULL); + + rc = remote_port->node; + } + + return (rc); +} + +static void +add_to_nodeguid_hash(struct ibnd_node *node, struct ibnd_node *hash[]) +{ + int hash_idx = HASHGUID(node->node.info.nodeguid) % HTSZ; + + node->htnext = hash[hash_idx]; + hash[hash_idx] = node; +} + +static void +add_to_portguid_hash(struct ibnd_port *port, struct ibnd_port *hash[]) +{ + int hash_idx = HASHGUID(port->port.guid) % HTSZ; + + port->htnext = hash[hash_idx]; + hash[hash_idx] = port; +} + +static void +add_to_type_list(struct ibnd_node*node, struct ibnd_fabric *fabric) +{ + switch (node->node.info.type) { + case IBND_CA_NODE: + node->type_next = fabric->ch_adapters; + fabric->ch_adapters = node; + break; + case IBND_SWITCH_NODE: + node->type_next = fabric->switches; + fabric->switches = node; + break; + case IBND_ROUTER_NODE: + node->type_next = fabric->routers; + fabric->routers = node; + break; + } +} + +static void +add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric) +{ + int dist = node->node.dist; + if (node->node.info.type != IBND_SWITCH_NODE) + dist = MAXHOPS; /* special Ca list */ + + node->dnext = fabric->nodesdist[dist]; + fabric->nodesdist[dist] = node; +} + + +static struct ibnd_node * +create_node(struct ibnd_fabric *fabric, struct ibnd_node *temp, ib_portid_t *path, int dist) +{ + struct ibnd_node *node; + + node = malloc(sizeof(*node)); + if (!node) { + 
IBPANIC("OOM: node creation failed\n"); + return NULL; + } + + memcpy(node, temp, sizeof(*node)); + node->node.dist = dist; + node->node.path_portid = *path; + node->node.fabric = (ibnd_fabric_t *)fabric; + + add_to_nodeguid_hash(node, fabric->nodestbl); + + /* add this to the all nodes list */ + node->node.next = fabric->fabric.nodes; + fabric->fabric.nodes = (ibnd_node_t *)node; + + add_to_type_list(node, fabric); + add_to_nodedist(node, fabric); + + return node; +} + +static struct ibnd_port * +find_existing_port_node(struct ibnd_node *node, struct ibnd_port *port) +{ + if (port->port.portnum > node->node.info.numports || node->node.ports == NULL ) + return (NULL); + + return (CONV_PORT_INTERNAL(node->node.ports[port->port.portnum])); +} + +static struct ibnd_port * +add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_port *temp) +{ + struct ibnd_port *port; + + port = malloc(sizeof(*port)); + if (!port) + return NULL; + + memcpy(port, temp, sizeof(*port)); + port->port.node = (ibnd_node_t *)node; + port->port.ext_portnum = 0; + + if (node->node.ports == NULL) { + node->node.ports = calloc(sizeof(*node->node.ports), node->node.info.numports + 1); + if (!node->node.ports) { + IBND_ERROR("Failed to allocate the ports array\n"); + return (NULL); + } + } + + node->node.ports[temp->port.portnum] = (ibnd_port_t *)port; + + add_to_portguid_hash(port, fabric->portstbl); + return port; +} + +static void +link_ports(struct ibnd_node *node, struct ibnd_port *port, + struct ibnd_node *remotenode, struct ibnd_port *remoteport) +{ + IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u\n", + node->node.info.nodeguid, node, port, port->port.portnum, + remotenode->node.info.nodeguid, remotenode, + remoteport, remoteport->port.portnum); + if (port->port.remoteport) + port->port.remoteport->remoteport = NULL; + if (remoteport->port.remoteport) + remoteport->port.remoteport->remoteport = NULL; + port->port.remoteport = (ibnd_port_t 
*)remoteport; + remoteport->port.remoteport = (ibnd_port_t *)port; +} + +static int +get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_port *port, ib_portid_t *path, + int portnum, int dist) +{ + struct ibnd_node node_buf; + struct ibnd_port port_buf; + struct ibnd_node *remotenode, *oldnode; + struct ibnd_port *remoteport, *oldport; + + memset(&node_buf, 0, sizeof(node_buf)); + memset(&port_buf, 0, sizeof(port_buf)); + + IBND_DEBUG("handle node %p port %p:%d dist %d\n", node, port, portnum, dist); + if (port->port.info.phys_state != 5) /* LinkUp */ + return -1; + + if (extend_dpath(fabric, &path->drpath, portnum) < 0) + return -1; + + if (query_node(fabric, &node_buf, &port_buf, path) < 0) { + IBWARN("NodeInfo on %s failed, skipping port", + portid2str(path)); + path->drpath.cnt--; /* restore path */ + return -1; + } + + oldnode = find_existing_node(fabric, &node_buf); + if (oldnode) + remotenode = oldnode; + else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) + IBPANIC("no memory"); + + oldport = find_existing_port_node(remotenode, &port_buf); + if (oldport) { + remoteport = oldport; + } else if (!(remoteport = add_port_to_node(fabric, remotenode, &port_buf))) + IBPANIC("no memory"); + + dump_endnode(path, oldnode ? 
"known remote" : "new remote", + remotenode, remoteport); + + link_ports(node, port, remotenode, remoteport); + + path->drpath.cnt--; /* restore path */ + return 0; +} + +static void * +ibnd_init_port(char *dev_name, int dev_port) +{ + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + + /* Crank up the mad lib */ + return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2)); +} + +ibnd_fabric_t * +ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, + ib_portid_t *from, int hops) +{ + struct ibnd_fabric *fabric = NULL; + ib_portid_t my_portid = {0}; + struct ibnd_node node_buf; + struct ibnd_port port_buf; + struct ibnd_node *node; + struct ibnd_port *port; + int i; + int dist = 0; + ib_portid_t *path; + int max_hops = MAXHOPS-1; /* default find everything */ + + /* if not everything how much? */ + if (hops >= 0) { + max_hops = hops; + } + + /* If not specified start from "my" port */ + if (!from) { + from = &my_portid; + } + + fabric = malloc(sizeof(*fabric)); + + if (!fabric) { + IBPANIC("OOM: failed to malloc ibnd_fabric_t\n"); + return (NULL); + } + + memset(fabric, 0, sizeof(*fabric)); + + fabric->ibmad_port = ibnd_init_port(dev_name, dev_port); + if (!fabric->ibmad_port) { + IBPANIC("OOM: failed to open \"%s\" port %d\n", + dev_name, dev_port); + goto error; + } + + IBND_DEBUG("from %s\n", portid2str(from)); + + memset(&node_buf, 0, sizeof(node_buf)); + memset(&port_buf, 0, sizeof(port_buf)); + + if (query_node(fabric, &node_buf, &port_buf, from) < 0) { + IBWARN("can't reach node %s\n", portid2str(from)); + goto error; + } + + node = create_node(fabric, &node_buf, from, 0); + if (!node) + goto error; + + fabric->fabric.from_node = (ibnd_node_t *)node; + + port = add_port_to_node(fabric, node, &port_buf); + if (!port) + IBPANIC("out of memory"); + + if (node->node.info.type != IBND_SWITCH_NODE && + get_remote_node(fabric, node, port, from, node->node.info.localport, 0) < 0) + return ((ibnd_fabric_t *)fabric); + + for (dist = 0; 
dist <= max_hops; dist++) { + + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + + path = &node->node.path_portid; + + IBND_DEBUG("dist %d node %p\n", dist, node); + dump_endnode(path, "processing", node, port); + + for (i = 1; i <= node->node.info.numports; i++) { + if (i == node->node.info.localport) + continue; + + if (get_port_info(fabric, &port_buf, i, path) < 0) { + IBWARN("can't reach node %s port %d", portid2str(path), i); + continue; + } + + port = find_existing_port_node(node, &port_buf); + if (port) + continue; + + port = add_port_to_node(fabric, node, &port_buf); + if (!port) + IBPANIC("out of memory"); + + /* If switch, set port GUID to node port GUID */ + if (node->node.info.type == IBND_SWITCH_NODE) + port->port.guid = node->node.info.nodeportguid; + + get_remote_node(fabric, node, port, path, i, dist); + } + } + } + + fabric->fabric.chassis = group_nodes(fabric); + + return ((ibnd_fabric_t *)fabric); +error: + free(fabric); + return (NULL); +} + +static void +destroy_node(struct ibnd_node *node) +{ + int p = 0; + + for (p = 0; p <= node->node.info.numports; p++) { + free(node->node.ports[p]); + } + free(node->node.ports); + free(node); +} + +void +ibnd_destroy_fabric(ibnd_fabric_t *fabric) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + int dist = 0; + struct ibnd_node *node = NULL; + struct ibnd_node *next = NULL; + ibnd_chassis_t *ch, *ch_next; + + ch = f->first_chassis; + while (ch) { + ch_next = ch->next; + free(ch); + ch = ch_next; + } + for (dist = 0; dist <= MAXHOPS; dist++) { + node = f->nodesdist[dist]; + while (node) { + next = node->dnext; + destroy_node(node); + node = next; + } + } + if (f->ibmad_port) + mad_rpc_close_port(f->ibmad_port); + free(f); +} + +void +ibnd_debug(int i) +{ + if (i) { + ibdebug++; + madrpc_show_errors(1); + umad_debug(i); + } else { + ibdebug = 0; + madrpc_show_errors(0); + umad_debug(0); + } +} + +void +ibnd_show_progress(int i) +{ + show_progress = i; +} + +const char* 
+ibnd_node_type_str(ibnd_node_t *node) +{ + switch(node->info.type) { + case IBND_CA_NODE: return "Ca"; + case IBND_SWITCH_NODE: return "Switch"; + case IBND_ROUTER_NODE: return "Router"; + } + return "??"; +} + +const char* +ibnd_node_type_str_short(ibnd_node_t *node) +{ + switch(node->info.type) { + case IBND_SWITCH_NODE: return "SW"; + case IBND_CA_NODE: return "CA"; + case IBND_ROUTER_NODE: return "RT"; + } + return "??"; +} + + +void +ibnd_iter_nodes(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + void *user_data) +{ + ibnd_node_t *cur = NULL; + + for (cur = fabric->nodes; cur; cur = cur->next) { + func(cur, user_data); + } +} + + +void +ibnd_iter_nodes_type(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + ibnd_node_type_t node_type, + void *user_data) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + struct ibnd_node *list = NULL; + struct ibnd_node *cur = NULL; + + switch (node_type) { + case IBND_SWITCH_NODE: + list = f->switches; + break; + case IBND_CA_NODE: + list = f->ch_adapters; + break; + case IBND_ROUTER_NODE: + list = f->routers; + break; + default: + IBND_DEBUG("Invalid node_type specified %d\n", node_type); + break; + } + + for (cur = list; cur; cur = cur->type_next) { + func((ibnd_node_t *)cur, user_data); + } +} + diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h new file mode 100644 index 0000000..89f238f --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2008 Lawrence Livermore National Laboratory + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/** ========================================================================= + * Define the internal data structures. 
+ */ + +#ifndef _INTERNAL_H_ +#define _INTERNAL_H_ + +#include + +struct ibnd_node { + /* This member MUST BE FIRST */ + ibnd_node_t node; + + /* internal use only */ + unsigned char ch_found; + struct ibnd_node *htnext; /* hash table list */ + struct ibnd_node *dnext; /* nodesdist next */ + struct ibnd_node *type_next; /* next based on type */ +}; +#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node) + +struct ibnd_port { + /* This member MUST BE FIRST */ + ibnd_port_t port; + + /* internal use only */ + struct ibnd_port *htnext; +}; +#define CONV_PORT_INTERNAL(port) ((struct ibnd_port *)port) + +struct ibnd_fabric { + /* This member MUST BE FIRST */ + ibnd_fabric_t fabric; + + /* internal use only */ + void *ibmad_port; + struct ibnd_node *nodestbl[HTSZ]; + struct ibnd_port *portstbl[HTSZ]; + struct ibnd_node *nodesdist[MAXHOPS+1]; + ibnd_chassis_t *first_chassis; + ibnd_chassis_t *current_chassis; + ibnd_chassis_t *last_chassis; + struct ibnd_node *switches; + struct ibnd_node *ch_adapters; + struct ibnd_node *routers; +}; +#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) + +#endif /* _INTERNAL_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map new file mode 100644 index 0000000..5e8c315 --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map @@ -0,0 +1,27 @@ +IBNETDISC_1.0 { + global: + ibnd_debug; + ibnd_show_progress; + ibnd_discover_fabric; + ibnd_cache_fabric; + ibnd_read_fabric; + ibnd_destroy_fabric; + ibnd_find_node_guid; + ibnd_update_node; + ibnd_find_node_dr; + ibnd_linkwidth_str; + ibnd_linkspeed_str; + ibnd_node_type_str; + ibnd_node_type_str_short; + ibnd_is_xsigo_guid; + ibnd_is_xsigo_tca; + ibnd_is_xsigo_hca; + ibnd_get_chassis_guid; + ibnd_get_chassis_type; + ibnd_get_chassis_slot_str; + ibnd_linkstate_str; + ibnd_physstate_str; + ibnd_iter_nodes; + ibnd_iter_nodes_type; + local: *; +}; diff --git 
a/infiniband-diags/libibnetdisc/test/iblinkinfotest.c b/infiniband-diags/libibnetdisc/test/iblinkinfotest.c new file mode 100644 index 0000000..e055aee --- /dev/null +++ b/infiniband-diags/libibnetdisc/test/iblinkinfotest.c @@ -0,0 +1,395 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +char *argv0 = "iblinkinfotest"; +static FILE *f; + +static char *node_name_map_file = NULL; +static nn_map_t *node_name_map = NULL; + +static int timeout_ms = 500; + +static int debug = 0; +#define DEBUG(str, args...) \ + if (debug) fprintf(stderr, str, ##args) + +static int down_links_only = 0; +static int line_mode = 0; +static int add_sw_settings = 0; +static int print_port_guids = 0; + +static unsigned int +get_max(unsigned int num) +{ + unsigned int v = num; // 32-bit word to find the log base 2 of + unsigned r = 0; // r will be lg(v) + + while (v >>= 1) // unroll for more speed... + { + r++; + } + + return (1 << r); +} + +void +get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port) +{ + int max_speed = 0; + + int max_width = get_max(port->info.link_width_supported + & port->remoteport->info.link_width_supported); + if ((max_width & port->info.link_width_active) == 0) { + // we are not at the max supported width + // print what we could be at. + snprintf(width_msg, msg_size, "Could be %s", + ibnd_linkwidth_str(max_width)); + } + + max_speed = get_max(port->info.link_speed_supported + & port->remoteport->info.link_speed_supported); + if ((max_speed & port->info.link_speed_active) == 0) { + // we are not at the max supported speed + // print what we could be at. 
+ snprintf(speed_msg, msg_size, "Could be %s", + ibnd_linkspeed_str(max_speed, 1)); + } +} + +void +print_port(ibnd_node_t *node, ibnd_port_t *port) +{ + char remote_guid_str[256]; + char remote_str[256]; + char link_str[256]; + char width_msg[256]; + char speed_msg[256]; + char ext_port_str[256]; + + if (!port) + return; + + remote_guid_str[0] = '\0'; + remote_str[0] = '\0'; + link_str[0] = '\0'; + width_msg[0] = '\0'; + speed_msg[0] = '\0'; + + if (port->remoteport) { + char remote_name_buf[256]; + strncpy(remote_name_buf, port->remoteport->node->nodedesc, 256); + + if (port->remoteport->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum); + else + ext_port_str[0] = '\0'; + + get_msg(width_msg, speed_msg, 256, port); + if (line_mode) { + if (print_port_guids) { + snprintf(remote_guid_str, 256, + "0x%016"PRIx64" ", + port->remoteport->guid); + } else { + snprintf(remote_guid_str, 256, + "0x%016"PRIx64" ", + port->remoteport->node->info.nodeguid); + } + } + + snprintf(remote_str, 256, + "%s%6d %4d[%2s] \"%s\" (%s %s)\n", + remote_guid_str, + port->remoteport->info.base_lid ? 
+ port->remoteport->info.base_lid : + port->remoteport->node->smalid, + port->remoteport->portnum, + ext_port_str, + remap_node_name(node_name_map, + port->remoteport->node->info.nodeguid, + remote_name_buf), + width_msg, + speed_msg + ); + } else { + snprintf(remote_str, 256, + "%6s %4s[%2s] \"\" ( )\n", "", "", ""); + } + + if (add_sw_settings) { + snprintf(link_str, 256, + "(%3s %s %6s/%8s) (HOQ:%d VL_Stall:%d)", + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 1), + ibnd_linkstate_str(port->info.port_state), + ibnd_physstate_str(port->info.phys_state), + port->info.hoq_lifetime, + port->info.vl_stall_count + ); + } else { + snprintf(link_str, 256, + "(%3s %s %6s/%8s)", + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 1), + ibnd_linkstate_str(port->info.port_state), + ibnd_physstate_str(port->info.phys_state) + ); + } + + if (port->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->ext_portnum); + else + ext_port_str[0] = '\0'; + + if (line_mode) { + char name_buf[256]; + strncpy(name_buf, node->nodedesc, 256); + printf("0x%016"PRIx64" \"%30s\" %6d %4d[%2s] ==%s==> %s", + node->info.nodeguid, + remap_node_name(node_name_map, + node->info.nodeguid, + name_buf), + node->smalid, port->portnum, + ext_port_str, + link_str, + remote_str + ); + } else { + printf(" %6d %4d[%2s] ==%s==> %s", + node->smalid, port->portnum, + ext_port_str, + link_str, + remote_str + ); + } +} + +void +print_switch(ibnd_node_t *node, void *user_data) +{ + int i = 0; + + if (!line_mode) { + char name_buf[256]; + strncpy(name_buf, node->nodedesc, 256); + printf("Switch 0x%016"PRIx64" %s:\n", + node->info.nodeguid, + remap_node_name(node_name_map, + node->info.nodeguid, + name_buf)); + } + + for (i = 1; i <= node->info.numports; i++) { + ibnd_port_t *port = node->ports[i]; + if (!port) + continue; + if (!down_links_only || port->info.port_state == IBND_LINK_DOWN) { + print_port(node, 
port); + } + } +} + +void +usage(void) +{ + fprintf(stderr, + "Usage: %s [-hclp -S -D -C -P ]\n" + " Report link speed and connection for each port of each switch which is active\n" + " -h This help message\n" + " -S output only the node specified by guid\n" + " -D print only node specified by \n" + " -f specify node to start \"from\"\n" + " -n Number of hops to include away from specified node\n" + " -d print only down links\n" + " -l (line mode) print all information for each link on each line\n" + " -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n" + + + " -t timeout for any single fabric query\n" + " -s show errors\n" + " --node-name-map use specified node name map\n" + + " -C use selected Channel Adaptor name for queries\n" + " -P use selected channel adaptor port for queries\n" + " -g print port guids instead of node guids\n" + " --debug print debug messages\n" + , + argv0); + exit(-1); +} + +int +main(int argc, char **argv) +{ + char *ca = 0; + int ca_port = 0; + ibnd_fabric_t *fabric = NULL; + uint64_t guid = 0; + char *dr_path = NULL; + char *from = NULL; + int hops = 0; + ib_portid_t port_id; + + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:"; + static const struct option long_opts[] = { + { "S", 1, 0, 'S'}, + { "D", 1, 0, 'D'}, + { "num-hops", 1, 0, 'n'}, + { "down-links-only", 0, 0, 'd'}, + { "line-mode", 0, 0, 'l'}, + { "ca-name", 1, 0, 'C'}, + { "ca-port", 1, 0, 'P'}, + { "timeout", 1, 0, 't'}, + { "show", 0, 0, 's'}, + { "print-port-guids", 0, 0, 'g'}, + { "print-additional", 0, 0, 'p'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { "node-name-map", 1, 0, 1}, + { "debug", 0, 0, 2}, + { "from", 1, 0, 'f'}, + { } + }; + + f = stdout; + + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 1: + node_name_map_file = strdup(optarg); + break; + case 2: + debug = 1; + ibnd_debug(1); + break; + case 'f': + from = strdup(optarg); 
+ break; + case 'C': + ca = strdup(optarg); + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'D': + dr_path = strdup(optarg); + break; + case 'n': + hops = (int)strtol(optarg, NULL, 0); + break; + case 'd': + down_links_only = 1; + break; + case 'l': + line_mode = 1; + break; + case 't': + timeout_ms = strtoul(optarg, 0, 0); + break; + case 'g': + print_port_guids = 1; + break; + case 'S': + guid = (uint64_t)strtoull(optarg, 0, 0); + break; + case 'p': + add_sw_settings = 1; + break; + default: + usage(); + break; + } + } + argc -= optind; + argv += optind; + + if (argc && !(f = fopen(argv[0], "w"))) + fprintf(stderr, "can't open file %s for writing", argv[0]); + + node_name_map = open_node_name_map(node_name_map_file); + + if (from) { + /* only scan part of the fabric */ + str2drpath(&(port_id.drpath), from, 0, 0); + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + guid = 0; + } else { + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + } + + if (guid) { + ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid); + print_switch(sw, NULL); + } else if (dr_path) { + ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path); + print_switch(sw, NULL); + } else { + ibnd_iter_nodes_type(fabric, print_switch, IBND_SWITCH_NODE, NULL); + } + + ibnd_destroy_fabric(fabric); + + close_node_name_map(node_name_map); + exit(0); +} diff --git a/infiniband-diags/libibnetdisc/test/ibnetdisctest.c b/infiniband-diags/libibnetdisc/test/ibnetdisctest.c new file mode 100644 index 0000000..c195f6a --- /dev/null +++ b/infiniband-diags/libibnetdisc/test/ibnetdisctest.c @@ -0,0 +1,680 @@ +/* + * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +static int verbose; +#define LIST_CA_NODE (1 << IBND_CA_NODE) +#define LIST_SWITCH_NODE (1 << IBND_SWITCH_NODE) +#define LIST_ROUTER_NODE (1 << IBND_ROUTER_NODE) + +char *argv0 = "ibnetdiscover"; +static FILE *f; + +static char *node_name_map_file = NULL; +static nn_map_t *node_name_map = NULL; + +static int timeout_ms = 2000; + +static int debug = 0; +#define DEBUG(str, args...) 
\ + if (debug) fprintf(stderr, str, ##args) + + +char * +node_name(ibnd_node_t *node) +{ + static char buf[256]; + + switch(node->info.type) { + case IBND_CA_NODE: + sprintf(buf, "\"%s", "H"); + break; + case IBND_SWITCH_NODE: + sprintf(buf, "\"%s", "S"); + break; + case IBND_ROUTER_NODE: + sprintf(buf, "\"%s", "R"); + break; + default: + sprintf(buf, "\"%s", "?"); + break; + } + sprintf(buf+2, "-%016" PRIx64 "\"", node->info.nodeguid); + + return buf; +} + +void +list_node(ibnd_node_t *node, void *user_data) +{ + char *nodename = remap_node_name(node_name_map, node->info.nodeguid, + node->nodedesc); + + fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", + ibnd_node_type_str(node), + node->info.nodeguid, node->info.numports, node->info.devid, + node->info.vendid, + nodename); + + free(nodename); +} + +void +list_nodes(ibnd_fabric_t *fabric, int list) +{ + if (list & LIST_CA_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IBND_CA_NODE, NULL); + } + if (list & LIST_SWITCH_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IBND_SWITCH_NODE, NULL); + } + if (list & LIST_ROUTER_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IBND_ROUTER_NODE, NULL); + } +} + +void +out_ids(ibnd_node_t *node, int group, char *chname) +{ + fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->info.vendid, node->info.devid); + if (node->info.sysimgguid) + fprintf(f, "sysimgguid=0x%" PRIx64, node->info.sysimgguid); + if (group + && node->chassis && node->chassis->chassisnum) { + fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum); + if (chname) + fprintf(f, " (%s)", clean_nodedesc(chname)); + if (ibnd_is_xsigo_tca(node->info.nodeguid) + && node->ports[1] + && node->ports[1]->remoteport) + fprintf(f, " slot %d", node->ports[1]->remoteport->portnum); + } + fprintf(f, "\n"); +} + + +uint64_t +out_chassis(ibnd_fabric_t *fabric, int chassisnum) +{ + uint64_t guid; + + fprintf(f, "\nChassis %d", chassisnum); + guid = ibnd_get_chassis_guid(fabric, chassisnum); + if 
(guid) + fprintf(f, " (guid 0x%" PRIx64 ")", guid); + fprintf(f, "\n"); + return guid; +} + +void +out_switch(ibnd_node_t *node, int group, char *chname) +{ + char *str; + char str2[256]; + char *nodename = NULL; + + out_ids(node, group, chname); + fprintf(f, "switchguid=0x%" PRIx64, node->info.nodeguid); + fprintf(f, "(%" PRIx64 ")", node->info.nodeportguid); + if (group) { + str = ibnd_get_chassis_type(node); + if (str) + fprintf(f, "%s ", str); + str = ibnd_get_chassis_slot_str(node, str2, 256); + if (str) + fprintf(f, "%s", str); + } + + nodename = remap_node_name(node_name_map, node->info.nodeguid, + node->nodedesc); + + fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n", + node->info.numports, node_name(node), + nodename, + node->sw_info.smaenhsp0 ? "enhanced" : "base", + node->smalid, node->smalmc); + + free(nodename); +} + +void +out_ca(ibnd_node_t *node, int group, char *chname) +{ + char *node_type; + char *node_type2; + + out_ids(node, group, chname); + switch(node->info.type) { + case IBND_CA_NODE: + node_type = "ca"; + node_type2 = "Ca"; + break; + case IBND_ROUTER_NODE: + node_type = "rt"; + node_type2 = "Rt"; + break; + default: + node_type = "???"; + node_type2 = "???"; + break; + } + + fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->info.nodeguid); + fprintf(f, "%s\t%d %s\t\t# \"%s\"", + node_type2, node->info.numports, node_name(node), + clean_nodedesc(node->nodedesc)); + if (group && ibnd_is_xsigo_hca(node->info.nodeguid)) + fprintf(f, " (scp)"); + fprintf(f, "\n"); +} + +#define OUT_BUFFER_SIZE 16 +static char * +out_ext_port(ibnd_port_t *port, int group) +{ + static char mapping[OUT_BUFFER_SIZE]; + + if (group && port->ext_portnum != 0) { + snprintf(mapping, OUT_BUFFER_SIZE, + "[ext %d]", port->ext_portnum); + return (mapping); + } + + return (NULL); +} + +void +out_switch_port(ibnd_port_t *port, int group) +{ + char *ext_port_str = NULL; + char *rem_nodename = NULL; + + DEBUG("port %p:%d remoteport %p\n", port, 
port->portnum, port->remoteport); + fprintf(f, "[%d]", port->portnum); + + ext_port_str = out_ext_port(port, group); + if (ext_port_str) + fprintf(f, "%s", ext_port_str); + + rem_nodename = remap_node_name(node_name_map, + port->remoteport->node->info.nodeguid, + port->remoteport->node->nodedesc); + + ext_port_str = out_ext_port(port->remoteport, group); + fprintf(f, "\t%s[%d]%s", + node_name(port->remoteport->node), + port->remoteport->portnum, + ext_port_str ? ext_port_str : ""); + if (port->remoteport->node->info.type != IBND_SWITCH_NODE) + fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid); + fprintf(f, "\t\t# \"%s\" lid %d %s%s", + rem_nodename, + port->remoteport->node->info.type == IBND_SWITCH_NODE ? + port->remoteport->node->smalid : + port->remoteport->info.base_lid, + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 0)); + + if (ibnd_is_xsigo_tca(port->remoteport->guid)) + fprintf(f, " slot %d", port->portnum); + else if (ibnd_is_xsigo_hca(port->remoteport->guid)) + fprintf(f, " (scp)"); + fprintf(f, "\n"); + + free(rem_nodename); +} + +void +out_ca_port(ibnd_port_t *port, int group) +{ + char *str = NULL; + char *rem_nodename = NULL; + + fprintf(f, "[%d]", port->portnum); + if (port->node->info.type != IBND_SWITCH_NODE) + fprintf(f, "(%" PRIx64 ") ", port->guid); + fprintf(f, "\t%s[%d]", + node_name(port->remoteport->node), + port->remoteport->portnum); + str = out_ext_port(port->remoteport, group); + if (str) + fprintf(f, "%s", str); + if (port->remoteport->node->info.type != IBND_SWITCH_NODE) + fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid); + + rem_nodename = remap_node_name(node_name_map, + port->remoteport->node->info.nodeguid, + port->remoteport->node->nodedesc); + + fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n", + port->info.base_lid, port->info.lmc, rem_nodename, + port->remoteport->node->info.type == IBND_SWITCH_NODE ? 
+ port->remoteport->node->smalid : + port->remoteport->info.base_lid, + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 0)); + + free(rem_nodename); +} + +struct iter_user_data { + int group; + int skip_chassis_nodes; +}; + +static void +switch_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("SWITCH: node %p\n", node); + + /* skip chassis based switches if flagged */ + if (data->skip_chassis_nodes && node->chassis && node->chassis->chassisnum) + return; + + out_switch(node, data->group, NULL); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_switch_port(port, data->group); + } +} + +static void +ca_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("CA: node %p\n", node); + /* Now, skip chassis based CAs */ + if (data->group && node->chassis && node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + +static void +router_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("RT: node %p\n", node); + /* Now, skip chassis based RTs */ + if (data->group && node->chassis && node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + +int +dump_topology(int group, ibnd_fabric_t *fabric) +{ + ibnd_node_t *node; + ibnd_port_t *port; + int i = 0, p = 0; + time_t t = time(0); + uint64_t 
chguid; + char *chname = NULL; + struct iter_user_data iter_user_data; + + fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); + fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); + fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", + fabric->from_node->info.nodeguid, fabric->from_node->info.nodeportguid); + + /* Make pass on switches */ + if (group) { + ibnd_chassis_t *ch = NULL; + + /* Chassis based switches first */ + for (ch = fabric->chassis; ch; ch = ch->next) { + int n = 0; + + if (!ch->chassisnum) + continue; + chguid = out_chassis(fabric, ch->chassisnum); + + chname = NULL; +/** + * Will this work for Xsigo? + */ + if (ibnd_is_xsigo_guid(chguid)) { + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (ibnd_is_xsigo_hca(node->info.nodeguid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); + } + } + +#if 0 +/** + * vs. this? + * I don't want to expose the nodesdist array to the end user. 
+ */ + for (node = fabric->nodesdist[MAXHOPS]; node; node = node->dnext) { + if (!node->chrecord || + !node->chrecord->chassisnum) + continue; + + if (node->chrecord->chassisnum != ch->chassisnum) + continue; + + if (ibnd_is_xsigo_hca(node->nodeguid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); + } + } +#endif + } + + fprintf(f, "\n# Spine Nodes"); + for (n = 1; n <= SPINES_MAX_NUM; n++) { + if (ch->spinenode[n]) { + out_switch(ch->spinenode[n], group, chname); + for (p = 1; p <= ch->spinenode[n]->info.numports; p++) { + port = ch->spinenode[n]->ports[p]; + if (port && port->remoteport) + out_switch_port(port, group); + } + } + } + fprintf(f, "\n# Line Nodes"); + for (n = 1; n <= LINES_MAX_NUM; n++) { + if (ch->linenode[n]) { + out_switch(ch->linenode[n], group, chname); + for (p = 1; p <= ch->linenode[n]->info.numports; p++) { + port = ch->linenode[n]->ports[p]; + if (port && port->remoteport) + out_switch_port(port, group); + } + } + } + + fprintf(f, "\n# Chassis Switches"); + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->info.type == IBND_SWITCH_NODE) { + out_switch(node, group, chname); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_switch_port(port, group); + } + } + } + + fprintf(f, "\n# Chassis CAs"); + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->info.type == IBND_CA_NODE) { + out_ca(node, group, chname); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, group); + } + } + } + + } + + } else { /* !group */ + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; + + ibnd_iter_nodes_type(fabric, switch_iter_func, + IBND_SWITCH_NODE, &iter_user_data); + } + + chname = NULL; + if (group) { + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 1; + + fprintf(f, "\nNon-Chassis Nodes\n"); 
+ ibnd_iter_nodes_type(fabric, switch_iter_func, + IBND_SWITCH_NODE, &iter_user_data); + + } + + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; + + /* Make pass on CAs */ + ibnd_iter_nodes_type(fabric, ca_iter_func, IBND_CA_NODE, + &iter_user_data); + + /* make pass on routers */ + ibnd_iter_nodes_type(fabric, router_iter_func, IBND_ROUTER_NODE, + &iter_user_data); + + return i; +} + + +void dump_ports_report (ibnd_node_t *node, void *user_data) +{ + int p = 0; + ibnd_port_t *port = NULL; + + /* for each port */ + for (p = node->info.numports, port = node->ports[p]; + p > 0; + port = node->ports[--p]) { + if (port == NULL) + continue; + + fprintf(stdout, + "%2s %5d %2d 0x%016" PRIx64 " %s %s", + ibnd_node_type_str_short(node), + node->info.type == IBND_SWITCH_NODE ? + node->smalid : port->info.base_lid, + port->portnum, + port->guid, + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 0)); + if (port->remoteport) + fprintf(stdout, + " - %2s %5d %2d 0x%016" PRIx64 + " ( '%s' - '%s' )\n", + ibnd_node_type_str_short(port->remoteport->node), + port->remoteport->node->info.type == IBND_SWITCH_NODE ? 
+ port->remoteport->node->smalid : + port->remoteport->info.base_lid, + port->remoteport->portnum, + port->remoteport->guid, + port->node->nodedesc, + port->remoteport->node->nodedesc); + else + fprintf(stdout, "%36s'%s'\n", "", + port->node->nodedesc); + } +} + +void +usage(void) +{ + fprintf(stderr, "Usage: %s [-d(ebug)] -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port " + "-t(imeout) timeout_ms --node-name-map node-name-map] -p(orts) []\n", + argv0); + fprintf(stderr, " --node-name-map specify a node name map file\n"); + exit(-1); +} + +int +main(int argc, char **argv) +{ + int list = 0; + char *ca = 0; + int ca_port = 0; + int group = 0; + int ports_report = 0; + ibnd_fabric_t *fabric = NULL; + + static char const str_opts[] = "C:P:t:devslgHSRpVhu"; + static const struct option long_opts[] = { + { "C", 1, 0, 'C'}, + { "P", 1, 0, 'P'}, + { "debug", 0, 0, 'd'}, + { "verbose", 0, 0, 'v'}, + { "show", 0, 0, 's'}, + { "list", 0, 0, 'l'}, + { "grouping", 0, 0, 'g'}, + { "Hca_list", 0, 0, 'H'}, + { "Switch_list", 0, 0, 'S'}, + { "Router_list", 0, 0, 'R'}, + { "timeout", 1, 0, 't'}, + { "node-name-map", 1, 0, 1}, + { "ports", 0, 0, 'p'}, + { "Version", 0, 0, 'V'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { } + }; + + f = stdout; + + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 1: + node_name_map_file = strdup(optarg); + break; + case 'C': + ca = optarg; + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'd': + debug = 1; + ibnd_debug(1); + break; + case 't': + timeout_ms = strtoul(optarg, 0, 0); + break; + case 'v': + verbose++; + break; + case 's': + ibnd_show_progress(1); + break; + case 'l': + list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; + break; + case 'g': + group = 1; + break; + case 'S': + list |= LIST_SWITCH_NODE; + break; + case 'H': + list |= LIST_CA_NODE; + 
break; + case 'R': + list |= LIST_ROUTER_NODE; + break; + case 'p': + ports_report = 1; + break; + default: + usage(); + break; + } + } + argc -= optind; + argv += optind; + + if (argc && !(f = fopen(argv[0], "w"))) + fprintf(stderr, "can't open file %s for writing", argv[0]); + + node_name_map = open_node_name_map(node_name_map_file); + + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + + if (ports_report) + ibnd_iter_nodes(fabric, + dump_ports_report, + NULL); + else if (list) + list_nodes(fabric, list); + else + dump_topology(group, fabric); + + ibnd_destroy_fabric(fabric); + close_node_name_map(node_name_map); + exit(0); +} diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c new file mode 100644 index 0000000..1fabaac --- /dev/null +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +char *argv0 = "iblinkinfotest"; +static FILE *f; + +static int timeout_ms = 500; + +void +usage(void) +{ + fprintf(stderr, + "Usage: %s [-hclp -S -D -C -P ]\n" + " Report link speed and connection for each port of each switch which is active\n" + " -h This help message\n" + " -i Number of iterations to run (default -1 == infinite)\n" + + " -S output only the node specified by guid\n" + " -D print only node specified by \n" + " -f specify node to start \"from\"\n" + " -n Number of hops to include away from specified node\n" + + " -t timeout for any single fabric query\n" + " -s show errors\n" + + " -C use selected Channel Adaptor name for queries\n" + " -P use selected channel adaptor port for queries\n" + " --debug print debug messages\n" + , + argv0); + exit(-1); +} + +int +main(int argc, char **argv) +{ + char *ca = 0; + int ca_port = 0; + ibnd_fabric_t *fabric = NULL; + uint64_t guid = 0; + char *dr_path = NULL; + char *from = NULL; + int hops = 0; + ib_portid_t port_id; + int iters = -1; + + static char const str_opts[] = "S:D:n:C:P:t:shuf:i:"; + 
static const struct option long_opts[] = { + { "S", 1, 0, 'S'}, + { "D", 1, 0, 'D'}, + { "num-hops", 1, 0, 'n'}, + { "ca-name", 1, 0, 'C'}, + { "ca-port", 1, 0, 'P'}, + { "timeout", 1, 0, 't'}, + { "show", 0, 0, 's'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { "debug", 0, 0, 2}, + { "from", 1, 0, 'f'}, + { "iters", 1, 0, 'i'}, + { } + }; + + f = stdout; + + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 2: + ibnd_debug(1); + break; + case 'f': + from = strdup(optarg); + break; + case 'C': + ca = strdup(optarg); + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'D': + dr_path = strdup(optarg); + break; + case 'n': + hops = (int)strtol(optarg, NULL, 0); + break; + case 'i': + iters = (int)strtol(optarg, NULL, 0); + break; + case 't': + timeout_ms = strtoul(optarg, 0, 0); + break; + case 'S': + guid = (uint64_t)strtoull(optarg, 0, 0); + break; + default: + usage(); + break; + } + } + argc -= optind; + argv += optind; + + while (iters == -1 || iters-- > 0) { + if (from) { + /* only scan part of the fabric */ + str2drpath(&(port_id.drpath), from, 0, 0); + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, + &port_id, hops)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + guid = 0; + } else { + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + } + + ibnd_destroy_fabric(fabric); + } + + exit(0); +} -- 1.5.4.5 From weiny2 at llnl.gov Fri Jan 9 15:47:54 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 9 Jan 2009 15:47:54 -0800 Subject: [ofa-general] [PATCH 2/3 - no ibcommon] Convert iblinkinfo.pl to C and use new ibnetdisc library. 
Message-ID: <20090109154754.1d526572.weiny2@llnl.gov> From 139d0ad5ecffecd5c325d865e96cdf038e03a3e5 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Mon, 1 Dec 2008 14:55:10 -0800 Subject: [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library. Signed-off-by: weiny2 at llnl.gov --- infiniband-diags/Makefile.am | 12 +- infiniband-diags/scripts/iblinkinfo.pl | 327 ------------------------ infiniband-diags/src/iblinkinfo.c | 423 ++++++++++++++++++++++++++++++++ 3 files changed, 432 insertions(+), 330 deletions(-) delete mode 100755 infiniband-diags/scripts/iblinkinfo.pl create mode 100644 infiniband-diags/src/iblinkinfo.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index 8e8c3c1..d127a4d 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,6 +1,7 @@ SUBDIRS = libibnetdisc -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband \ + -I$(top_builddir)/libibnetdisc/include if DEBUG DBGFLAGS = -ggdb -D_DEBUG_ @@ -11,7 +12,7 @@ endif sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ src/ibroute src/ibstat src/ibsysstat src/ibtracert \ src/perfquery src/sminfo src/smpdump src/smpquery \ - src/saquery src/vendstat + src/saquery src/vendstat src/iblinkinfo.pl if ENABLE_TEST_UTILS sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test @@ -28,7 +29,7 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \ scripts/dump_lfts.sh scripts/dump_mfts.sh \ scripts/set_nodedesc.sh \ scripts/ibqueryerrors.pl scripts/ibswportwatch.pl \ - scripts/iblinkinfo.pl scripts/ibprintswitch.pl \ + scripts/ibprintswitch.pl \ scripts/ibprintca.pl scripts/ibprintrt.pl \ scripts/ibfindnodesusing.pl scripts/ibidsverify.pl \ scripts/check_lft_balance.pl @@ -40,6 +41,11 @@ src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common 
src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS) src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) +src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c +src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS) +src_iblinkinfo_pl_LDFLAGS = -Wl,--rpath -Wl,$(libdir) \ + -L$(srcdir)/libibnetdisc -libnetdisc + src_ibping_SOURCES = src/ibping.c src/ibdiag_common.c src_ibping_CFLAGS = -Wall $(DBGFLAGS) diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl deleted file mode 100755 index b6b27ce..0000000 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ /dev/null @@ -1,327 +0,0 @@ -#!/usr/bin/perl -# -# Copyright (c) 2006 The Regents of the University of California. -# Copyright (c) 2007-2008 Voltaire, Inc. All rights reserved. -# -# Produced at Lawrence Livermore National Laboratory. -# Written by Ira Weiny . -# -# This software is available to you under a choice of one of two -# licenses. You may choose to be licensed under the terms of the GNU -# General Public License (GPL) Version 2, available from the file -# COPYING in the main directory of this source tree, or the -# OpenIB.org BSD license below: -# -# Redistribution and use in source and binary forms, with or -# without modification, are permitted provided that the following -# conditions are met: -# -# - Redistributions of source code must retain the above -# copyright notice, this list of conditions and the following -# disclaimer. -# -# - Redistributions in binary form must reproduce the above -# copyright notice, this list of conditions and the following -# disclaimer in the documentation and/or other materials -# provided with the distribution. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, -# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF -# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -# NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS -# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN -# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN -# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -# - -use strict; - -use Getopt::Std; -use IBswcountlimits; - -sub usage_and_exit -{ - my $prog = $_[0]; - print -"Usage: $prog [-Rhclp -S -D -C -P ]\n"; - print -" Report link speed and connection for each port of each switch which is active\n"; - print " -h This help message\n"; - print -" -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; - print -" -D output only the switch specified by direct route path\n"; - print " -S output only the switch specified by (hex format)\n"; - print " -d print only down links\n"; - print - " -l (line mode) print all information for each link on each line\n"; - print -" -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n"; - print " -c print port capabilities (enabled/supported values)\n"; - print " -C use selected Channel Adaptor name for queries\n"; - print " -P use selected channel adaptor port for queries\n"; - print " -g print port guids instead of node guids\n"; - exit 2; -} - -my $argv0 = `basename $0`; -my $regenerate_map = undef; -my $single_switch = undef; -my $direct_route = undef; -my $line_mode = undef; -my $print_add_switch = undef; -my $print_extended_cap = undef; -my $only_down_links = undef; -my $ca_name = ""; -my $ca_port = ""; -my $print_port_guids = undef; -my $switch_found = "no"; -chomp $argv0; - -if (!getopts("hcpldRS:D:C:P:g")) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_D) { $direct_route = $Getopt::Std::opt_D; } -if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } -if (defined $Getopt::Std::opt_S) { - $single_switch = format_guid($Getopt::Std::opt_S); -} -if (defined 
$Getopt::Std::opt_d) { $only_down_links = $Getopt::Std::opt_d; } -if (defined $Getopt::Std::opt_l) { $line_mode = $Getopt::Std::opt_l; } -if (defined $Getopt::Std::opt_p) { $print_add_switch = $Getopt::Std::opt_p; } -if (defined $Getopt::Std::opt_c) { $print_extended_cap = $Getopt::Std::opt_c; } -if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } -if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } -if (defined $Getopt::Std::opt_g) { $print_port_guids = $Getopt::Std::opt_g; } - -my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); - -sub main -{ - get_link_ends($regenerate_map, $ca_name, $ca_port); - if (defined($direct_route)) { - # convert DR to guid, then use original single_switch option - $single_switch = convert_dr_to_guid($direct_route); - if (!defined($single_switch) || !is_switch($single_switch)) { - printf("The direct route (%s) does not map to a switch.\n", - $direct_route); - return; - } - } - foreach my $switch (sort (keys(%IBswcountlimits::link_ends))) { - if ($single_switch && $switch ne $single_switch) { - next; - } else { - $switch_found = "yes"; - } - my $switch_prompt = "no"; - my $num_ports = get_num_ports($switch, $ca_name, $ca_port); - if ($num_ports == 0) { - printf("ERROR: switch $switch has 0 ports???\n"); - } - my @output_lines = undef; - my $pkt_lifetime = ""; - my $pkt_life_prompt = ""; - my $port_timeouts = ""; - my $print_switch = "yes"; - if ($only_down_links) { $print_switch = "no"; } - if ($print_add_switch) { - my $data = `smpquery $extra_smpquery_params -G switchinfo $switch`; - if ($data eq "") { - printf("ERROR: failed to get switchinfo for $switch\n"); - } - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^LifeTime:\.+(.*)/) { $pkt_lifetime = $1; } - } - $pkt_life_prompt = sprintf(" (LT: %2s)", $pkt_lifetime); - } - foreach my $port (1 .. 
$num_ports) { - my $hr = $IBswcountlimits::link_ends{$switch}{$port}; - if ($switch_prompt eq "no" && !$line_mode) { - my $switch_name = ""; - my $tmp_port = $port; - while ($switch_name eq "" && $tmp_port <= $num_ports) { - # the first port is down find switch name with up port - my $hr = $IBswcountlimits::link_ends{$switch}{$tmp_port}; - $switch_name = $hr->{loc_desc}; - $tmp_port++; - } - if ($switch_name eq "") { - printf( - "WARNING: Switch Name not found for $switch\n"); - } - push( - @output_lines, - sprintf( - "Switch %18s %s%s:\n", - $switch, $switch_name, $pkt_life_prompt - ) - ); - $switch_prompt = "yes"; - } - my $data = - `smpquery $extra_smpquery_params -G portinfo $switch $port`; - if ($data eq "") { - printf( - "ERROR: failed to get portinfo for $switch port $port\n"); - } - my @lines = split("\n", $data); - my $speed = ""; - my $speed_sup = ""; - my $speed_enable = ""; - my $width = ""; - my $width_sup = ""; - my $width_enable = ""; - my $state = ""; - my $hoq_life = ""; - my $vl_stall = ""; - my $phy_link_state = ""; - - foreach my $line (@lines) { - if ($line =~ /^LinkSpeedActive:\.+(.*)/) { $speed = $1; } - if ($line =~ /^LinkSpeedEnabled:\.+(.*)/) { - $speed_enable = $1; - } - if ($line =~ /^LinkSpeedSupported:\.+(.*)/) { $speed_sup = $1; } - if ($line =~ /^LinkWidthActive:\.+(.*)/) { $width = $1; } - if ($line =~ /^LinkWidthEnabled:\.+(.*)/) { - $width_enable = $1; - } - if ($line =~ /^LinkWidthSupported:\.+(.*)/) { $width_sup = $1; } - if ($line =~ /^LinkState:\.+(.*)/) { $state = $1; } - if ($line =~ /^HoqLife:\.+(.*)/) { $hoq_life = $1; } - if ($line =~ /^VLStallCount:\.+(.*)/) { $vl_stall = $1; } - if ($line =~ /^PhysLinkState:\.+(.*)/) { $phy_link_state = $1; } - } - my $rem_port = $hr->{rem_port}; - my $rem_lid = $hr->{rem_lid}; - my $rem_speed_sup = ""; - my $rem_speed_enable = ""; - my $rem_width_sup = ""; - my $rem_width_enable = ""; - if ($rem_lid ne "" && $rem_port ne "") { - $data = - `smpquery $extra_smpquery_params portinfo 
$rem_lid $rem_port`; - if ($data eq "") { - printf( - "ERROR: failed to get portinfo for $switch port $port\n" - ); - } - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^LinkSpeedEnabled:\.+(.*)/) { - $rem_speed_enable = $1; - } - if ($line =~ /^LinkSpeedSupported:\.+(.*)/) { - $rem_speed_sup = $1; - } - if ($line =~ /^LinkWidthEnabled:\.+(.*)/) { - $rem_width_enable = $1; - } - if ($line =~ /^LinkWidthSupported:\.+(.*)/) { - $rem_width_sup = $1; - } - } - } - my $capabilities = ""; - if ($print_extended_cap) { - $capabilities = sprintf("(%3s %s %6s / %8s [%s/%s][%s/%s])", - $width, $speed, $state, $phy_link_state, $width_enable, - $width_sup, $speed_enable, $speed_sup); - } else { - $capabilities = sprintf("(%3s %s %6s / %8s)", - $width, $speed, $state, $phy_link_state); - } - if ($print_add_switch) { - $port_timeouts = - sprintf(" (HOQ:%s VL_Stall:%s)", $hoq_life, $vl_stall); - } - if (!$only_down_links || ($only_down_links && $state eq "Down")) { - my $width_msg = ""; - my $speed_msg = ""; - if ($rem_width_enable ne "" && $rem_width_sup ne "") { - if ( $width_enable =~ /12X/ - && $rem_width_enable =~ /12X/ - && $width !~ /12X/) - { - $width_msg = "Could be 12X"; - } else { - if ( $width_enable =~ /8X/ - && $rem_width_enable =~ /8X/ - && $width !~ /8X/) - { - $width_msg = "Could be 8X"; - } else { - if ( $width_enable =~ /4X/ - && $rem_width_enable =~ /4X/ - && $width !~ /4X/) - { - $width_msg = "Could be 4X"; - } - } - } - } - if ($rem_speed_enable ne "" && $rem_speed_sup ne "") { - if ( $speed_enable =~ /10\.0/ - && $rem_speed_enable =~ /10\.0/ - && $speed !~ /10\.0/) - { - $speed_msg = "Could be 10.0 Gbps"; - } else { - if ( $speed_enable =~ /5\.0/ - && $rem_speed_enable =~ /5\.0/ - && $speed !~ /5\.0/) - { - $speed_msg = "Could be 5.0 Gbps"; - } - } - } - - if ($line_mode) { - my $line_begin = sprintf("%18s \"%30s\"%s", - $switch, $hr->{loc_desc}, $pkt_life_prompt); - my $ext_guid = sprintf("%18s", $hr->{rem_guid}); - if 
($print_port_guids && $hr->{rem_port_guid} ne "") { - $ext_guid = sprintf("0x%016s", $hr->{rem_port_guid}); - } - push( - @output_lines, - sprintf( -"%s %6s %4s[%2s] ==%s%s==> %18s %6s %4s[%2s] \"%s\" ( %s %s)\n", - $line_begin, $hr->{loc_sw_lid}, - $port, $hr->{loc_ext_port}, - $capabilities, $port_timeouts, - $ext_guid, $hr->{rem_lid}, - $hr->{rem_port}, $hr->{rem_ext_port}, - $hr->{rem_desc}, $width_msg, - $speed_msg - ) - ); - } else { - push( - @output_lines, - sprintf( -" %6s %4s[%2s] ==%s%s==> %6s %4s[%2s] \"%s\" ( %s %s)\n", - $hr->{loc_sw_lid}, $port, - $hr->{loc_ext_port}, $capabilities, - $port_timeouts, $hr->{rem_lid}, - $hr->{rem_port}, $hr->{rem_ext_port}, - $hr->{rem_desc}, $width_msg, - $speed_msg - ) - ); - } - $print_switch = "yes"; - } - } - if ($print_switch eq "yes") { - foreach my $line (@output_lines) { print $line; } - } - } - if ($single_switch && $switch_found ne "yes") { - printf("Switch \"%s\" not found.\n", $single_switch); - } -} -main; - diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c new file mode 100644 index 0000000..c8f224e --- /dev/null +++ b/infiniband-diags/src/iblinkinfo.c @@ -0,0 +1,423 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +char *argv0 = "iblinkinfotest"; +static FILE *f; + +static char *node_name_map_file = NULL; +static nn_map_t *node_name_map = NULL; + +static int timeout_ms = 500; + +static int down_links_only = 0; +static int line_mode = 0; +static int add_sw_settings = 0; +static int print_port_guids = 0; +static int old_output = 0; + +static unsigned int +get_max(unsigned int num) +{ + unsigned int v = num; // 32-bit word to find the log base 2 of + unsigned r = 0; // r will be lg(v) + + while (v >>= 1) // unroll for more speed... + { + r++; + } + + return (1 << r); +} + +static char * +get_linkspeed_str(int link_speed) +{ + return (ibnd_linkspeed_str(link_speed, old_output)); +} + +void +get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port) +{ + int max_speed = 0; + + int max_width = get_max(port->info.link_width_supported + & port->remoteport->info.link_width_supported); + if ((max_width & port->info.link_width_active) == 0) { + // we are not at the max supported width + // print what we could be at. 
+ snprintf(width_msg, msg_size, "Could be %s", + ibnd_linkwidth_str(max_width)); + } + + max_speed = get_max(port->info.link_speed_supported + & port->remoteport->info.link_speed_supported); + if ((max_speed & port->info.link_speed_active) == 0) { + // we are not at the max supported speed + // print what we could be at. + snprintf(speed_msg, msg_size, "Could be %s", + get_linkspeed_str(max_speed)); + } +} + +void +print_port(ibnd_node_t *node, ibnd_port_t *port) +{ + static char remote_guid_str[256]; + static char remote_str[256]; + static char link_str[256]; + static char width_msg[256]; + static char speed_msg[256]; + static char ext_port_str[256]; + static char loc_sma_lid[16]; + + if (!port) + return; + + remote_guid_str[0] = '\0'; + remote_str[0] = '\0'; + link_str[0] = '\0'; + width_msg[0] = '\0'; + speed_msg[0] = '\0'; + + snprintf(loc_sma_lid, 16, "%d", node->smalid); + if (port->remoteport) { + static char remote_name_buf[256]; + strncpy(remote_name_buf, port->remoteport->node->nodedesc, 256); + + if (port->remoteport->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum); + else + ext_port_str[0] = '\0'; + + get_msg(width_msg, speed_msg, 256, port); + if (line_mode) { + if (print_port_guids) { + snprintf(remote_guid_str, 256, + "0x%016"PRIx64" ", + port->remoteport->guid); + } else { + snprintf(remote_guid_str, 256, + "0x%016"PRIx64" ", + port->remoteport->node->info.nodeguid); + } + } + + snprintf(remote_str, 256, + "%s%6d %4d[%2s] \"%s\" ( %s %s)\n", + remote_guid_str, + port->remoteport->info.base_lid ? 
+ port->remoteport->info.base_lid : + port->remoteport->node->smalid, + port->remoteport->portnum, + ext_port_str, + remap_node_name(node_name_map, + port->remoteport->node->info.nodeguid, + remote_name_buf), + width_msg, + speed_msg + ); + } else { + snprintf(remote_str, 256, + "%19s%6s %4s[%2s] \"\" ( )\n", "", "", "", ""); + if (old_output) { + loc_sma_lid[0] = '\0'; + } + } + + + if (add_sw_settings) { + snprintf(link_str, 256, + "(%3s %s %6s / %8s) (HOQ:%d VL_Stall:%d)", + ibnd_linkwidth_str(port->info.link_width_active), + get_linkspeed_str(port->info.link_speed_active), + ibnd_linkstate_str(port->info.port_state), + ibnd_physstate_str(port->info.phys_state), + port->info.hoq_lifetime, + port->info.vl_stall_count + ); + } else { + snprintf(link_str, 256, + "(%3s %s %6s / %8s)", + ibnd_linkwidth_str(port->info.link_width_active), + get_linkspeed_str(port->info.link_speed_active), + ibnd_linkstate_str(port->info.port_state), + ibnd_physstate_str(port->info.phys_state) + ); + } + + if (port->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->ext_portnum); + else + ext_port_str[0] = '\0'; + + if (line_mode) { + static char name_buf[256]; + char *node_name = ""; + + if (old_output && (!port->remoteport)) { + node_name = ""; + } else { + strncpy(name_buf, node->nodedesc, 256); + node_name = remap_node_name(node_name_map, + node->info.nodeguid, + name_buf); + } + + printf("0x%016"PRIx64" \"%30s\" %6s %4d[%2s] ==%s==> %s", + node->info.nodeguid, + node_name, + loc_sma_lid, port->portnum, + ext_port_str, + link_str, + remote_str + ); + } else { + printf(" %6s %4d[%2s] ==%s==> %s", + loc_sma_lid, port->portnum, + ext_port_str, + link_str, + remote_str + ); + } +} + +void +print_switch(ibnd_node_t *node, void *user_data) +{ + int i = 0; + + if (!line_mode) { + char name_buf[256]; + strncpy(name_buf, node->nodedesc, 256); + printf("Switch 0x%016"PRIx64" %s:\n", + node->info.nodeguid, + remap_node_name(node_name_map, + node->info.nodeguid, + name_buf)); + } + + for (i 
= 1; i <= node->info.numports; i++) { + ibnd_port_t *port = node->ports[i]; + if (!port) + continue; + if (!down_links_only || port->info.port_state == IBND_LINK_DOWN) { + print_port(node, port); + } + } +} + +void +usage(void) +{ + fprintf(stderr, + "Usage: %s [-hclp -S -D -C -P ]\n" + " Report link speed and connection for each port of each switch which is active\n" + " -h This help message\n" + " -S output only the node specified by guid\n" + " -D print only node specified by \n" + " -f specify node to start \"from\"\n" + " -n Number of hops to include away from specified node\n" + " -d print only down links\n" + " -l (line mode) print all information for each link on each line\n" + " -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n" + + + " -t timeout for any single fabric query\n" + " -s show progress during scan\n" + " --node-name-map use specified node name map\n" + + " -C use selected Channel Adaptor name for queries\n" + " -P use selected channel adaptor port for queries\n" + " -g print port guids instead of node guids\n" + " --debug print debug messages\n" + " -R (this option is obsolete and does nothing)\n" + , + argv0); + exit(-1); +} + +int +main(int argc, char **argv) +{ + char *ca = 0; + int ca_port = 0; + ibnd_fabric_t *fabric = NULL; + uint64_t guid = 0; + char *dr_path = NULL; + char *from = NULL; + int hops = 0; + ib_portid_t port_id; + + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; + static const struct option long_opts[] = { + { "S", 1, 0, 'S'}, + { "D", 1, 0, 'D'}, + { "num-hops", 1, 0, 'n'}, + { "down-links-only", 0, 0, 'd'}, + { "line-mode", 0, 0, 'l'}, + { "ca-name", 1, 0, 'C'}, + { "ca-port", 1, 0, 'P'}, + { "timeout", 1, 0, 't'}, + { "show", 0, 0, 's'}, + { "print-port-guids", 0, 0, 'g'}, + { "print-additional", 0, 0, 'p'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { "node-name-map", 1, 0, 1}, + { "debug", 0, 0, 2}, + { "compat", 0, 0, 3}, + { "from", 1, 0, 'f'}, + { "R", 0, 0, 'R'}, + { } + }; 
+ + f = stdout; + + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 1: + node_name_map_file = strdup(optarg); + break; + case 2: + ibnd_debug(1); + break; + case 3: + old_output = 1; + break; + case 'f': + from = strdup(optarg); + break; + case 'C': + ca = strdup(optarg); + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'D': + dr_path = strdup(optarg); + break; + case 'n': + hops = (int)strtol(optarg, NULL, 0); + break; + case 'd': + down_links_only = 1; + break; + case 'l': + line_mode = 1; + break; + case 't': + timeout_ms = strtoul(optarg, 0, 0); + break; + case 's': + ibnd_show_progress(1); + break; + case 'g': + print_port_guids = 1; + break; + case 'S': + guid = (uint64_t)strtoull(optarg, 0, 0); + break; + case 'p': + add_sw_settings = 1; + break; + case 'R': + /* GNDN */ + break; + default: + usage(); + break; + } + } + argc -= optind; + argv += optind; + + if (argc && !(f = fopen(argv[0], "w"))) + fprintf(stderr, "can't open file %s for writing\n", argv[0]); + + node_name_map = open_node_name_map(node_name_map_file); + + if (from) { + /* only scan part of the fabric */ + str2drpath(&(port_id.drpath), from, 0, 0); + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + guid = 0; + } else { + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + } + + if (guid) { + ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid); + if (sw) print_switch(sw, NULL); /* skip when the lookup finds no node */ + } else if (dr_path) { + ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path); + if (sw) print_switch(sw, NULL); /* skip when the lookup finds no node */ + } else { + ibnd_iter_nodes_type(fabric, print_switch, IBND_SWITCH_NODE, NULL); + } + + ibnd_destroy_fabric(fabric); + + close_node_name_map(node_name_map); + exit(0); +} -- 1.5.4.5 From weiny2 at llnl.gov Fri Jan 9 15:47:59
2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 9 Jan 2009 15:47:59 -0800
Subject: [ofa-general] [PATCH 3/3 - no ibcommon] Convert ibnetdiscover to use new ibnetdisc library.
Message-ID: <20090109154759.3f5d97b2.weiny2@llnl.gov>

>From 71940caa935c4757b0b4d4368ecafd34eb1d6dc1 Mon Sep 17 00:00:00 2001
From: Ira Weiny
Date: Tue, 2 Dec 2008 16:29:29 -0800
Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library.

Removed -e and -v since they were somewhat redundant with the -d option.
All other functionality is preserved

Signed-off-by: weiny2 at llnl.gov
---
 infiniband-diags/Makefile.am | 4 +-
 infiniband-diags/include/grouping.h | 113 ----
 infiniband-diags/man/ibnetdiscover.8 | 10 +-
 infiniband-diags/scripts/dump_lfts.sh | 2 +-
 infiniband-diags/scripts/dump_mfts.sh | 2 +-
 infiniband-diags/src/grouping.c | 786 --------------------------
 infiniband-diags/src/ibnetdiscover.c | 984 +++++++++++---------------------
 7 files changed, 315 insertions(+), 1586 deletions(-)
 delete mode 100644 infiniband-diags/include/grouping.h
 delete mode 100644 infiniband-diags/src/grouping.c

diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
index d127a4d..2ccf082 100644
--- a/infiniband-diags/Makefile.am
+++ b/infiniband-diags/Makefile.am
@@ -37,9 +37,9 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \
 src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c
 src_ibaddr_CFLAGS = -Wall $(DBGFLAGS)

-src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c
+src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/ibdiag_common.c
 src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS)
-src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir)
+src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -L$(srcdir)/libibnetdisc -libnetdisc

 src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c
 src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS)
diff --git a/infiniband-diags/include/grouping.h b/infiniband-diags/include/grouping.h
deleted file mode
100644 index e54efef..0000000 --- a/infiniband-diags/include/grouping.h +++ /dev/null @@ -1,113 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. - * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - */ - -#ifndef _GROUPING_H_ -#define _GROUPING_H_ - -/*========================================================*/ -/* FABRIC SCANNER SPECIFIC DATA */ -/*========================================================*/ - -#define SPINES_MAX_NUM 12 -#define LINES_MAX_NUM 36 - -typedef struct ChassisList ChassisList; -typedef struct AllChassisList AllChassisList; - -struct ChassisList { - ChassisList *next; - uint64_t chassisguid; - int chassisnum; - int chassistype; - int nodecount; /* used for grouping by SystemImageGUID */ - Node *spinenode[SPINES_MAX_NUM + 1]; - Node *linenode[LINES_MAX_NUM + 1]; -}; - -struct AllChassisList { - ChassisList *first; - ChassisList *current; - ChassisList *last; -}; - -/*========================================================*/ -/* CHASSIS RECOGNITION SPECIFIC DATA */ -/*========================================================*/ - -/* Device IDs */ -#define VTR_DEVID_IB_FC_ROUTER 0x5a00 -#define VTR_DEVID_IB_IP_ROUTER 0x5a01 -#define VTR_DEVID_ISR9600_SPINE 0x5a02 -#define VTR_DEVID_ISR9600_LEAF 0x5a03 -#define VTR_DEVID_HCA1 0x5a04 -#define VTR_DEVID_HCA2 0x5a44 -#define VTR_DEVID_HCA3 0x6278 -#define VTR_DEVID_SW_6IB4 0x5a05 -#define VTR_DEVID_ISR9024 0x5a06 -#define VTR_DEVID_ISR9288 0x5a07 -#define VTR_DEVID_SLB24 0x5a09 -#define VTR_DEVID_SFB12 0x5a08 -#define VTR_DEVID_SFB4 0x5a0b -#define VTR_DEVID_ISR9024_12 0x5a0c -#define VTR_DEVID_SLB8 0x5a0d -#define VTR_DEVID_RLX_SWITCH_BLADE 0x5a20 -#define VTR_DEVID_ISR9024_DDR 0x5a31 -#define VTR_DEVID_SFB12_DDR 0x5a32 -#define VTR_DEVID_SFB4_DDR 0x5a33 -#define VTR_DEVID_SLB24_DDR 0x5a34 -#define VTR_DEVID_SFB2012 0x5a37 -#define VTR_DEVID_SLB2024 0x5a38 -#define VTR_DEVID_ISR2012 0x5a39 -#define VTR_DEVID_SFB2004 0x5a40 -#define VTR_DEVID_ISR2004 0x5a41 -#define VTR_DEVID_SRB2004 0x5a42 - -enum ChassisType { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT }; -enum ChassisSlot { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; - 
-/*========================================================*/ -/* External interface */ -/*========================================================*/ - -ChassisList *group_nodes(); -char *portmapstring(Port *port); -char *get_chassis_type(unsigned char chassistype); -char *get_chassis_slot(unsigned char chassisslot); -uint64_t get_chassis_guid(unsigned char chassisnum); - -int is_xsigo_guid(uint64_t guid); -int is_xsigo_tca(uint64_t guid); -int is_xsigo_hca(uint64_t guid); - -#endif /* _GROUPING_H_ */ diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index 958efa9..768d392 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -5,7 +5,7 @@ ibnetdiscover \- discover InfiniBand topology .SH SYNOPSIS .B ibnetdiscover -[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] +[\-d(ebug)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] .SH DESCRIPTION .PP @@ -37,7 +37,7 @@ List of connected switches List of connected routers .TP \fB\-s\fR, \fB\-\-show\fR -Show more information +Show progress information during discovery. .TP \fB\-\-node\-name\-map\fR Specify a node name map. The node name map file maps GUIDs to more user friendly @@ -57,15 +57,9 @@ using the util_name -h syntax. # Debugging flags .PP \-d raise the IB debugging level. - May be used several times (-ddd or -d -d -d). -.PP -\-e show send and receive errors (timeouts and others) .PP \-h show the usage message .PP -\-v increase the application verbosity level. - May be used several times (-vv or -v -v -v) -.PP \-V show the version info. 
 # Other common flags:
diff --git a/infiniband-diags/scripts/dump_lfts.sh b/infiniband-diags/scripts/dump_lfts.sh
index ebca705..9d6a986 100755
--- a/infiniband-diags/scripts/dump_lfts.sh
+++ b/infiniband-diags/scripts/dump_lfts.sh
@@ -22,7 +22,7 @@ done

 dump_by_dr_path ()
 {
-for sw_dr in `ibnetdiscover $ca_info -v \
+for sw_dr in `ibnetdiscover $ca_info -s \
 | sed -ne '/^DR path .* switch /s/^DR path \([,|0-9]\+\) ->.*{\([0-9|a-f]\+\)}.*$/\2 \1/p' \
 | sort -u \
 | awk 'BEGIN {guid=0;} {if ($1 != guid) { guid=$1; print $2; }}'` ; do
diff --git a/infiniband-diags/scripts/dump_mfts.sh b/infiniband-diags/scripts/dump_mfts.sh
index 39fc5fb..cef6ad3 100755
--- a/infiniband-diags/scripts/dump_mfts.sh
+++ b/infiniband-diags/scripts/dump_mfts.sh
@@ -22,7 +22,7 @@ done

 dump_by_dr_path ()
 {
-for sw_dr in `ibnetdiscover $ca_info -v \
+for sw_dr in `ibnetdiscover $ca_info -s \
 | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
 | sed -e 's/\]\[/,/g' \
 | sort -u` ; do
diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c
deleted file mode 100644
index 94ab859..0000000
--- a/infiniband-diags/src/grouping.c
+++ /dev/null
@@ -1,786 +0,0 @@
-/*
- * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved.
- * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses. You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- * Redistribution and use in source and binary forms, with or
- * without modification, are permitted provided that the following
- * conditions are met:
- *
- * - Redistributions of source code must retain the above
- * copyright notice, this list of conditions and the following
- * disclaimer.
- * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/*========================================================*/ -/* FABRIC SCANNER SPECIFIC DATA */ -/*========================================================*/ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include - -#include - -#include "ibnetdiscover.h" -#include "grouping.h" - -#define OUT_BUFFER_SIZE 16 - - -extern Node *nodesdist[MAXHOPS+1]; /* last is CA list */ -extern Node *mynode; -extern Port *myport; -extern int maxhops_discovered; - -AllChassisList mylist; - -char *ChassisTypeStr[5] = { "", "ISR9288", "ISR9096", "ISR2012", "ISR2004" }; -char *ChassisSlotStr[4] = { "", "Line", "Spine", "SRBD" }; - - -char *get_chassis_type(unsigned char chassistype) -{ - if (chassistype == UNRESOLVED_CT || chassistype > ISR2004_CT) - return NULL; - return ChassisTypeStr[chassistype]; -} - -char *get_chassis_slot(unsigned char chassisslot) -{ - if (chassisslot == UNRESOLVED_CS || chassisslot > SRBD_CS) - return NULL; - return ChassisSlotStr[chassisslot]; -} - -static struct ChassisList *find_chassisnum(unsigned char chassisnum) -{ - ChassisList *current; - - for (current = mylist.first; current; current = current->next) { - if (current->chassisnum == chassisnum) - return current; - } - - return NULL; -} - -static 
uint64_t topspin_chassisguid(uint64_t guid) -{ - /* Byte 3 in system image GUID is chassis type, and */ - /* Byte 4 is location ID (slot) so just mask off byte 4 */ - return guid & 0xffffffff00ffffffULL; -} - -int is_xsigo_guid(uint64_t guid) -{ - if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) - return 1; - else - return 0; -} - -static int is_xsigo_leafone(uint64_t guid) -{ - if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) - return 1; - else - return 0; -} - -int is_xsigo_hca(uint64_t guid) -{ - /* NodeType 2 is HCA */ - if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) - return 1; - else - return 0; -} - -int is_xsigo_tca(uint64_t guid) -{ - /* NodeType 3 is TCA */ - if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) - return 1; - else - return 0; -} - -static int is_xsigo_ca(uint64_t guid) -{ - if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) - return 1; - else - return 0; -} - -static int is_xsigo_switch(uint64_t guid) -{ - if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) - return 1; - else - return 0; -} - -static uint64_t xsigo_chassisguid(Node *node) -{ - if (!is_xsigo_ca(node->sysimgguid)) { - /* Byte 3 is NodeType and byte 4 is PortType */ - /* If NodeType is 1 (switch), PortType is masked */ - if (is_xsigo_switch(node->sysimgguid)) - return node->sysimgguid & 0xffffffff00ffffffULL; - else - return node->sysimgguid; - } else { - /* Is there a peer port ? 
*/ - if (!node->ports->remoteport) - return node->sysimgguid; - - /* If peer port is Leaf 1, use its chassis GUID */ - if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) - return node->ports->remoteport->node->sysimgguid & - 0xffffffff00ffffffULL; - else - return node->sysimgguid; - } -} - -static uint64_t get_chassisguid(Node *node) -{ - if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) - return topspin_chassisguid(node->sysimgguid); - else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) - return xsigo_chassisguid(node); - else - return node->sysimgguid; -} - -static struct ChassisList *find_chassisguid(Node *node) -{ - ChassisList *current; - uint64_t chguid; - - chguid = get_chassisguid(node); - for (current = mylist.first; current; current = current->next) { - if (current->chassisguid == chguid) - return current; - } - - return NULL; -} - -uint64_t get_chassis_guid(unsigned char chassisnum) -{ - ChassisList *chassis; - - chassis = find_chassisnum(chassisnum); - if (chassis) - return chassis->chassisguid; - else - return 0; -} - -static int is_router(Node *node) -{ - return (node->devid == VTR_DEVID_IB_FC_ROUTER || - node->devid == VTR_DEVID_IB_IP_ROUTER); -} - -static int is_spine_9096(Node *node) -{ - return (node->devid == VTR_DEVID_SFB4 || - node->devid == VTR_DEVID_SFB4_DDR); -} - -static int is_spine_9288(Node *node) -{ - return (node->devid == VTR_DEVID_SFB12 || - node->devid == VTR_DEVID_SFB12_DDR); -} - -static int is_spine_2004(Node *node) -{ - return (node->devid == VTR_DEVID_SFB2004); -} - -static int is_spine_2012(Node *node) -{ - return (node->devid == VTR_DEVID_SFB2012); -} - -static int is_spine(Node *node) -{ - return (is_spine_9096(node) || is_spine_9288(node) || - is_spine_2004(node) || is_spine_2012(node)); -} - -static int is_line_24(Node *node) -{ - return (node->devid == VTR_DEVID_SLB24 || - node->devid == VTR_DEVID_SLB24_DDR || - node->devid == VTR_DEVID_SRB2004); -} - -static int 
is_line_8(Node *node) -{ - return (node->devid == VTR_DEVID_SLB8); -} - -static int is_line_2024(Node *node) -{ - return (node->devid == VTR_DEVID_SLB2024); -} - -static int is_line(Node *node) -{ - return (is_line_24(node) || is_line_8(node) || is_line_2024(node)); -} - -int is_chassis_switch(Node *node) -{ - return (is_spine(node) || is_line(node)); -} - -/* these structs help find Line (Anafa) slot number while using spine portnum */ -int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; -int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; -int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; -int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; - -/* IPR FCR modules connectivity while using sFB4 port as reference */ -int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; - -/* these structs help find Spine (Anafa) slot number while using spine portnum */ -int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ - -static void get_sfb_slot(Node *node, Port *lineport) -{ - ChassisRecord *ch = node->chrecord; - - ch->chassisslot = SPINE_CS; - if (is_spine_9096(node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = spine4_slot_2_slb[lineport->portnum]; - ch->anafanum = 
anafa_spine4_slot_2_slb[lineport->portnum]; - } else if (is_spine_9288(node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = spine12_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; - } else if (is_spine_2012(node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = spine12_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; - } else if (is_spine_2004(node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = spine4_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, node->nodeguid); - } -} - -static void get_router_slot(Node *node, Port *spineport) -{ - ChassisRecord *ch = node->chrecord; - int guessnum = 0; - - if (!ch) { - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - ch = node->chrecord; - } - - ch->chassisslot = SRBD_CS; - if (is_spine_9096(spineport->node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else if (is_spine_9288(spineport->node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - /* this is a smart guess based on nodeguids order on sFB-12 module */ - guessnum = spineport->node->nodeguid % 4; - /* module 1 <--> remote anafa 3 */ - /* module 2 <--> remote anafa 2 */ - /* module 3 <--> remote anafa 1 */ - ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); - } else if (is_spine_2012(spineport->node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - /* this is a smart guess based on nodeguids order on sFB-12 module */ - guessnum = spineport->node->nodeguid % 4; - // module 1 <--> remote anafa 3 - // module 2 <--> remote anafa 2 - // module 3 <--> remote anafa 1 - ch->anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 
3 : 2)); - } else if (is_spine_2004(spineport->node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, spineport->node->nodeguid); - } -} - -static void get_slb_slot(ChassisRecord *ch, Port *spineport) -{ - ch->chassisslot = LINE_CS; - if (is_spine_9096(spineport->node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else if (is_spine_9288(spineport->node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2012(spineport->node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2004(spineport->node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, spineport->node->nodeguid); - } -} - -/* - This function called for every Voltaire node in fabric - It could be optimized so, but time overhead is very small - and its only diag.util -*/ -static void fill_chassis_record(Node *node) -{ - Port *port; - Node *remnode = 0; - ChassisRecord *ch = 0; - - if (node->chrecord) /* somehow this node has already been passed */ - return; - - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - - ch = node->chrecord; - - /* node is router only in case of using unique lid */ - /* (which is lid of chassis router port) */ - /* in such case node->ports is actually a requested port... 
*/ - if (is_router(node) && is_spine(node->ports->remoteport->node)) - get_router_slot(node, node->ports->remoteport); - else if (is_spine(node)) { - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - if (remnode->type != SWITCH_NODE) { - if (!remnode->chrecord) - get_router_slot(remnode, port); - continue; - } - if (!ch->chassistype) - /* we assume here that remoteport belongs to line */ - get_sfb_slot(node, port->remoteport); - - /* we could break here, but need to find if more routers connected */ - } - - } else if (is_line(node)) { - for (port = node->ports; port; port = port->next) { - if (port->portnum > 12) - continue; - if (!port->remoteport) - continue; - /* we assume here that remoteport belongs to spine */ - get_slb_slot(ch, port->remoteport); - break; - } - } - - return; -} - -static int get_line_index(Node *node) -{ - int retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - - if (retval > LINES_MAX_NUM || retval < 1) - IBPANIC("Internal error"); - return retval; -} - -static int get_spine_index(Node *node) -{ - int retval; - - if (is_spine_9288(node) || is_spine_2012(node)) - retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - else - retval = node->chrecord->slotnum; - - if (retval > SPINES_MAX_NUM || retval < 1) - IBPANIC("Internal error"); - return retval; -} - -static void insert_line_router(Node *node, ChassisList *chassislist) -{ - int i = get_line_index(node); - - if (chassislist->linenode[i]) - return; /* already filled slot */ - - chassislist->linenode[i] = node; - node->chrecord->chassisnum = chassislist->chassisnum; -} - -static void insert_spine(Node *node, ChassisList *chassislist) -{ - int i = get_spine_index(node); - - if (chassislist->spinenode[i]) - return; /* already filled slot */ - - chassislist->spinenode[i] = node; - node->chrecord->chassisnum = chassislist->chassisnum; -} - -static void 
pass_on_lines_catch_spines(ChassisList *chassislist) -{ - Node *node, *remnode; - Port *port; - int i; - - for (i = 1; i <= LINES_MAX_NUM; i++) { - node = chassislist->linenode[i]; - - if (!(node && is_line(node))) - continue; /* empty slot or router */ - - for (port = node->ports; port; port = port->next) { - if (port->portnum > 12) - continue; - - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - - if (!remnode->chrecord) - continue; /* some error - spine not initialized ? FIXME */ - insert_spine(remnode, chassislist); - } - } -} - -static void pass_on_spines_catch_lines(ChassisList *chassislist) -{ - Node *node, *remnode; - Port *port; - int i; - - for (i = 1; i <= SPINES_MAX_NUM; i++) { - node = chassislist->spinenode[i]; - if (!node) - continue; /* empty slot */ - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - - if (!remnode->chrecord) - continue; /* some error - line/router not initialized ? FIXME */ - insert_line_router(remnode, chassislist); - } - } -} - -/* - Stupid interpolation algorithm... 
- But nothing to do - have to be compliant with VoltaireSM/NMS -*/ -static void pass_on_spines_interpolate_chguid(ChassisList *chassislist) -{ - Node *node; - int i; - - for (i = 1; i <= SPINES_MAX_NUM; i++) { - node = chassislist->spinenode[i]; - if (!node) - continue; /* skip the empty slots */ - - /* take first guid minus one to be consistent with SM */ - chassislist->chassisguid = node->nodeguid - 1; - break; - } -} - -/* - This function fills chassislist structure with all nodes - in that chassis - chassislist structure = structure of one standalone chassis -*/ -static void build_chassis(Node *node, ChassisList *chassislist) -{ - Node *remnode = 0; - Port *port = 0; - - /* we get here with node = chassis_spine */ - chassislist->chassistype = node->chrecord->chassistype; - insert_spine(node, chassislist); - - /* loop: pass on all ports of node */ - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - - if (!remnode->chrecord) - continue; /* some error - line or router not initialized ? FIXME */ - - insert_line_router(remnode, chassislist); - } - - pass_on_lines_catch_spines(chassislist); - /* this pass needed for to catch routers, since routers connected only */ - /* to spines in slot 1 or 4 and we could miss them first time */ - pass_on_spines_catch_lines(chassislist); - - /* additional 2 passes needed for to overcome a problem of pure "in-chassis" */ - /* connectivity - extra pass to ensure that all related chips/modules */ - /* inserted into the chassislist */ - pass_on_lines_catch_spines(chassislist); - pass_on_spines_catch_lines(chassislist); - pass_on_spines_interpolate_chguid(chassislist); -} - -/*========================================================*/ -/* INTERNAL TO EXTERNAL PORT MAPPING */ -/*========================================================*/ - -/* -Description : On ISR9288/9096 external ports indexing - is not matching the internal ( anafa ) port - indexes. 
Use this MAP to translate the data you get from - the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.) - - -Module : sLB-24 - anafa 1 anafa 2 -ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 -int port | 22 23 24 18 17 16 | 22 23 24 18 17 16 -ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 -int port | 19 20 21 15 14 13 | 19 20 21 15 14 13 ------------------------------------------------- - -Module : sLB-8 - anafa 1 anafa 2 -ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 -int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 -ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 -int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 - ------------> - anafa 1 anafa 2 -ext port | - - 5 - - 6 | - - 7 - - 8 -int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 -ext port | - - 1 - - 2 | - - 3 - - 4 -int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 ------------------------------------------------- - -Module : sLB-2024 - -ext port | 13 14 15 16 17 18 19 20 21 22 23 24 -A1 int port| 13 14 15 16 17 18 19 20 21 22 23 24 -ext port | 1 2 3 4 5 6 7 8 9 10 11 12 -A2 int port| 13 14 15 16 17 18 19 20 21 22 23 24 ---------------------------------------------------- - -*/ - -int int2ext_map_slb24[2][25] = { - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 }, - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 } - }; -int int2ext_map_slb8[2][25] = { - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 6, 6, 6, 1, 1, 1, 5, 5, 5 }, - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 8, 8, 8, 3, 3, 3, 7, 7, 7 } - }; -int int2ext_map_slb2024[2][25] = { - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }, - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } - }; -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ - -/* - This function relevant only for line modules/chips - Returns string with external port index -*/ 
-char *portmapstring(Port *port) -{ - static char mapping[OUT_BUFFER_SIZE]; - ChassisRecord *ch = port->node->chrecord; - int portnum = port->portnum; - int chipnum = 0; - int pindex = 0; - Node *node = port->node; - - if (!ch || !is_line(node) || (portnum < 13 || portnum > 24)) - return NULL; - - if (ch->anafanum < 1 || ch->anafanum > 2) - return NULL; - - memset(mapping, 0, sizeof(mapping)); - - chipnum = ch->anafanum - 1; - - if (is_line_24(node)) - pindex = int2ext_map_slb24[chipnum][portnum]; - else if (is_line_2024(node)) - pindex = int2ext_map_slb2024[chipnum][portnum]; - else - pindex = int2ext_map_slb8[chipnum][portnum]; - - sprintf(mapping, "[ext %d]", pindex); - - return mapping; -} - -static void add_chassislist() -{ - if (!(mylist.current = calloc(1, sizeof(ChassisList)))) - IBPANIC("out of mem"); - - if (mylist.first == NULL) { - mylist.first = mylist.current; - mylist.last = mylist.current; - } else { - mylist.last->next = mylist.current; - mylist.current->next = NULL; - mylist.last = mylist.current; - } -} - -/* - Main grouping function - Algorithm: - 1. pass on every Voltaire node - 2. catch spine chip for every Voltaire node - 2.1 build/interpolate chassis around this chip - 2.2 go to 1. - 3. pass on non Voltaire nodes (SystemImageGUID based grouping) - 4. now group non Voltaire nodes by SystemImageGUID -*/ -ChassisList *group_nodes() -{ - Node *node; - int dist; - int chassisnum = 0; - struct ChassisList *chassis; - - mylist.first = NULL; - mylist.current = NULL; - mylist.last = NULL; - - /* first pass on switches and build for every Voltaire node */ - /* an appropriate chassis record (slotnum and position) */ - /* according to internal connectivity */ - /* not very efficient but clear code so... 
*/ - for (dist = 0; dist <= maxhops_discovered; dist++) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid == VTR_VENDOR_ID) - fill_chassis_record(node); - } - } - - /* separate every Voltaire chassis from each other and build linked list of them */ - /* algorithm: catch spine and find all surrounding nodes */ - for (dist = 0; dist <= maxhops_discovered; dist++) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid != VTR_VENDOR_ID) - continue; - if (!node->chrecord || node->chrecord->chassisnum || !is_spine(node)) - continue; - add_chassislist(); - mylist.current->chassisnum = ++chassisnum; - build_chassis(node, mylist.current); - } - } - - /* now make pass on nodes for chassis which are not Voltaire */ - /* grouped by common SystemImageGUID */ - for (dist = 0; dist <= maxhops_discovered; dist++) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid == VTR_VENDOR_ID) - continue; - if (node->sysimgguid) { - chassis = find_chassisguid(node); - if (chassis) - chassis->nodecount++; - else { - /* Possible new chassis */ - add_chassislist(); - mylist.current->chassisguid = get_chassisguid(node); - mylist.current->nodecount = 1; - } - } - } - } - - /* now, make another pass to see which nodes are part of chassis */ - /* (defined as chassis->nodecount > 1) */ - for (dist = 0; dist <= MAXHOPS; ) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid == VTR_VENDOR_ID) - continue; - if (node->sysimgguid) { - chassis = find_chassisguid(node); - if (chassis && chassis->nodecount > 1) { - if (!chassis->chassisnum) - chassis->chassisnum = ++chassisnum; - if (!node->chrecord) { - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - node->chrecord->chassisnum = chassis->chassisnum; - } - } - } - } - if (dist == maxhops_discovered) - dist = MAXHOPS; /* skip to CAs */ - else - dist++; - } - - return (mylist.first); -} diff --git 
a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 296cb07..0c4aa13 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -50,479 +51,104 @@ #include #include #include +#include -#include "ibnetdiscover.h" -#include "grouping.h" #include "ibdiag_common.h" -static char *node_type_str[] = { - "???", - "ca", - "switch", - "router", - "iwarp rnic" -}; - -static char *linkwidth_str[] = { - "??", - "1x", - "4x", - "??", - "8x", - "??", - "??", - "??", - "12x" -}; - -static char *linkspeed_str[] = { - "???", - "SDR", - "DDR", - "???", - "QDR" -}; - -static int timeout = 2000; /* ms */ -static int dumplevel = 0; +static int debug; static int verbose; -static FILE *f; +#define LIST_CA_NODE (1 << IBND_CA_NODE) +#define LIST_SWITCH_NODE (1 << IBND_SWITCH_NODE) +#define LIST_ROUTER_NODE (1 << IBND_ROUTER_NODE) char *argv0 = "ibnetdiscover"; +static FILE *f; static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; -Node *nodesdist[MAXHOPS+1]; /* last is Ca list */ -Node *mynode; -int maxhops_discovered = 0; - -struct ChassisList *chassis = NULL; - -static char * -get_linkwidth_str(int linkwidth) -{ - if (linkwidth > 8) - return linkwidth_str[0]; - else - return linkwidth_str[linkwidth]; -} - -static char * -get_linkspeed_str(int linkspeed) -{ - if (linkspeed > 4) - return linkspeed_str[0]; - else - return linkspeed_str[linkspeed]; -} - -static inline const char* -node_type_str2(Node *node) -{ - switch(node->type) { - case SWITCH_NODE: return "SW"; - case CA_NODE: return "CA"; - case ROUTER_NODE: return "RT"; - } - return "??"; -} - 
-void -decode_port_info(void *pi, Port *port) -{ - mad_decode_field(pi, IB_PORT_LID_F, &port->lid); - mad_decode_field(pi, IB_PORT_LMC_F, &port->lmc); - mad_decode_field(pi, IB_PORT_STATE_F, &port->state); - mad_decode_field(pi, IB_PORT_PHYS_STATE_F, &port->physstate); - mad_decode_field(pi, IB_PORT_LINK_WIDTH_ACTIVE_F, &port->linkwidth); - mad_decode_field(pi, IB_PORT_LINK_SPEED_ACTIVE_F, &port->linkspeed); -} - - -int -get_port(Port *port, int portnum, ib_portid_t *portid) -{ - char portinfo[64]; - void *pi = portinfo; - - port->portnum = portnum; - - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) - return -1; - decode_port_info(pi, port); - - DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s", - portid2str(portid), portnum, port->lid, port->state, port->physstate, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed)); - return 1; -} -/* - * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. - */ -int -get_node(Node *node, Port *port, ib_portid_t *portid) -{ - char portinfo[64]; - char switchinfo[64]; - void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; - void *si = switchinfo; - - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) - return -1; - - mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); - mad_decode_field(ni, IB_NODE_TYPE_F, &node->type); - mad_decode_field(ni, IB_NODE_NPORTS_F, &node->numports); - mad_decode_field(ni, IB_NODE_DEVID_F, &node->devid); - mad_decode_field(ni, IB_NODE_VENDORID_F, &node->vendid); - mad_decode_field(ni, IB_NODE_SYSTEM_GUID_F, &node->sysimgguid); - mad_decode_field(ni, IB_NODE_PORT_GUID_F, &node->portguid); - mad_decode_field(ni, IB_NODE_LOCAL_PORT_F, &node->localport); - port->portnum = node->localport; - port->portguid = node->portguid; - - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout)) - return -1; - - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout)) - return -1; - decode_port_info(pi, port); - - if 
(node->type != SWITCH_NODE) - return 0; - - node->smalid = port->lid; - node->smalmc = port->lmc; - - /* after we have the sma information find out the real PortInfo for this port */ - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, node->localport, timeout)) - return -1; - decode_port_info(pi, port); - - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) - node->smaenhsp0 = 0; /* assume base SP0 */ - else - mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); - - DEBUG("portid %s: got switch node %" PRIx64 " '%s'", - portid2str(portid), node->nodeguid, node->nodedesc); - return 1; -} - -static int -extend_dpath(ib_dr_path_t *path, int nextport) -{ - if (path->cnt+2 >= sizeof(path->p)) - return -1; - ++path->cnt; - if (path->cnt > maxhops_discovered) - maxhops_discovered = path->cnt; - path->p[path->cnt] = nextport; - return path->cnt; -} - -static void -dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port) -{ - if (!dumplevel) - return; - - fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n", - portid2str(path), prompt, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->type == SWITCH_NODE ? 
0 : port->portnum, - port->lid, port->lid + (1 << port->lmc) - 1, - clean_nodedesc(node->nodedesc)); -} - -#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) -#define HTSZ 137 - -static Node *nodestbl[HTSZ]; - -static Node * -find_node(Node *new) -{ - int hash = HASHGUID(new->nodeguid) % HTSZ; - Node *node; - - for (node = nodestbl[hash]; node; node = node->htnext) - if (node->nodeguid == new->nodeguid) - return node; - - return NULL; -} - -static Node * -create_node(Node *temp, ib_portid_t *path, int dist) -{ - Node *node; - int hash = HASHGUID(temp->nodeguid) % HTSZ; - - node = malloc(sizeof(*node)); - if (!node) - return NULL; - - memcpy(node, temp, sizeof(*node)); - node->dist = dist; - node->path = *path; - - node->htnext = nodestbl[hash]; - nodestbl[hash] = node; - - if (node->type != SWITCH_NODE) - dist = MAXHOPS; /* special Ca list */ - - node->dnext = nodesdist[dist]; - nodesdist[dist] = node; - - return node; -} - -static Port * -find_port(Node *node, Port *port) -{ - Port *old; - - for (old = node->ports; old; old = old->next) - if (old->portnum == port->portnum) - return old; - - return NULL; -} - -static Port * -create_port(Node *node, Port *temp) -{ - Port *port; - - port = malloc(sizeof(*port)); - if (!port) - return NULL; - - memcpy(port, temp, sizeof(*port)); - port->node = node; - port->next = node->ports; - node->ports = port; - - return port; -} - -static void -link_ports(Node *node, Port *port, Node *remotenode, Port *remoteport) -{ - DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u", - node->nodeguid, node, port, port->portnum, - remotenode->nodeguid, remotenode, remoteport, remoteport->portnum); - if (port->remoteport) - port->remoteport->remoteport = NULL; - if (remoteport->remoteport) - remoteport->remoteport->remoteport = NULL; - port->remoteport = remoteport; - remoteport->remoteport = port; -} - -static int -handle_port(Node *node, Port *port, ib_portid_t *path, int 
portnum, int dist) -{ - Node node_buf; - Port port_buf; - Node *remotenode, *oldnode; - Port *remoteport, *oldport; - - memset(&node_buf, 0, sizeof(node_buf)); - memset(&port_buf, 0, sizeof(port_buf)); - - DEBUG("handle node %p port %p:%d dist %d", node, port, portnum, dist); - if (port->physstate != 5) /* LinkUp */ - return -1; - - if (extend_dpath(&path->drpath, portnum) < 0) - return -1; - - if (get_node(&node_buf, &port_buf, path) < 0) { - IBWARN("NodeInfo on %s failed, skipping port", - portid2str(path)); - path->drpath.cnt--; /* restore path */ - return -1; - } - - oldnode = find_node(&node_buf); - if (oldnode) - remotenode = oldnode; - else if (!(remotenode = create_node(&node_buf, path, dist + 1))) - IBERROR("no memory"); - - oldport = find_port(remotenode, &port_buf); - if (oldport) { - remoteport = oldport; - if (node != remotenode || port != remoteport) - IBWARN("port moving..."); - } else if (!(remoteport = create_port(remotenode, &port_buf))) - IBERROR("no memory"); - - dump_endnode(path, oldnode ? "known remote" : "new remote", - remotenode, remoteport); - - link_ports(node, port, remotenode, remoteport); - - path->drpath.cnt--; /* restore path */ - return 0; -} - -/* - * Return 1 if found, 0 if not, -1 on errors. 
- */ -static int -discover(ib_portid_t *from) -{ - Node node_buf; - Port port_buf; - Node *node; - Port *port; - int i; - int dist = 0; - ib_portid_t *path; - - DEBUG("from %s", portid2str(from)); - - memset(&node_buf, 0, sizeof(node_buf)); - memset(&port_buf, 0, sizeof(port_buf)); - - if (get_node(&node_buf, &port_buf, from) < 0) { - IBWARN("can't reach node %s", portid2str(from)); - return -1; - } - - node = create_node(&node_buf, from, 0); - if (!node) - IBERROR("out of memory"); +static int timeout_ms = 2000; - mynode = node; - - port = create_port(node, &port_buf); - if (!port) - IBERROR("out of memory"); - - if (node->type != SWITCH_NODE && - handle_port(node, port, from, node->localport, 0) < 0) - return 0; - - for (dist = 0; dist < MAXHOPS; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - path = &node->path; - - DEBUG("dist %d node %p", dist, node); - dump_endnode(path, "processing", node, port); - - for (i = 1; i <= node->numports; i++) { - if (i == node->localport) - continue; - - if (get_port(&port_buf, i, path) < 0) { - IBWARN("can't reach node %s port %d", portid2str(path), i); - continue; - } - - port = find_port(node, &port_buf); - if (port) - continue; - - port = create_port(node, &port_buf); - if (!port) - IBERROR("out of memory"); - - /* If switch, set port GUID to node GUID */ - if (node->type == SWITCH_NODE) - port->portguid = node->portguid; - - handle_port(node, port, path, i, dist); - } - } - } - - return 0; -} char * -node_name(Node *node) +node_name(ibnd_node_t *node) { static char buf[256]; - switch(node->type) { - case SWITCH_NODE: - sprintf(buf, "\"%s", "S"); - break; - case CA_NODE: + switch(node->info.type) { + case IBND_CA_NODE: sprintf(buf, "\"%s", "H"); break; - case ROUTER_NODE: + case IBND_SWITCH_NODE: + sprintf(buf, "\"%s", "S"); + break; + case IBND_ROUTER_NODE: sprintf(buf, "\"%s", "R"); break; default: sprintf(buf, "\"%s", "?"); break; } - sprintf(buf+2, "-%016" PRIx64 "\"", node->nodeguid); + 
sprintf(buf+2, "-%016" PRIx64 "\"", node->info.nodeguid); return buf; } void -list_node(Node *node) +list_node(ibnd_node_t *node, void *user_data) { - char *node_type; - char *nodename = remap_node_name(node_name_map, node->nodeguid, + char *nodename = remap_node_name(node_name_map, node->info.nodeguid, node->nodedesc); - switch(node->type) { - case SWITCH_NODE: - node_type = "Switch"; - break; - case CA_NODE: - node_type = "Ca"; - break; - case ROUTER_NODE: - node_type = "Router"; - break; - default: - node_type = "???"; - break; - } fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", - node_type, - node->nodeguid, node->numports, node->devid, node->vendid, + ibnd_node_type_str(node), + node->info.nodeguid, node->info.numports, node->info.devid, + node->info.vendid, nodename); free(nodename); } void -out_ids(Node *node, int group, char *chname) +list_nodes(ibnd_fabric_t *fabric, int list) +{ + if (list & LIST_CA_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IBND_CA_NODE, NULL); + } + if (list & LIST_SWITCH_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IBND_SWITCH_NODE, NULL); + } + if (list & LIST_ROUTER_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IBND_ROUTER_NODE, NULL); + } +} + +void +out_ids(ibnd_node_t *node, int group, char *chname) { - fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); - if (node->sysimgguid) - fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); - if (group - && node->chrecord && node->chrecord->chassisnum) { - fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); + fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->info.vendid, node->info.devid); + if (node->info.sysimgguid) + fprintf(f, "sysimgguid=0x%" PRIx64, node->info.sysimgguid); + if (group && node->chassis && node->chassis->chassisnum) { + fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum); if (chname) - fprintf(f, " (%s)", chname); - if (is_xsigo_tca(node->nodeguid) && node->ports->remoteport) - fprintf(f, " 
slot %d", node->ports->remoteport->portnum); + fprintf(f, " (%s)", clean_nodedesc(chname)); + if (ibnd_is_xsigo_tca(node->info.nodeguid) + && node->ports[1] + && node->ports[1]->remoteport) + fprintf(f, " slot %d", node->ports[1]->remoteport->portnum); } fprintf(f, "\n"); } + uint64_t -out_chassis(int chassisnum) +out_chassis(ibnd_fabric_t *fabric, int chassisnum) { uint64_t guid; fprintf(f, "\nChassis %d", chassisnum); - guid = get_chassis_guid(chassisnum); + guid = ibnd_get_chassis_guid(fabric, chassisnum); if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", guid); fprintf(f, "\n"); @@ -530,54 +156,49 @@ out_chassis(int chassisnum) } void -out_switch(Node *node, int group, char *chname) +out_switch(ibnd_node_t *node, int group, char *chname) { char *str; + char str2[256]; char *nodename = NULL; out_ids(node, group, chname); - fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); - fprintf(f, "(%" PRIx64 ")", node->portguid); - /* Currently, only if Voltaire chassis */ - if (group - && node->chrecord && node->chrecord->chassisnum - && node->vendid == VTR_VENDOR_ID) { - str = get_chassis_type(node->chrecord->chassistype); + fprintf(f, "switchguid=0x%" PRIx64, node->info.nodeguid); + fprintf(f, "(%" PRIx64 ")", node->info.nodeportguid); + if (group) { + str = ibnd_get_chassis_type(node); if (str) fprintf(f, "%s ", str); - str = get_chassis_slot(node->chrecord->chassisslot); + str = ibnd_get_chassis_slot_str(node, str2, 256); if (str) - fprintf(f, "%s ", str); - fprintf(f, "%d Chip %d", node->chrecord->slotnum, node->chrecord->anafanum); + fprintf(f, "%s", str); } - nodename = remap_node_name(node_name_map, node->nodeguid, + nodename = remap_node_name(node_name_map, node->info.nodeguid, node->nodedesc); fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n", - node->numports, node_name(node), + node->info.numports, node_name(node), nodename, - node->smaenhsp0 ? "enhanced" : "base", + node->sw_info.smaenhsp0 ? 
"enhanced" : "base", node->smalid, node->smalmc); free(nodename); } void -out_ca(Node *node, int group, char *chname) +out_ca(ibnd_node_t *node, int group, char *chname) { char *node_type; char *node_type2; - char *nodename = remap_node_name(node_name_map, node->nodeguid, - node->nodedesc); out_ids(node, group, chname); - switch(node->type) { - case CA_NODE: + switch(node->info.type) { + case IBND_CA_NODE: node_type = "ca"; node_type2 = "Ca"; break; - case ROUTER_NODE: + case IBND_ROUTER_NODE: node_type = "rt"; node_type2 = "Rt"; break; @@ -587,37 +208,37 @@ out_ca(Node *node, int group, char *chname) break; } - fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); + fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->info.nodeguid); fprintf(f, "%s\t%d %s\t\t# \"%s\"", - node_type2, node->numports, node_name(node), - nodename); - if (group && is_xsigo_hca(node->nodeguid)) + node_type2, node->info.numports, node_name(node), + clean_nodedesc(node->nodedesc)); + if (group && ibnd_is_xsigo_hca(node->info.nodeguid)) fprintf(f, " (scp)"); fprintf(f, "\n"); - - free(nodename); } +#define OUT_BUFFER_SIZE 16 static char * -out_ext_port(Port *port, int group) +out_ext_port(ibnd_port_t *port, int group) { - char *str = NULL; + static char mapping[OUT_BUFFER_SIZE]; - /* Currently, only if Voltaire chassis */ - if (group - && port->node->chrecord && port->node->vendid == VTR_VENDOR_ID) - str = portmapstring(port); + if (group && port->ext_portnum != 0) { + snprintf(mapping, OUT_BUFFER_SIZE, + "[ext %d]", port->ext_portnum); + return (mapping); + } - return (str); + return (NULL); } void -out_switch_port(Port *port, int group) +out_switch_port(ibnd_port_t *port, int group) { char *ext_port_str = NULL; char *rem_nodename = NULL; - DEBUG("port %p:%d remoteport %p", port, port->portnum, port->remoteport); + DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport); fprintf(f, "[%d]", port->portnum); ext_port_str = out_ext_port(port, group); @@ -625,7 
+246,7 @@ out_switch_port(Port *port, int group) fprintf(f, "%s", ext_port_str); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->nodeguid, + port->remoteport->node->info.nodeguid, port->remoteport->node->nodedesc); ext_port_str = out_ext_port(port->remoteport, group); @@ -633,17 +254,19 @@ out_switch_port(Port *port, int group) node_name(port->remoteport->node), port->remoteport->portnum, ext_port_str ? ext_port_str : ""); - if (port->remoteport->node->type != SWITCH_NODE) - fprintf(f, "(%" PRIx64 ") ", port->remoteport->portguid); + if (port->remoteport->node->info.type != IBND_SWITCH_NODE) + fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid); fprintf(f, "\t\t# \"%s\" lid %d %s%s", rem_nodename, - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); + port->remoteport->node->info.type == IBND_SWITCH_NODE ? + port->remoteport->node->smalid : + port->remoteport->info.base_lid, + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 0)); - if (is_xsigo_tca(port->remoteport->portguid)) + if (ibnd_is_xsigo_tca(port->remoteport->guid)) fprintf(f, " slot %d", port->portnum); - else if (is_xsigo_hca(port->remoteport->portguid)) + else if (ibnd_is_xsigo_hca(port->remoteport->guid)) fprintf(f, " (scp)"); fprintf(f, "\n"); @@ -651,278 +274,294 @@ out_switch_port(Port *port, int group) } void -out_ca_port(Port *port, int group) +out_ca_port(ibnd_port_t *port, int group) { char *str = NULL; char *rem_nodename = NULL; fprintf(f, "[%d]", port->portnum); - if (port->node->type != SWITCH_NODE) - fprintf(f, "(%" PRIx64 ") ", port->portguid); + if (port->node->info.type != IBND_SWITCH_NODE) + fprintf(f, "(%" PRIx64 ") ", port->guid); fprintf(f, "\t%s[%d]", node_name(port->remoteport->node), port->remoteport->portnum); str = out_ext_port(port->remoteport, group); if (str) fprintf(f, "%s", 
str); - if (port->remoteport->node->type != SWITCH_NODE) - fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid); + if (port->remoteport->node->info.type != IBND_SWITCH_NODE) + fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->nodeguid, + port->remoteport->node->info.nodeguid, port->remoteport->node->nodedesc); fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n", - port->lid, port->lmc, rem_nodename, - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); + port->info.base_lid, port->info.lmc, rem_nodename, + port->remoteport->node->info.type == IBND_SWITCH_NODE ? + port->remoteport->node->smalid : + port->remoteport->info.base_lid, + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 0)); free(rem_nodename); } +struct iter_user_data { + int group; + int skip_chassis_nodes; +}; + +static void +switch_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("SWITCH: node %p\n", node); + + /* skip chassis based switches if flagged */ + if (data->skip_chassis_nodes && node->chassis && node->chassis->chassisnum) + return; + + out_switch(node, data->group, NULL); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_switch_port(port, data->group); + } +} + +static void +ca_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("CA: node %p\n", node); + /* Now, skip chassis based CAs */ + if (data->group && node->chassis && node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + + for (p = 1; p <= node->info.numports; 
p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + +static void +router_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("RT: node %p\n", node); + /* Now, skip chassis based RTs */ + if (data->group && node->chassis && + node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + int -dump_topology(int listtype, int group) +dump_topology(int group, ibnd_fabric_t *fabric) { - Node *node; - Port *port; - int i = 0, dist = 0; + ibnd_node_t *node; + ibnd_port_t *port; + int i = 0, p = 0; time_t t = time(0); uint64_t chguid; char *chname = NULL; + struct iter_user_data iter_user_data; - if (!listtype) { - fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); - fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered); - fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid); - } + fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); + fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); + fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", + fabric->from_node->info.nodeguid, fabric->from_node->info.nodeportguid); /* Make pass on switches */ - if (group && !listtype) { - ChassisList *ch = NULL; + if (group) { + ibnd_chassis_t *ch = NULL; /* Chassis based switches first */ - for (ch = chassis; ch; ch = ch->next) { + for (ch = fabric->chassis; ch; ch = ch->next) { int n = 0; if (!ch->chassisnum) continue; - chguid = out_chassis(ch->chassisnum); - if (chname) - free(chname); + chguid = out_chassis(fabric, ch->chassisnum); + chname = NULL; - if (is_xsigo_guid(chguid)) { - for (node = nodesdist[MAXHOPS]; node; node = 
node->dnext) { - if (!node->chrecord || - !node->chrecord->chassisnum) +/** + * Will this work for Xsigo? + */ + if (ibnd_is_xsigo_guid(chguid)) { + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (ibnd_is_xsigo_hca(node->info.nodeguid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); + } + } + +#if 0 +/** + * vs. this? + * I don't want to expose the nodesdist array to the end user. + */ + for (node = fabric->nodesdist[MAXHOPS]; node; node = node->dnext) { + if (!node->chassis || + !node->chassis->chassisnum) continue; - if (node->chrecord->chassisnum != ch->chassisnum) + if (node->chassis->chassisnum != ch->chassisnum) continue; - if (is_xsigo_hca(node->nodeguid)) { - chname = remap_node_name(node_name_map, - node->nodeguid, - node->nodedesc); - fprintf(f, "Hostname: %s\n", chname); + if (ibnd_is_xsigo_hca(node->nodeguid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); } } +#endif } fprintf(f, "\n# Spine Nodes"); - for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { + for (n = 1; n <= SPINES_MAX_NUM; n++) { if (ch->spinenode[n]) { out_switch(ch->spinenode[n], group, chname); - for (port = ch->spinenode[n]->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= ch->spinenode[n]->info.numports; p++) { + port = ch->spinenode[n]->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); + } } } fprintf(f, "\n# Line Nodes"); - for (n = 1; n <= (LINES_MAX_NUM+1); n++) { + for (n = 1; n <= LINES_MAX_NUM; n++) { if (ch->linenode[n]) { out_switch(ch->linenode[n], group, chname); - for (port = ch->linenode[n]->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= ch->linenode[n]->info.numports; p++) { + port = ch->linenode[n]->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); + } } } fprintf(f, "\n# Chassis Switches"); - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for 
(node = nodesdist[dist]; node; node = node->dnext) { - - /* Non Voltaire chassis */ - if (node->vendid == VTR_VENDOR_ID) - continue; - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->info.type == IBND_SWITCH_NODE) { out_switch(node, group, chname); - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); - + } } - } fprintf(f, "\n# Chassis CAs"); - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - - out_ca(node, group, chname); - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_ca_port(port, group); - + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->info.type == IBND_CA_NODE) { + out_ca(node, group, chname); + for (p = 1; p <= node->info.numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, group); + } + } } } - } else { - for (dist = 0; dist <= maxhops_discovered; dist++) { + } else { /* !group */ + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; - for (node = nodesdist[dist]; node; node = node->dnext) { - - DEBUG("SWITCH: dist %d node %p", dist, node); - if (!listtype) - out_switch(node, group, chname); - else { - if (listtype & LIST_SWITCH_NODE) - list_node(node); - continue; - } - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_switch_port(port, group); - } - } + ibnd_iter_nodes_type(fabric, switch_iter_func, + IBND_SWITCH_NODE, &iter_user_data); } - if (chname) - free(chname); chname = NULL; - if (group && !listtype) { + if (group) { + 
iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 1; fprintf(f, "\nNon-Chassis Nodes\n"); - - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - DEBUG("SWITCH: dist %d node %p", dist, node); - /* Now, skip chassis based switches */ - if (node->chrecord && - node->chrecord->chassisnum) - continue; - out_switch(node, group, chname); - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_switch_port(port, group); - } - - } + ibnd_iter_nodes_type(fabric, switch_iter_func, + IBND_SWITCH_NODE, &iter_user_data); } - /* Make pass on CAs */ - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; - DEBUG("CA: dist %d node %p", dist, node); - if (!listtype) { - /* Now, skip chassis based CAs */ - if (group && node->chrecord && - node->chrecord->chassisnum) - continue; - out_ca(node, group, chname); - } else { - if (((listtype & LIST_CA_NODE) && (node->type == CA_NODE)) || - ((listtype & LIST_ROUTER_NODE) && (node->type == ROUTER_NODE))) - list_node(node); - continue; - } - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_ca_port(port, group); - } + /* Make pass on CAs */ + ibnd_iter_nodes_type(fabric, ca_iter_func, IBND_CA_NODE, + &iter_user_data); - if (chname) - free(chname); + /* make pass on routers */ + ibnd_iter_nodes_type(fabric, router_iter_func, IBND_ROUTER_NODE, + &iter_user_data); return i; } -void dump_ports_report () + +void dump_ports_report (ibnd_node_t *node, void *user_data) { - int b, n = 0, p; - Node *node; - Port *port; - - // If switch and LID == 0, search of other switch ports with - // valid LID and assign it to all ports of that switch - for (b = 0; b <= MAXHOPS; b++) - for (node = nodesdist[b]; node; node = node->dnext) - if (node->type == SWITCH_NODE) { - int swlid = 0; - for (p = 0, port = node->ports; - p < 
node->numports && port && !swlid; - port = port->next) - if (port->lid != 0) - swlid = port->lid; - for (p = 0, port = node->ports; - p < node->numports && port; - port = port->next) - port->lid = swlid; - } + int p = 0; + ibnd_port_t *port = NULL; + + /* for each port */ + for (p = node->info.numports, port = node->ports[p]; + p > 0; + port = node->ports[--p]) { + if (port == NULL) + continue; - for (b = 0; b <= MAXHOPS; b++) - for (node = nodesdist[b]; node; node = node->dnext) { - for (p = 0, port = node->ports; - p < node->numports && port; - p++, port = port->next) { - fprintf(stdout, - "%2s %5d %2d 0x%016" PRIx64 " %s %s", - node_type_str2(port->node), port->lid, - port->portnum, - port->portguid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); - if (port->remoteport) - fprintf(stdout, - " - %2s %5d %2d 0x%016" PRIx64 - " ( '%s' - '%s' )\n", - node_type_str2(port->remoteport->node), - port->remoteport->lid, - port->remoteport->portnum, - port->remoteport->portguid, - port->node->nodedesc, - port->remoteport->node->nodedesc); - else - fprintf(stdout, "%36s'%s'\n", "", - port->node->nodedesc); - } - n++; - } + fprintf(stdout, + "%2s %5d %2d 0x%016" PRIx64 " %s %s", + ibnd_node_type_str_short(node), + node->info.type == IBND_SWITCH_NODE ? + node->smalid : port->info.base_lid, + port->portnum, + port->guid, + ibnd_linkwidth_str(port->info.link_width_active), + ibnd_linkspeed_str(port->info.link_speed_active, 0)); + if (port->remoteport) + fprintf(stdout, + " - %2s %5d %2d 0x%016" PRIx64 + " ( '%s' - '%s' )\n", + ibnd_node_type_str_short(port->remoteport->node), + port->remoteport->node->info.type == IBND_SWITCH_NODE ? 
+ port->remoteport->node->smalid : + port->remoteport->info.base_lid, + port->remoteport->portnum, + port->remoteport->guid, + port->node->nodedesc, + port->remoteport->node->nodedesc); + else + fprintf(stdout, "%36s'%s'\n", "", + port->node->nodedesc); + } } void usage(void) { - fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port " + fprintf(stderr, "Usage: %s [-d(ebug)] -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port " "-t(imeout) timeout_ms --node-name-map node-name-map] -p(orts) []\n", argv0); fprintf(stderr, " --node-name-map specify a node name map file\n"); @@ -932,20 +571,18 @@ usage(void) int main(int argc, char **argv) { - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; - ib_portid_t my_portid = {0}; - int udebug = 0, list = 0; + int list = 0; char *ca = 0; int ca_port = 0; int group = 0; int ports_report = 0; + ibnd_fabric_t *fabric = NULL; static char const str_opts[] = "C:P:t:devslgHSRpVhu"; static const struct option long_opts[] = { { "C", 1, 0, 'C'}, { "P", 1, 0, 'P'}, { "debug", 0, 0, 'd'}, - { "err_show", 0, 0, 'e'}, { "verbose", 0, 0, 'v'}, { "show", 0, 0, 's'}, { "list", 0, 0, 'l'}, @@ -981,23 +618,17 @@ main(int argc, char **argv) ca_port = strtoul(optarg, 0, 0); break; case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; + debug = 1; + ibnd_debug(1); break; case 't': - timeout = strtoul(optarg, 0, 0); + timeout_ms = strtoul(optarg, 0, 0); break; case 'v': verbose++; - dumplevel++; break; case 's': - dumplevel = 1; - break; - case 'e': - madrpc_show_errors(1); + ibnd_show_progress(1); break; case 'l': list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; @@ -1006,13 +637,13 @@ main(int argc, char **argv) group = 1; break; case 'S': - list = LIST_SWITCH_NODE; + list |= LIST_SWITCH_NODE; break; case 'H': - list = LIST_CA_NODE; + list |= 
LIST_CA_NODE; break; case 'R': - list = LIST_ROUTER_NODE; + list |= LIST_ROUTER_NODE; break; case 'V': fprintf(stderr, "%s %s\n", argv0, get_build_version() ); @@ -1029,22 +660,25 @@ main(int argc, char **argv) argv += optind; if (argc && !(f = fopen(argv[0], "w"))) - IBERROR("can't open file %s for writing", argv[0]); + fprintf(stderr, "can't open file %s for writing", argv[0]); - madrpc_init(ca, ca_port, mgmt_classes, 2); node_name_map = open_node_name_map(node_name_map_file); - if (discover(&my_portid) < 0) - IBERROR("discover"); - - if (group) - chassis = group_nodes(); + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } if (ports_report) - dump_ports_report(); + ibnd_iter_nodes(fabric, + dump_ports_report, + NULL); + else if (list) + list_nodes(fabric, list); else - dump_topology(list, group); + dump_topology(group, fabric); + ibnd_destroy_fabric(fabric); close_node_name_map(node_name_map); exit(0); } -- 1.5.4.5 From vlad at lists.openfabrics.org Sat Jan 10 03:11:56 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 10 Jan 2009 03:11:56 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090110-0200 daily build status Message-ID: <20090110111156.284F0E60E84@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on 
x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From ronli at voltaire.com Sun Jan 11 00:19:32 2009 From: ronli at voltaire.com (Ron Livne) Date: Sun, 11 Jan 2009 10:19:32 +0200 Subject: [ofa-general] RE: About create_qp_flags merging In-Reply-To: References: Message-ID: a) I added this new verb in order to expose the creation flag IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK from the kernel to userspace. We want to let the user the option to choose it as a creation flag if they want. If you'll merge it in, I'll post a patch to support it in rdma_cm. Currently the only component using this flag is IPOIB. b) I know that XRC was submitted before my patches. 
However, since I don't know when the XRC will be merged, I reposted a series of patches that don't need the XRC in order to apply mine. Ron -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Friday, January 09, 2009 12:16 AM To: Ron Livne Cc: rolandd at cisco.com; general at lists.openfabrics.org; Olga Shern Subject: Re: About create_qp_flags merging > You've told me that you're not going to merge the create_qp_flags > patches because they are stuck behind XRC. > > However, I reposted new patches in December that don't rely on the XRC > patches. a) your patches look pointless, because they add a new userspace interface and then don't actually hook anything in libibverbs up to the new interface. b) is there any reason why your patches are so important that I should skip merging the XRC work that was submitted before yours and jump to apply yours first? - R. From vlad at lists.openfabrics.org Sun Jan 11 03:16:24 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 11 Jan 2009 03:16:24 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090111-0200 daily build status Message-ID: <20090111111624.47678E60F26@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with 
linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From tziporet at dev.mellanox.co.il Sun Jan 11 03:57:49 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 11 Jan 2009 13:57:49 +0200 Subject: [ofa-general] Ethernet emulation on ConnectX card In-Reply-To: <49662A27.5020107@ext.bull.net> References: <5D49E7A8952DC44FB38C38FA0D758EAD0146618E@mtlexch01.mtl.com> <49662A27.5020107@ext.bull.net> Message-ID: <4969DEBD.6000600@mellanox.co.il> Celine Bourde wrote: > I've a ConnectX card and I would like to test Ethernet emulation on > both IB ports. > Is there any documentation on it ? > Should I modify the firmware .ini file to activate the functionality ? > > Any idea to proceed ? > > Look at mlx4_release_notes.txt to section 3. 
VPI (Virtual Protocol Interconnect) http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-docs/mlx4_release_notes.txt Tziporet From sashak at voltaire.com Sun Jan 11 12:49:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 11 Jan 2009 22:49:27 +0200 Subject: [ofa-general] [PATCH] opensm: update LFTs when entering master state Message-ID: <20090111204927.GG1441@sashak.voltaire.com> When we are going to setup LFTs we need to ignore its previous images if OpenSM enters master after standby, so need to check for subnet need_update flag too. Actually this should fix bug#1469. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_mgr.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index d332b36..96921a0 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -400,7 +400,7 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, for (block_id_ho = 0; osm_switch_get_lft_block(p_sw, block_id_ho, block); block_id_ho++) { - if (!p_sw->need_update && + if (!p_sw->need_update && !p_mgr->p_subn->need_update && !memcmp(block, p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, IB_SMP_DATA_SIZE)) -- 1.6.1.rc1.45.g123ed From kliteyn at dev.mellanox.co.il Sun Jan 11 13:42:48 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 11 Jan 2009 23:42:48 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update LFTs when entering master state In-Reply-To: <20090111204927.GG1441@sashak.voltaire.com> References: <20090111204927.GG1441@sashak.voltaire.com> Message-ID: <496A67D8.9070308@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > When we are going to setup LFTs we need to ignore its previous images if > OpenSM enters master after standby, so need to check for subnet > need_update flag too. Nice catch. 
I think there will be a similar problem with cached routing too - need to invalidate the cache when SM enters master state. -- Yevgeny > Actually this should fix bug#1469. > > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_mgr.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index d332b36..96921a0 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -400,7 +400,7 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, > for (block_id_ho = 0; > osm_switch_get_lft_block(p_sw, block_id_ho, block); > block_id_ho++) { > - if (!p_sw->need_update && > + if (!p_sw->need_update && !p_mgr->p_subn->need_update && > !memcmp(block, > p_sw->new_lft + block_id_ho * IB_SMP_DATA_SIZE, > IB_SMP_DATA_SIZE)) From kliteyn at dev.mellanox.co.il Sun Jan 11 13:48:12 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 11 Jan 2009 23:48:12 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active In-Reply-To: <20090107162507.GG11759@sashak.voltaire.com> References: <495C8576.9060004@dev.mellanox.co.il> <20090107162507.GG11759@sashak.voltaire.com> Message-ID: <496A691C.2000809@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 10:57 Thu 01 Jan , Yevgeny Kliteynik wrote: >> When switch is coming up after reset, port 0 always reports >> logical state ACTIVE. >> OpenSM shouldn't clear sw->need_update flag because of port 0. >> >> Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. > > BTW is it legal from IBA point of view for switch to setup logical state > of port 0 without SM intervention after reset? Good question. I don't remember any special port 0 treatment in the spec WRT the port state... 
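For readers following the port 0 patch quoted in this thread, the changed condition can be restated as a standalone predicate. This is only a sketch: the enum and the helper name should_clear_need_update() are made up for illustration and are not OpenSM identifiers; the numeric values are assumed to follow the PortInfo:PortState encoding (Down=1, Init=2, Armed=3, Active=4).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Logical port states; values assumed to match PortInfo:PortState. */
enum link_state {
    LINK_DOWN = 1,
    LINK_INIT = 2,
    LINK_ARMED = 3,
    LINK_ACTIVE = 4
};

/*
 * Hypothetical helper (not an OpenSM function): returns true when a
 * PortInfo response may clear the switch's need_update flag.
 * Port 0 is excluded because, per the patch description, it reports
 * ACTIVE right after a switch reset.
 */
bool should_clear_need_update(enum link_state port_state, bool is_switch,
                              bool need_update, uint8_t port_num)
{
    assert(port_state >= LINK_DOWN && port_state <= LINK_ACTIVE);
    return port_state > LINK_INIT && is_switch && need_update &&
           port_num != 0;
}
```

With this form, an ACTIVE report from port 0 leaves the flag set, while an ARMED or ACTIVE report from any external port clears it, which is the behaviour the one-line change is after.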
-- Yevgeny > Sasha > >> --- >> opensm/opensm/osm_port_info_rcv.c | 2 +- >> 1 files changed, 1 insertions(+), 1 deletions(-) >> >> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c >> index 8763b87..02ad586 100644 >> --- a/opensm/opensm/osm_port_info_rcv.c >> +++ b/opensm/opensm/osm_port_info_rcv.c >> @@ -317,7 +317,7 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, >> } >> >> if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw && >> - p_node->sw->need_update == 1) >> + p_node->sw->need_update == 1 && port_num != 0) >> p_node->sw->need_update = 0; >> >> if (p_physp->need_update) >> -- >> 1.5.1.4 >> > From vlad at lists.openfabrics.org Mon Jan 12 03:14:52 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 12 Jan 2009 03:14:52 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090112-0200 daily build status Message-ID: <20090112111452.96F63E60E42@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on 
x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From PHF at zurich.ibm.com Mon Jan 12 08:18:15 2009 From: PHF at zurich.ibm.com (Philip Frey1) Date: Mon, 12 Jan 2009 17:18:15 +0100 Subject: [ofa-general] Shared Protection Domain for iWARP Message-ID: Hello, I have a short user application question for the experts: I would like 2 QPs to be able to access a single shared MR. For that I need the QPs to be in the same PD. What is the right way to do that with OFED 1.4? MPA initiator: ************** ---- QP1: ---- 1. rdma_create_event_channel() 2. rdma_create_id(event_channel, cm_id_1, ...) 3. rdma_resolve_addr(cm_id_1, ...) 4. rdma_resolve_route(cm_id_1, ...) 5. pd_1 = ibv_alloc_pd(cm_id_1->verbs) 6. ibv_create_comp_channel(cm_id_1->verbs) 7. ibv_create_cq(cm_id_1->verbs, ..., comp_channel, ...) 8. rdma_create_qp(cm_id_1, pd_1, init_attrs) ---- QP2: (Variant A) ---- 2. rdma_create_id(event_channel, cm_id_2, ...) 3. rdma_resolve_addr(cm_id_2, ...) 4. rdma_resolve_route(cm_id_2, ...) 6. ibv_create_comp_channel(cm_id_2->verbs) 7. ibv_create_cq(cm_id_2->verbs, ..., comp_channel, ...) 8. 
rdma_create_qp(cm_id_2, pd_1, init_attrs)

----
QP2: (Variant B)
----
6. ibv_create_comp_channel(cm_id_1->verbs)
7. ibv_create_cq(cm_id_1->verbs, ..., comp_channel, ...)
8. rdma_create_qp(cm_id_1, pd_1, init_attrs)

The short question is: do I need to create a new CM ID (including address/route resolution) in order to create a new QP with its own completion channel and CQ but with a PD that is shared with QP1? Or do I reuse the CM ID, create a new completion channel / CQ on it, and create a new QP based on the old CM ID and the PD that is shared with QP1?

Many thanks for your advice,
Philip

From gmpc at sanger.ac.uk Mon Jan 12 08:49:09 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Mon, 12 Jan 2009 16:49:09 +0000
Subject: [ofa-general] Boot time detection of SRP devices
In-Reply-To: References:
Message-ID: <496B7485.7070606@sanger.ac.uk>

Hi all,

I am trying to write an init script to do automatic discovery of SRP devices during boot. My current strategy is to run srp_daemon in one-shot mode early in the boot process (before LVM etc. is started) and then start srp_daemon in daemon mode later in the boot, when / is read/write.

This works except for the case when an IB port is active but does not have any SRP devices attached; if srp_daemon is run in one-shot mode, it hangs forever. Can anyone suggest any workarounds?

srp_daemon -o -e -c -n -i mlx4_0 -p 1
12/00/09 16:38:47 : bad MAD status (110) from lid 0
12/00/09 16:38:52 : bad MAD status (110) from lid 0
12/00/09 16:38:57 : bad MAD status (110) from lid 0
...

Cheers,

Guy

-- Dr.
Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From sean.hefty at intel.com Mon Jan 12 09:27:30 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 12 Jan 2009 09:27:30 -0800 Subject: [ofa-general] Shared Protection Domain for iWARP In-Reply-To: References: Message-ID: <000201c974db$0c689490$424ce984@amr.corp.intel.com> >QP1: >---- >1. rdma_create_event_channel() >2. rdma_create_id(event_channel, cm_id_1, ...) >3. rdma_resolve_addr(cm_id_1, ...) >4. rdma_resolve_route(cm_id_1, ...) >5. pd_1 = ibv_alloc_pd(cm_id_1->verbs) >6. ibv_create_comp_channel(cm_id_1->verbs) >7. ibv_create_cq(cm_id_1->verbs, ..., comp_channel, ...) >8. rdma_create_qp(cm_id_1, pd_1, init_attrs) Do you plan on connecting the QPs together? If so, you'll need rdma_connect() here. >QP2: (Variant A) >---- >2. rdma_create_id(event_channel, cm_id_2, ...) >3. rdma_resolve_addr(cm_id_2, ...) >4. rdma_resolve_route(cm_id_2, ...) > >6. ibv_create_comp_channel(cm_id_2->verbs) >7. ibv_create_cq(cm_id_2->verbs, ..., comp_channel, ...) >8. rdma_create_qp(cm_id_2, pd_1, init_attrs) > >---- >QP2: (Variant B) >---- >6. ibv_create_comp_channel(cm_id_1->verbs) >7. ibv_create_cq(cm_id_1->verbs, ..., comp_channel, ...) >8. rdma_create_qp(cm_id_1, pd_1, init_attrs) I don't know if iWarp supports peer to peer connections. If not, then you'll need to change one of the sides to act as the passive (server) side for the connection. See the rdma_cm.7 man page for a flow outline. 
- Sean From vst at vlnb.net Mon Jan 12 11:24:15 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Mon, 12 Jan 2009 22:24:15 +0300 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496687DA.6010707@harr.org> References: <48E386F6.5040502@fusionio.com> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> Message-ID: <496B98DF.4050305@vlnb.net> Cameron Harr, on 01/09/2009 02:10 AM wrote: > Vlad, > I've been working on this and can't for the life of me get the cpu > affinity working. I've modified the C file a little that you linked to > (it's pretty old), but I'm now getting a segfault. Unless you have any > other ideas, I might let this drop. Could you ask why cpu affinity doesn't work in linux-kernel at vger.kernel.org (CC me), please? The only thing I can do is to ask there myself.. > Also, I think my original question got lost in the main thread, which > was to find out why I could get double the performance exporting two > drives over SCST/SRP as opposed to raiding those two drives and > exporting them as one object. Hmm, I can't see where you asked it. I checked your original e-mail and didn't find it. 
Anyway, can you repeat your question? Unfortunately, from the above I can't understand it. Thanks, Vlad > Thanks for your help, > Cameron > > Vladislav Bolkhovitin wrote: >> Cameron Harr wrote: >>> New results, with markers. >>> ---- >>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 >>> iops=65612.40 >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >>> iops=54934.31 >>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 >>> iops=82514.57 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >>> iops=79680.42 >>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 >>> iops=60439.73 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >>> iops=51510.68 >>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 >>> iops=102735.07 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >>> iops=78558.77 >>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 >>> iops=62941.35 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >>> iops=51924.17 >>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 >>> iops=120961.39 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >>> iops=75411.52 >>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 >>> iops=50891.13 >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >>> iops=50199.90 >>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 >>> iops=58711.87 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >>> iops=74504.65 >>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=0 >>> iops=61043.73 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >>> iops=49951.89 >>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=0 >>> iops=83195.60 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >>> iops=75224.25 >>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=0 >>> iops=60277.98 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >>> 
iops=49874.57 >>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=0 >>> iops=84851.43 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >>> iops=73238.46 >> I think srptthread=0 performs worse in this case, because with it part >> of processing done in SIRQ, but seems scheduler make it be done on the >> same CPU as fct0-worker, which does the data transfer to your SSD >> device job. And this thread is always consumes about 100% CPU, so it >> has less CPU time, hence less overall performance. >> >> So, try to affine fctX-worker, SCST threads and SIRQ processing on >> different CPUs and check again. You can affine threads using utility >> from http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/, >> how to affine IRQ see Documentation/IRQ-affinity.txt in your kernel tree. >> >> Vlad >> > From alex.estrin at qlogic.com Mon Jan 12 11:42:54 2009 From: alex.estrin at qlogic.com (Alex Estrin) Date: Mon, 12 Jan 2009 13:42:54 -0600 Subject: [ofa-general] [PATCH] ipoib: failure during startup with non-default pkey set. In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624907@MNEXMB1.qlogic.org> References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624757@MNEXMB1.qlogic.org> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624907@MNEXMB1.qlogic.org> Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E746249FB@MNEXMB1.qlogic.org> Hello, Here is a new version of the patch. Major points of change: 1. Delays interface initialization until port is Active. 2. Uses pkey-table first entry for main interface. 3. Keeps old algorithm for sub-interface (keeps manually assigned p_key and adjusts UD QP p_key index if sub-interface reinitialized). Please review. 
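As a reading aid for the ipoib_multicast.c hunk of this patch: the hunk rewrites the P_Key inside the broadcast hardware address before the multicast GID is copied out of it. The sketch below mirrors that arithmetic in plain userspace C. The helper names and the IPOIB_HW_ADDR_LEN macro are hypothetical, not kernel symbols; the layout (a 4-byte prefix followed by a 16-byte GID, with the P_Key in bytes 8 and 9) is taken from the offsets used in the hunk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 4 bytes of flags/QPN followed by a 16-byte GID. */
#define IPOIB_HW_ADDR_LEN 20

/*
 * Hypothetical helper, not a kernel function: store the P_Key, with the
 * full-membership bit (0x8000) set, in bytes 8 and 9 of the broadcast
 * hardware address, high byte first. Returns the P_Key that was stored.
 */
uint16_t set_broadcast_pkey(uint8_t bcast[IPOIB_HW_ADDR_LEN], uint16_t pkey)
{
    assert(bcast != NULL);
    pkey |= 0x8000;
    bcast[8] = (uint8_t)(pkey >> 8);
    bcast[9] = (uint8_t)(pkey & 0xff);
    return pkey;
}

/* Copy out the 16-byte multicast GID that follows the 4-byte prefix. */
void broadcast_mgid(const uint8_t bcast[IPOIB_HW_ADDR_LEN], uint8_t mgid[16])
{
    assert(bcast != NULL && mgid != NULL);
    memcpy(mgid, bcast + 4, 16);
}
```

Bytes 8 and 9 of the hardware address land at offsets 4 and 5 of the GID, so a P_Key picked up from table entry 0 ends up in the broadcast group that the interface joins.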
Signed-off-by: Alex Estrin --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 784c291..cd5639c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -51,6 +51,7 @@ MODULE_PARM_DESC(data_debug_level, #endif static DEFINE_MUTEX(pkey_mutex); +static void ipoib_pkey_dev_check_presence(struct net_device *dev); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr) @@ -676,12 +677,13 @@ int ipoib_ib_dev_open(struct net_device *dev) struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) { + ipoib_pkey_dev_check_presence(dev); + + if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { ipoib_warn(priv, "P_Key 0x%04x not found\n", priv->pkey); clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); return -1; } - set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); ret = ipoib_init_qp(dev); if (ret) { @@ -719,9 +721,26 @@ int ipoib_ib_dev_open(struct net_device *dev) static void ipoib_pkey_dev_check_presence(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - u16 pkey_index = 0; + struct ib_port_attr port_attr; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + if (ib_query_port(priv->ca, priv->port, &port_attr)) { + ipoib_warn(priv, "Query port attrs failed\n"); + return; + } + + if (port_attr.state != IB_PORT_ACTIVE) + return; + + if (ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey)) { + ipoib_warn(priv, "Query P_Key table entry 0 failed\n"); + return; + } + set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); + } - if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index)) + if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); else set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags); @@ -972,7 +991,8 @@ static void 
__ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, } /* restart QP only if P_Key index is changed */ - if (test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) && + if (test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags) && + test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) && new_index == priv->pkey_index) { ipoib_dbg(priv, "Not flushing - P_Key index not changed.\n"); return; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 016a057..4d270e2 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -556,6 +556,13 @@ void ipoib_mcast_join_task(struct work_struct *work) } spin_lock_irq(&priv->lock); + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { + /* fix broadcast gid in case if pkey was changed */ + priv->pkey |= 0x8000; + priv->dev->broadcast[8] = priv->pkey >> 8; + priv->dev->broadcast[9] = priv->pkey & 0xff; + } memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4, sizeof (union ib_gid)); priv->broadcast = broadcast; -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ipoib_pkey_bootup_race_v2.patch Type: application/octet-stream Size: 3154 bytes Desc: ipoib_pkey_bootup_race_v2.patch URL: From cameron at harr.org Mon Jan 12 12:42:20 2009 From: cameron at harr.org (Cameron Harr) Date: Mon, 12 Jan 2009 13:42:20 -0700 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496B98DF.4050305@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> Message-ID: <496BAB2C.1080405@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr, on 01/09/2009 02:10 AM wrote: >> Also, I think my original question got lost in the main thread, which >> was to find out why I could get double the performance exporting two >> drives over SCST/SRP as opposed to raiding those two drives and >> exporting them as one object. > > Hmm, I can't see where you asked it. I checked your original e-mail > and didn't find it. You're absolutely right. Looking back at the original email, I didn't really say the question. I had tried debugging the issue on my own and determined interrupts were the problem and made that be my question. I never got back to asking the base question. 
> > Anyway, can you repeat your question? Unfortunately, from the above I > can't understand it. > Back when I started the thread, I had the problem where performance seemed to follow the number of devices exported instead of the performance of the devices themselves. Specifically, I had two drives on a local box. If I did a RAID 0 on those two drives and exported them, I'd get performance X. However, if I exported both of the drives as individual drives, my performance was close to 2X. Just now, running with recent code, I ran a battery of tests to verify with real numbers, and to my pleasant surprise, the issue (which had been verified by multiple people at my work) seems to have vanished. I'm now getting equal performance out of 1 2-drive RAID SRP target as I am out of 2 independent SRP targets. So, congrats - you fixed my problem! From PHF at zurich.ibm.com Mon Jan 12 14:20:19 2009 From: PHF at zurich.ibm.com (Philip Frey1) Date: Mon, 12 Jan 2009 23:20:19 +0100 Subject: [ofa-general] Shared Protection Domain for iWARP In-Reply-To: <000201c974db$0c689490$424ce984@amr.corp.intel.com> References: <000201c974db$0c689490$424ce984@amr.corp.intel.com> Message-ID: Sean, many thanks for your reply. I currently have my application up and running using a single QP on each of two hosts. What I want to do now is to establish a second connection using a second QP on each host but let both QPs access the same memory region. The above is just a toy example. In the end I want to connect my host to a couple of other hosts each with its own QP and have them all access 1 MR on the first host (star-topology). For that I need the MR to be accessible by all the remote QPs and therefore they all need to be in the same shared PD on the first host. I have tried Variant A but when I send from the MR which was registered by QP1 using QP2, I get an async event saying 'Local Work Queue Error' and the connection gets terminated. 
This happens even though QP1 and QP2 are in the same PD on the local host. Is there anything else I need to share between them, or is there a fundamental misunderstanding on my side?

Best regards,
Philip

"Sean Hefty" wrote on 01/12/2009 06:27:30 PM:

> >QP1:
> >----
> >1. rdma_create_event_channel()
> >2. rdma_create_id(event_channel, cm_id_1, ...)
> >3. rdma_resolve_addr(cm_id_1, ...)
> >4. rdma_resolve_route(cm_id_1, ...)
> >5. pd_1 = ibv_alloc_pd(cm_id_1->verbs)
> >6. ibv_create_comp_channel(cm_id_1->verbs)
> >7. ibv_create_cq(cm_id_1->verbs, ..., comp_channel, ...)
> >8. rdma_create_qp(cm_id_1, pd_1, init_attrs)
>
> Do you plan on connecting the QPs together? If so, you'll need rdma_connect()
> here.
>
> >QP2: (Variant A)
> >----
> >2. rdma_create_id(event_channel, cm_id_2, ...)
> >3. rdma_resolve_addr(cm_id_2, ...)
> >4. rdma_resolve_route(cm_id_2, ...)
> >
> >6. ibv_create_comp_channel(cm_id_2->verbs)
> >7. ibv_create_cq(cm_id_2->verbs, ..., comp_channel, ...)
> >8. rdma_create_qp(cm_id_2, pd_1, init_attrs)
> >
> >----
> >QP2: (Variant B)
> >----
> >6. ibv_create_comp_channel(cm_id_1->verbs)
> >7. ibv_create_cq(cm_id_1->verbs, ..., comp_channel, ...)
> >8. rdma_create_qp(cm_id_1, pd_1, init_attrs)
>
> I don't know if iWarp supports peer to peer connections. If not, then you'll
> need to change one of the sides to act as the passive (server) side for the
> connection. See the rdma_cm.7 man page for a flow outline.
>
> - Sean

From sean.hefty at intel.com Mon Jan 12 14:28:40 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 12 Jan 2009 14:28:40 -0800
Subject: [ofa-general] Shared Protection Domain for iWARP
In-Reply-To: References: <000201c974db$0c689490$424ce984@amr.corp.intel.com>
Message-ID: <000001c97505$1ef44120$0d5a180a@amr.corp.intel.com>

>I have tried Variant A but when I send from the MR which was registered by QP1
>using QP2,
>I get an async event saying 'Local Work Queue Error' and the connection gets
>terminated.
>This is even though, QP1 and QP2 are in the same PD on the local host. Is there
>anything else I need to
>share between them or is there a fundamental misunderstanding on my side?

I understand your example better now. Variant A is the option that you want. You don't need to create a new completion channel, or CQ even, but you will want separate rdma_cm_id's. You should be able to use the same MR on both QPs, as long as they're on the same PD.

- Sean

From cameron at harr.org Mon Jan 12 15:56:58 2009
From: cameron at harr.org (Cameron Harr)
Date: Mon, 12 Jan 2009 16:56:58 -0700
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <496B98DF.4050305@vlnb.net>
References: <48E386F6.5040502@fusionio.com> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org>
<49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> Message-ID: <496BD8CA.7050503@harr.org> Vladislav Bolkhovitin wrote: >>> I think srptthread=0 performs worse in this case, because with it >>> part of processing done in SIRQ, but seems scheduler make it be done >>> on the same CPU as fct0-worker, which does the data transfer to your >>> SSD device job. And this thread is always consumes about 100% CPU, >>> so it has less CPU time, hence less overall performance. >>> >>> So, try to affine fctX-worker, SCST threads and SIRQ processing on >>> different CPUs and check again. You can affine threads using utility >>> from >>> http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/, how >>> to affine IRQ see Documentation/IRQ-affinity.txt in your kernel tree. I ran with the two fct-worker threads pinned to cpus 7,8, the scsi_tgt threads pinned to cpus 4, 5 or 6 and irqbalance pinned on cpus 1-3. I wasn't sure if I should play with the 8 ksoftirqd procs, since there is one process per cpu. From these results, I don't see a big difference, but would still give srpt thread=1 a slight performance advantage. 
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=74990.87
type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=84005.58
type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=72369.04
type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=91147.19
type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=70463.27
type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=91755.24
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=68000.68
type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=87982.08
type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=73380.33
type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=87223.54
type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=70918.08
type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=88843.35

From rdreier at cisco.com Mon Jan 12 19:23:53 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 12 Jan 2009 19:23:53 -0800
Subject: [ofa-general] Re: [PATCH v3] ipoib: do not join broadcast group if interface is brought down
In-Reply-To: <49676F14.8050005@Voltaire.COM> (Yossi Etigin's message of "Fri, 09 Jan 2009 17:36:52 +0200")
References: <495B2C60.6020008@Voltaire.COM> <49650FC2.4070509@Voltaire.COM> <49676F14.8050005@Voltaire.COM>
Message-ID:

yes, I like this much simpler approach... applied.

From rdreier at cisco.com Mon Jan 12 19:37:08 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 12 Jan 2009 19:37:08 -0800
Subject: [ofa-general] Re: [PATCH v2] ipoib: fix a deadlock between ipoib start/stop and child interface create/delete
In-Reply-To: <495B6B8B.50803@Voltaire.COM> (Yossi Etigin's message of "Wed, 31 Dec 2008 14:54:35 +0200")
References: <495B6B8B.50803@Voltaire.COM>
Message-ID:

I think this almost works, but:

 > +	list_for_each_entry(cpriv, &priv->child_intfs, list) {
 > +		flags = cpriv->dev->flags;
 > +		new_flags = (flags & ~IFF_UP) | iffup_value;
 > +		if (flags != new_flags) {
 > +			rtnl_lock();
 > +			dev_change_flags(cpriv->dev, new_flags);
 > +			rtnl_unlock();
 > +		}
 > +	}

taking flags outside of the rtnl lock looks dubious to me, since it could change before we get to the dev_change_flags() call.

Looking at all this old code, I have to wonder whether anyone is depending on bringing up the main interface also bringing up all the subinterfaces ... the simplest solution would be to let the subinterfaces be independent. Is there anything wrong with just deleting the code to bring subinterfaces up/down?

 - R.

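[Editorial sketch, not part of the original thread.] The concern about sampling flags outside the rtnl lock can be illustrated with a userspace analogy. This is only a sketch under stated assumptions: `fake_dev`, `fake_rtnl`, `FAKE_IFF_UP` and `set_up_state()` are hypothetical names invented here, and a pthread mutex stands in for the kernel's rtnl lock. The point it shows is that the read of the current flags and the write of the new flags happen inside the same critical section, so the value cannot change between the comparison and the update.

```c
#include <pthread.h>

/* Hypothetical stand-in for IFF_UP; not the kernel definition. */
#define FAKE_IFF_UP 0x1u

/* Hypothetical stand-in for a net_device; only the flags matter here. */
struct fake_dev {
	unsigned int flags;
};

/* A pthread mutex plays the role of the rtnl lock in this analogy. */
static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;

/*
 * Set or clear the "up" bit. The current flags are read only after the
 * lock is taken, so the comparison and the update see the same value.
 * Returns 1 if the flags were changed, 0 otherwise.
 */
static int set_up_state(struct fake_dev *dev, unsigned int iffup_value)
{
	unsigned int flags, new_flags;
	int changed = 0;

	pthread_mutex_lock(&fake_rtnl);
	flags = dev->flags;
	new_flags = (flags & ~FAKE_IFF_UP) | iffup_value;
	if (flags != new_flags) {
		dev->flags = new_flags;
		changed = 1;
	}
	pthread_mutex_unlock(&fake_rtnl);

	return changed;
}
```

In the quoted hunk the equivalent change would presumably be to take the lock before reading cpriv->dev->flags; whether that is compatible with the deadlock the patch is trying to avoid is exactly the open question in the thread.
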
From rdreier at cisco.com Mon Jan 12 19:40:00 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 12 Jan 2009 19:40:00 -0800 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus Some fixes for 2.6.29-rc2: Harvey Harrison (1): RDMA/nes: Fix for NIPQUAD removal Randy Dunlap (1): IB/iser: Add dependency on INFINIBAND_ADDR_TRANS Roland Dreier (3): mlx4_core: Fix warning from min() IB/mlx4: Don't register IB device for adapters with no IB ports Merge branches 'ehca', 'ipoib', 'iser', 'mlx4' and 'nes' into for-next Stephen Rothwell (1): IB/ehca: spin_lock_irqsave() takes an unsigned long Yossi Etigin (2): IPoIB: Fix loss of connectivity after bonding failover on both sides IPoIB: Do not join broadcast group if interface is brought down drivers/infiniband/hw/ehca/ehca_main.c | 2 +- drivers/infiniband/hw/mlx4/main.c | 13 +++++-- drivers/infiniband/hw/nes/nes_cm.c | 12 +++++-- drivers/infiniband/hw/nes/nes_utils.c | 4 ++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 ++++++++++++------------ drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++ drivers/infiniband/ulp/iser/Kconfig | 2 +- drivers/net/mlx4/main.c | 4 +- 8 files changed, 47 insertions(+), 31 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c index 3b77b67..c7b8a50 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -955,7 +955,7 @@ void ehca_poll_eqs(unsigned long data) struct ehca_eq *eq = &shca->eq; int max = 3; volatile u64 q_ofs, q_ofs2; - u64 flags; + unsigned long flags; spin_lock_irqsave(&eq->spinlock, flags); q_ofs = eq->ipz_queue.current_q_offset; spin_unlock_irqrestore(&eq->spinlock, flags); diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c index dcefe1f..61588bd 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -543,14 +543,21 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) { static int mlx4_ib_version_printed; struct mlx4_ib_dev *ibdev; + int num_ports = 0; int i; - if (!mlx4_ib_version_printed) { printk(KERN_INFO "%s", mlx4_ib_version); ++mlx4_ib_version_printed; } + mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) + num_ports++; + + /* No point in registering a device with no ports... */ + if (num_ports == 0) + return NULL; + ibdev = (struct mlx4_ib_dev *) ib_alloc_device(sizeof *ibdev); if (!ibdev) { dev_err(&dev->pdev->dev, "Device struct alloc failed\n"); @@ -574,9 +581,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) ibdev->ib_dev.owner = THIS_MODULE; ibdev->ib_dev.node_type = RDMA_NODE_IB_CA; ibdev->ib_dev.local_dma_lkey = dev->caps.reserved_lkey; - ibdev->num_ports = 0; - mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) - ibdev->num_ports++; + ibdev->num_ports = num_ports; ibdev->ib_dev.phys_port_cnt = ibdev->num_ports; ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index a812db2..ca9ef3f 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -778,12 +778,13 @@ static struct nes_cm_node *find_node(struct nes_cm_core *cm_core, unsigned long flags; struct list_head *hte; struct nes_cm_node *cm_node; + __be32 tmp_addr = cpu_to_be32(loc_addr); /* get a handle on the hte */ hte = &cm_core->connected_nodes; nes_debug(NES_DBG_CM, "Searching for an owner node: %pI4:%x from core %p->%p\n", - &loc_addr, loc_port, cm_core, hte); + &tmp_addr, loc_port, cm_core, hte); /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->ht_lock, flags); @@ -816,6 +817,7 @@ static struct nes_cm_listener *find_listener(struct 
nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; + __be32 tmp_addr = cpu_to_be32(dst_addr); /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); @@ -833,7 +835,7 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); nes_debug(NES_DBG_CM, "Unable to find listener for %pI4:%x\n", - &dst_addr, dst_port); + &tmp_addr, dst_port); /* no listener */ return NULL; @@ -2059,6 +2061,7 @@ static int mini_cm_recv_pkt(struct nes_cm_core *cm_core, struct tcphdr *tcph; struct nes_cm_info nfo; int skb_handled = 1; + __be32 tmp_daddr, tmp_saddr; if (!skb) return 0; @@ -2074,8 +2077,11 @@ static int mini_cm_recv_pkt(struct nes_cm_core *cm_core, nfo.rem_addr = ntohl(iph->saddr); nfo.rem_port = ntohs(tcph->source); + tmp_daddr = cpu_to_be32(iph->daddr); + tmp_saddr = cpu_to_be32(iph->saddr); + nes_debug(NES_DBG_CM, "Received packet: dest=%pI4:0x%04X src=%pI4:0x%04X\n", - &iph->daddr, tcph->dest, &iph->saddr, tcph->source); + &tmp_daddr, tcph->dest, &tmp_saddr, tcph->source); do { cm_node = find_node(cm_core, diff --git a/drivers/infiniband/hw/nes/nes_utils.c b/drivers/infiniband/hw/nes/nes_utils.c index aa9b734..6f3bc1b 100644 --- a/drivers/infiniband/hw/nes/nes_utils.c +++ b/drivers/infiniband/hw/nes/nes_utils.c @@ -655,6 +655,7 @@ int nes_arp_table(struct nes_device *nesdev, u32 ip_addr, u8 *mac_addr, u32 acti struct nes_adapter *nesadapter = nesdev->nesadapter; int arp_index; int err = 0; + __be32 tmp_addr; for (arp_index = 0; (u32) arp_index < nesadapter->arp_table_size; arp_index++) { if (nesadapter->arp_table[arp_index].ip_addr == ip_addr) @@ -682,8 +683,9 @@ int nes_arp_table(struct nes_device *nesdev, u32 ip_addr, u8 *mac_addr, u32 acti /* DELETE or RESOLVE */ if (arp_index == nesadapter->arp_table_size) { + tmp_addr = cpu_to_be32(ip_addr); nes_debug(NES_DBG_NETDEV, "MAC for %pI4 not in ARP table - 
cannot %s\n", - &ip_addr, action == NES_ARP_RESOLVE ? "resolve" : "delete"); + &tmp_addr, action == NES_ARP_RESOLVE ? "resolve" : "delete"); return -1; } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 19e06bc..dce0443 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -711,26 +711,26 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) neigh = *to_ipoib_neigh(skb->dst->neighbour); - if (neigh->ah) - if (unlikely((memcmp(&neigh->dgid.raw, - skb->dst->neighbour->ha + 4, - sizeof(union ib_gid))) || - (neigh->dev != dev))) { - spin_lock_irqsave(&priv->lock, flags); - /* - * It's safe to call ipoib_put_ah() inside - * priv->lock here, because we know that - * path->ah will always hold one more reference, - * so ipoib_put_ah() will never do more than - * decrement the ref count. - */ + if (unlikely((memcmp(&neigh->dgid.raw, + skb->dst->neighbour->ha + 4, + sizeof(union ib_gid))) || + (neigh->dev != dev))) { + spin_lock_irqsave(&priv->lock, flags); + /* + * It's safe to call ipoib_put_ah() inside + * priv->lock here, because we know that + * path->ah will always hold one more reference, + * so ipoib_put_ah() will never do more than + * decrement the ref count. 
+ */ + if (neigh->ah) ipoib_put_ah(neigh->ah); - list_del(&neigh->list); - ipoib_neigh_free(dev, neigh); - spin_unlock_irqrestore(&priv->lock, flags); - ipoib_path_lookup(skb, dev); - return NETDEV_TX_OK; - } + list_del(&neigh->list); + ipoib_neigh_free(dev, neigh); + spin_unlock_irqrestore(&priv->lock, flags); + ipoib_path_lookup(skb, dev); + return NETDEV_TX_OK; + } if (ipoib_cm_get(neigh)) { if (ipoib_cm_up(neigh)) { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index a2eb3b9..59d02e0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -529,6 +529,9 @@ void ipoib_mcast_join_task(struct work_struct *work) if (!priv->broadcast) { struct ipoib_mcast *broadcast; + if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + return; + broadcast = ipoib_mcast_alloc(dev, 1); if (!broadcast) { ipoib_warn(priv, "failed to allocate broadcast group\n"); diff --git a/drivers/infiniband/ulp/iser/Kconfig b/drivers/infiniband/ulp/iser/Kconfig index 77dedba..b411c51 100644 --- a/drivers/infiniband/ulp/iser/Kconfig +++ b/drivers/infiniband/ulp/iser/Kconfig @@ -1,6 +1,6 @@ config INFINIBAND_ISER tristate "iSCSI Extensions for RDMA (iSER)" - depends on SCSI && INET + depends on SCSI && INET && INFINIBAND_ADDR_TRANS select SCSI_ISCSI_ATTRS ---help--- Support for the iSCSI Extensions for RDMA (iSER) Protocol diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 710c79e..6ef2490 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -912,8 +912,8 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) int i; if (msi_x) { - nreq = min(dev->caps.num_eqs - dev->caps.reserved_eqs, - num_possible_cpus() + 1); + nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs, + num_possible_cpus() + 1); entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL); if (!entries) goto no_msi; From anuj01 at gmail.com Mon Jan 12 20:41:06 2009 From: anuj01 at 
gmail.com (=?UTF-8?B?4KSF4KSo4KWB4KSc?=) Date: Tue, 13 Jan 2009 10:11:06 +0530 Subject: [ofa-general] ***SPAM*** server/client userspace application ibv_alloc_pd seg. fault Message-ID: Hi I am trying to run server client application in which server accepts two numbers from client and sends the sum of them to client. I am using ofed - 1.2 having libibverbs-1.1.1, libmthca-1.0.4 and librdmacm-1.0.1. The client exits at the ibv_alloc_pd() with segmentation fault. While I tried to create pd simply with ibv_alloc_pd() without using rdmacm, it's successful. The gdb output of client is : . . main : ibv_alloc_pd() Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 182896337024 (LWP 12634)] 0x0000002a95685ee5 in __ibv_alloc_pd (context=0x2a9578d07c) at src/verbs.c:143 143 src/verbs.c: No such file or directory. in src/verbs.c Any clues. Thanks in Advance -- Anuj Aggarwal .''`. : :Ⓐ : # apt-get install hakuna-matata `. `'` `- From dotanba at gmail.com Mon Jan 12 23:22:46 2009 From: dotanba at gmail.com (Dotan Barak) Date: Tue, 13 Jan 2009 09:22:46 +0200 Subject: [ofa-general] ***SPAM*** server/client userspace application ibv_alloc_pd seg. fault In-Reply-To: References: Message-ID: <2f3bf9a60901122322g4d6243c3ob4b7a82c656c82ba@mail.gmail.com> Maybe you are using the wrong device context... Did you open the device? You are welcome to send all of the code before calling ibv_alloc_pd, i think this will help understand what went wrong ... Dotan On Tue, Jan 13, 2009 at 6:41 AM, अनुज wrote: > Hi > > I am trying to run server client application in which server accepts > two numbers from client and sends the sum of them to client. > > I am using ofed - 1.2 having libibverbs-1.1.1, libmthca-1.0.4 and > librdmacm-1.0.1. > > The client exits at the ibv_alloc_pd() with segmentation fault. While > I tried to create pd simply with ibv_alloc_pd() without using rdmacm, > it's successful. > > The gdb output of client is : > > . > . 
> main : ibv_alloc_pd()
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 182896337024 (LWP 12634)]
> 0x0000002a95685ee5 in __ibv_alloc_pd (context=0x2a9578d07c) at src/verbs.c:143
> 143 src/verbs.c: No such file or directory.
> in src/verbs.c
>
> Any clues.
>
> Thanks in Advance
>
>
> --
> Anuj Aggarwal
>
> .''`.
> : :Ⓐ : # apt-get install hakuna-matata
> `. `'`
> `-
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From dorfman.eli at gmail.com Tue Jan 13 00:02:26 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Tue, 13 Jan 2009 10:02:26 +0200
Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags Add support for PortXmitWait counter
In-Reply-To:
References: <4950E07F.6090104@gmail.com> <20090105123422.GD1494@sashak.voltaire.com> <49620157.50003@gmail.com> <49630C47.4000302@gmail.com> <694d48600901071230x31ca8bb0l55a8245f092d633c@mail.gmail.com> <20090108050139.GB13222@sashak.voltaire.com> <4965EEDA.6030504@gmail.com>
Message-ID: <496C4A92.6060001@gmail.com>

Hal Rosenstock wrote:
> Eli.
>
> On Thu, Jan 8, 2009 at 7:17 AM, Eli Dorfman (Voltaire)
> wrote:
>> I understand that ClassPortInfo specifies whether PortXmitWait is supported.
>> It does not distinguish between get and set operation.
>> We have seen that ClassPortInfo to IS4 returns garbage.
>
> I don't have such a device. By garbage, do you mean that it responds
> to Get CPI but the contents of the CPI response specifically
> CapabilityMask field is garbage ?

IS4 device response is the same as the request.
ConnectX CA with fw 2.6.0 returns correct response, so I can test it with the CA.

> If that is the case, I don't see a generalized fix for this and only a > specific workaround being possible (like device IS4 and firmware > version checks which is pretty ugly). > > Any other ideas ? > From vlad at lists.openfabrics.org Tue Jan 13 03:18:40 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 13 Jan 2009 03:18:40 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090113-0200 daily build status Message-ID: <20090113111841.17126E61051@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with 
linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From vst at vlnb.net Tue Jan 13 03:58:27 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Tue, 13 Jan 2009 14:58:27 +0300 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496BD8CA.7050503@harr.org> References: <48E386F6.5040502@fusionio.com> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> Message-ID: <496C81E3.2050105@vlnb.net> Cameron Harr, on 01/13/2009 02:56 AM wrote: > Vladislav Bolkhovitin wrote: >>>> I think srptthread=0 performs worse in this case, because with it >>>> part of processing done in SIRQ, but seems scheduler make it be done >>>> on the same CPU as fct0-worker, which does the data transfer to your >>>> SSD device job. 
And this thread is always consumes about 100% CPU, >>>> so it has less CPU time, hence less overall performance. >>>> >>>> So, try to affine fctX-worker, SCST threads and SIRQ processing on >>>> different CPUs and check again. You can affine threads using utility >>>> from >>>> http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/, how >>>> to affine IRQ see Documentation/IRQ-affinity.txt in your kernel tree. > > I ran with the two fct-worker threads pinned to cpus 7,8, the scsi_tgt > threads pinned to cpus 4, 5 or 6 and irqbalance pinned on cpus 1-3. I > wasn't sure if I should play with the 8 ksoftirqd procs, since there is > one process per cpu. From these results, I don't see a big difference, Hmm, you sent me before the following results: type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=54934.31 type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=50199.90 type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=51510.68 type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=49951.89 type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=51924.17 type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=49874.57 type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=79680.42 type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=74504.65 type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=78558.77 type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=75224.25 type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=75411.52 type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=73238.46 I see quite a big improvement. For instance, for drives=1 scst_threads=1 srptthread=1 case it is 36%. Or, do you use different hardware, so those results can't be compared? > but would still give srpt thread=1 a slight performance advantage. At this level CPU caches starting playing essential role. 
To get the maximum performance the commands processing of each command should use the same CPU L2+ cache(s), i.e. be done on the same physical CPU, but on different cores. Most likely, affinity assigned by you was worse, than the scheduler decisions. What's your CPU configuration? Please send me the top/vmstat output during tests from the target as well as your dmesg from the target just after it's booted. > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=74990.87 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=84005.58 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=72369.04 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=91147.19 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=70463.27 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=91755.24 > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=68000.68 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=87982.08 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=73380.33 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=87223.54 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=70918.08 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=88843.35 > > > ------------------------------------------------------------------------------ > This SF.net email is sponsored by: > SourcForge Community > SourceForge wants to tell your story. 
> http://p.sf.net/sfu/sf-spreadtheword > _______________________________________________ > Scst-devel mailing list > Scst-devel at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scst-devel > From dorfman.eli at gmail.com Tue Jan 13 06:29:13 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 16:29:13 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v2 0/2] infiniband-diags Add support for PortXmitWait counter In-Reply-To: <4950E07F.6090104@gmail.com> References: <4950E07F.6090104@gmail.com> Message-ID: <496CA539.1040204@gmail.com> This patch adds support for PortXmitWait counter get and set (reset) From dorfman.eli at gmail.com Tue Jan 13 06:31:42 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 16:31:42 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v2 1/2] libibmad add PortXmitWait and CounterSelect2 to fields. In-Reply-To: <496CA539.1040204@gmail.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> Message-ID: <496CA5CE.3040804@gmail.com> add PortXmitWait counter to fields. 
add CounterSelect2 to fields to allow reset of PortXmitWait(MgtWg comment#4527) Signed-off-by: Eli Dorfman --- libibmad/include/infiniband/mad.h | 2 ++ libibmad/src/fields.c | 2 ++ 2 files changed, 4 insertions(+), 0 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 89b4be5..0a962c0 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -411,6 +411,7 @@ enum MAD_FIELDS { IB_PC_XMT_DISCARDS_F, IB_PC_ERR_XMTCONSTR_F, IB_PC_ERR_RCVCONSTR_F, + IB_PC_COUNTER_SELECT2_F, IB_PC_ERR_LOCALINTEG_F, IB_PC_ERR_EXCESS_OVR_F, IB_PC_VL15_DROPPED_F, @@ -418,6 +419,7 @@ enum MAD_FIELDS { IB_PC_RCV_BYTES_F, IB_PC_XMT_PKTS_F, IB_PC_RCV_PKTS_F, + IB_PC_XMT_WAIT_F, IB_PC_LAST_F, /* diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 50611c5..5cebd01 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -239,6 +239,7 @@ static const ib_field_t ib_mad_f [] = { [IB_PC_XMT_DISCARDS_F] {BITSOFFS(112, 16), "XmtDiscards", mad_dump_uint}, [IB_PC_ERR_XMTCONSTR_F] {BITSOFFS(128, 8), "XmtConstraintErrors", mad_dump_uint}, [IB_PC_ERR_RCVCONSTR_F] {BITSOFFS(136, 8), "RcvConstraintErrors", mad_dump_uint}, + [IB_PC_COUNTER_SELECT2_F] {BITSOFFS(144, 8), "CounterSelect2", mad_dump_uint}, [IB_PC_ERR_LOCALINTEG_F] {BITSOFFS(152, 4), "LinkIntegrityErrors", mad_dump_uint}, [IB_PC_ERR_EXCESS_OVR_F] {BITSOFFS(156, 4), "ExcBufOverrunErrors", mad_dump_uint}, [IB_PC_VL15_DROPPED_F] {BITSOFFS(176, 16), "VL15Dropped", mad_dump_uint}, @@ -246,6 +247,7 @@ static const ib_field_t ib_mad_f [] = { [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, + [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, /* * SMInfo -- 1.5.5 From dorfman.eli at gmail.com Tue Jan 13 06:32:34 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 16:32:34 +0200 Subject: ***SPAM*** 
Re: [ofa-general] [PATCH v2 2/2] infiniband-diags support PortXmitWait get and set In-Reply-To: <496CA539.1040204@gmail.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> Message-ID: <496CA602.1060600@gmail.com> support PortXmitWait get and set Signed-off-by: Eli Dorfman --- infiniband-diags/src/perfquery.c | 14 +++++++++++++- libibmad/src/gs.c | 2 ++ 2 files changed, 15 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 41a8b74..5e5b3ed 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -67,6 +67,7 @@ struct perf_count { uint32_t rcvdata; uint32_t xmtpkts; uint32_t rcvpkts; + uint32_t xmtwait; }; struct perf_count_ext { @@ -209,6 +210,8 @@ static void aggregate_perfcounters(void) aggregate_32bit(&perf_count.xmtpkts, val); mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); aggregate_32bit(&perf_count.rcvpkts, val); + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); + aggregate_32bit(&perf_count.xmtwait, val); } static void output_aggregate_perfcounters(ib_portid_t *portid) @@ -235,6 +238,7 @@ static void output_aggregate_perfcounters(ib_portid_t *portid) mad_encode_field(pc, IB_PC_RCV_BYTES_F, &perf_count.rcvdata); mad_encode_field(pc, IB_PC_XMT_PKTS_F, &perf_count.xmtpkts); mad_encode_field(pc, IB_PC_RCV_PKTS_F, &perf_count.rcvpkts); + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); @@ -298,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p if (extended != 1) { if (!port_performance_query(pc, portid, port, timeout)) IBERROR("perfquery"); + if (!(cap_mask & 0x1000)) { + /* if PortCounters:PortXmitWait not suppported clear this counter */ + perf_count.xmtwait = 0; + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); + } if (aggregate) aggregate_perfcounters(); - else + else mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } 
else { if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ @@ -500,6 +509,9 @@ main(int argc, char **argv) do_reset: + if (!extended && (cap_mask & 0x1000)) + mask |= (1<<16); /* reset portxmitwait */ + if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { for (i = start_port; i <= num_ports; i++) reset_counters(extended, timeout, mask, &portid, i); diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index d350c0d..30f00fb 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -142,6 +142,8 @@ performance_reset_via(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, /* Same for attribute IDs */ mad_set_field(rcvbuf, 0, IB_PC_PORT_SELECT_F, port); mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT_F, mask); + mask = mask >> 16; + mas_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT2_F, mask); rpc.attr.mod = 0; rpc.timeout = timeout; rpc.datasz = IB_PC_DATA_SZ; -- 1.5.5 From dorfman.eli at gmail.com Tue Jan 13 06:56:57 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 16:56:57 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v3 2/2] infiniband-diags support PortXmitWait get and set In-Reply-To: <496CA602.1060600@gmail.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> <496CA602.1060600@gmail.com> Message-ID: <496CABB9.4030402@gmail.com> support PortXmitWait get and set fix syntax error Signed-off-by: Eli Dorfman --- infiniband-diags/src/perfquery.c | 14 +++++++++++++- libibmad/src/gs.c | 2 ++ 2 files changed, 15 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c index 41a8b74..5e5b3ed 100644 --- a/infiniband-diags/src/perfquery.c +++ b/infiniband-diags/src/perfquery.c @@ -67,6 +67,7 @@ struct perf_count { uint32_t rcvdata; uint32_t xmtpkts; uint32_t rcvpkts; + uint32_t xmtwait; }; struct perf_count_ext { @@ -209,6 +210,8 @@ static void aggregate_perfcounters(void) aggregate_32bit(&perf_count.xmtpkts, val); 
mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); aggregate_32bit(&perf_count.rcvpkts, val); + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); + aggregate_32bit(&perf_count.xmtwait, val); } static void output_aggregate_perfcounters(ib_portid_t *portid) @@ -235,6 +238,7 @@ static void output_aggregate_perfcounters(ib_portid_t *portid) mad_encode_field(pc, IB_PC_RCV_BYTES_F, &perf_count.rcvdata); mad_encode_field(pc, IB_PC_XMT_PKTS_F, &perf_count.xmtpkts); mad_encode_field(pc, IB_PC_RCV_PKTS_F, &perf_count.rcvpkts); + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); @@ -298,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p if (extended != 1) { if (!port_performance_query(pc, portid, port, timeout)) IBERROR("perfquery"); + if (!(cap_mask & 0x1000)) { + /* if PortCounters:PortXmitWait not suppported clear this counter */ + perf_count.xmtwait = 0; + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); + } if (aggregate) aggregate_perfcounters(); - else + else mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); } else { if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ @@ -500,6 +509,9 @@ main(int argc, char **argv) do_reset: + if (!extended && (cap_mask & 0x1000)) + mask |= (1<<16); /* reset portxmitwait */ + if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { for (i = start_port; i <= num_ports; i++) reset_counters(extended, timeout, mask, &portid, i); diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index d350c0d..30f00fb 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -142,6 +142,8 @@ performance_reset_via(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, /* Same for attribute IDs */ mad_set_field(rcvbuf, 0, IB_PC_PORT_SELECT_F, port); mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT_F, mask); + mask = mask >> 16; + mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT2_F, mask); rpc.attr.mod = 
0; rpc.timeout = timeout; rpc.datasz = IB_PC_DATA_SZ; -- 1.5.5 From monis at Voltaire.COM Tue Jan 13 07:44:30 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 13 Jan 2009 17:44:30 +0200 Subject: [ofa-general] Resending patches for 2.6.29 Message-ID: <496CB6DE.5090100@Voltaire.COM> Hi, I am resending the following patches for review. * mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE * ib_mthca: Fix dispatch of IB_EVENT_LID_CHANGE * IPoIB: refresh paths that might be invalid They are relative to the for-next branch in your tree. Thanks, MoniS From monis at Voltaire.COM Tue Jan 13 07:49:10 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 13 Jan 2009 17:49:10 +0200 Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <496CB6DE.5090100@Voltaire.COM> References: <496CB6DE.5090100@Voltaire.COM> Message-ID: <496CB7F6.7060204@Voltaire.COM> When snooping a portinfo MAD, its client_reregister bit is checked. If the bit is ON then a CLIENT_REREGISTER event is dispatched, otherwise a LID_CHANGE event is dispatched. This decision logic ignores the cases where the MAD changes the LID along with an instruction to reregister (so a necessary LID_CHANGE event won't be dispatched) or the MAD does neither (and an unnecessary LID_CHANGE event will be dispatched). This patch dispatches a CLIENT_REREGISTER event if the client_reregister bit is set. In addition, the patch compares the LID in the MAD to the current LID. A LID_CHANGE event is dispatched if and only if they are not identical.
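The decision described above can be modelled in a few lines of user-space C. This is only a sketch, not the driver code: the EV_* values and the function name portinfo_events are made up for illustration and are not kernel identifiers. The 0x80 mask is the client_reregister bit tested in the patch, and both LIDs are assumed to be in host byte order already.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's IB_EVENT_* values. */
enum {
	EV_NONE              = 0,
	EV_CLIENT_REREGISTER = 1 << 0,
	EV_LID_CHANGE        = 1 << 1,
};

/*
 * Return the set of events to dispatch for a PortInfo Set.
 * clientrereg_byte mirrors pinfo->clientrereg_resv_subnetto (bit 7 is
 * the reregister bit), prev_lid is the LID read before the MAD was
 * applied, new_lid is the LID carried in the MAD.
 */
static unsigned int portinfo_events(uint8_t clientrereg_byte,
				    uint16_t prev_lid, uint16_t new_lid)
{
	unsigned int events = EV_NONE;

	/* The two conditions are independent, so both may fire. */
	if (clientrereg_byte & 0x80)
		events |= EV_CLIENT_REREGISTER;
	if (prev_lid != new_lid)
		events |= EV_LID_CHANGE;

	return events;
}
```

With the old either/or logic, the third and fourth cases in the test below (reregister plus LID change, and neither) are the ones that were handled incorrectly.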
From: Moni Shoua Signed-off-by: Moni Shoua Signed-off-by: Jack Morgenstein Signed-off-by: Yossi Etigin --- drivers/infiniband/hw/mlx4/mad.c | 25 +++++++++++++++++++------ 1 files changed, 19 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c index 606f1e2..0f0d945 100644 --- a/drivers/infiniband/hw/mlx4/mad.c +++ b/drivers/infiniband/hw/mlx4/mad.c @@ -147,7 +147,8 @@ static void update_sm_ah(struct mlx4_ib_dev *dev, u8 port_num, u16 lid, u8 sl) * Snoop SM MADs for port info and P_Key table sets, so we can * synthesize LID change and P_Key change events. */ -static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad) +static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad, + u16 prev_lid) { struct ib_event event; @@ -157,6 +158,7 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad) if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { struct ib_port_info *pinfo = (struct ib_port_info *) ((struct ib_smp *) mad)->data; + u16 lid = be16_to_cpu(pinfo->lid); update_sm_ah(to_mdev(ibdev), port_num, be16_to_cpu(pinfo->sm_lid), @@ -165,12 +167,15 @@ static void smp_snoop(struct ib_device *ibdev, u8 port_num, struct ib_mad *mad) event.device = ibdev; event.element.port_num = port_num; - if (pinfo->clientrereg_resv_subnetto & 0x80) + if (pinfo->clientrereg_resv_subnetto & 0x80) { event.event = IB_EVENT_CLIENT_REREGISTER; - else + ib_dispatch_event(&event); + } + if (prev_lid != lid) { event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } - ib_dispatch_event(&event); } if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { @@ -228,8 +233,9 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, struct ib_wc *in_wc, struct ib_grh *in_grh, struct ib_mad *in_mad, struct ib_mad *out_mad) { - u16 slid; + u16 slid, prev_lid = 0; int err; + struct ib_port_attr pattr; slid = in_wc ? 
in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE); @@ -263,6 +269,13 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, } else return IB_MAD_RESULT_SUCCESS; + if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + in_mad->mad_hdr.method == IB_MGMT_METHOD_SET && + in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO && + !ib_query_port(ibdev, port_num, &pattr)) + prev_lid = pattr.lid; + err = mlx4_MAD_IFC(to_mdev(ibdev), mad_flags & IB_MAD_IGNORE_MKEY, mad_flags & IB_MAD_IGNORE_BKEY, @@ -271,7 +284,7 @@ int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, return IB_MAD_RESULT_FAILURE; if (!out_mad->mad_hdr.status) { - smp_snoop(ibdev, port_num, in_mad); + smp_snoop(ibdev, port_num, in_mad, prev_lid); node_desc_override(ibdev, out_mad); } -- 1.5.5 From monis at Voltaire.COM Tue Jan 13 07:50:14 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 13 Jan 2009 17:50:14 +0200 Subject: [ofa-general] [PATCH] ib_mthca: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <496CB6DE.5090100@Voltaire.COM> References: <496CB6DE.5090100@Voltaire.COM> Message-ID: <496CB836.50503@Voltaire.COM> When snooping a portinfo MAD, its client_reregister bit is checked. If the bit is ON then a CLIENT_REREGISTER event is dispatched, otherwise a LID_CHANGE event is dispatched. This decision logic ignores the cases where the MAD changes the LID along with an instruction to reregister (so a necessary LID_CHANGE event won't be dispatched) or the MAD does neither (and an unnecessary LID_CHANGE event will be dispatched). This patch dispatches a CLIENT_REREGISTER event if the client_reregister bit is set. In addition, the patch compares the LID in the MAD to the current LID. A LID_CHANGE event is dispatched if and only if they are not identical.
From: Moni Shoua Signed-off-by: Moni Shoua Signed-off-by: Jack Morgenstein Signed-off-by: Yossi Etigin --- drivers/infiniband/hw/mthca/mthca_mad.c | 23 ++++++++++++++++++----- 1 files changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_mad.c b/drivers/infiniband/hw/mthca/mthca_mad.c index 6404495..107761d 100644 --- a/drivers/infiniband/hw/mthca/mthca_mad.c +++ b/drivers/infiniband/hw/mthca/mthca_mad.c @@ -104,7 +104,8 @@ static void update_sm_ah(struct mthca_dev *dev, */ static void smp_snoop(struct ib_device *ibdev, u8 port_num, - struct ib_mad *mad) + struct ib_mad *mad, + u16 prev_lid) { struct ib_event event; @@ -114,6 +115,7 @@ static void smp_snoop(struct ib_device *ibdev, if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO) { struct ib_port_info *pinfo = (struct ib_port_info *) ((struct ib_smp *) mad)->data; + u16 lid = be16_to_cpu(pinfo->lid); mthca_update_rate(to_mdev(ibdev), port_num); update_sm_ah(to_mdev(ibdev), port_num, @@ -123,12 +125,15 @@ static void smp_snoop(struct ib_device *ibdev, event.device = ibdev; event.element.port_num = port_num; - if (pinfo->clientrereg_resv_subnetto & 0x80) + if (pinfo->clientrereg_resv_subnetto & 0x80) { event.event = IB_EVENT_CLIENT_REREGISTER; - else + ib_dispatch_event(&event); + } + if (prev_lid != lid) { event.event = IB_EVENT_LID_CHANGE; + ib_dispatch_event(&event); + } - ib_dispatch_event(&event); } if (mad->mad_hdr.attr_id == IB_SMP_ATTR_PKEY_TABLE) { @@ -196,6 +201,8 @@ int mthca_process_mad(struct ib_device *ibdev, int err; u8 status; u16 slid = in_wc ? 
in_wc->slid : be16_to_cpu(IB_LID_PERMISSIVE); + u16 prev_lid = 0; + struct ib_port_attr pattr; /* Forward locally generated traps to the SM */ if (in_mad->mad_hdr.method == IB_MGMT_METHOD_TRAP && @@ -233,6 +240,12 @@ int mthca_process_mad(struct ib_device *ibdev, return IB_MAD_RESULT_SUCCESS; } else return IB_MAD_RESULT_SUCCESS; + if ((in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED || + in_mad->mad_hdr.mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) && + in_mad->mad_hdr.method == IB_MGMT_METHOD_SET && + in_mad->mad_hdr.attr_id == IB_SMP_ATTR_PORT_INFO && + !ib_query_port(ibdev, port_num, &pattr)) + prev_lid = pattr.lid; err = mthca_MAD_IFC(to_mdev(ibdev), mad_flags & IB_MAD_IGNORE_MKEY, @@ -252,7 +265,7 @@ int mthca_process_mad(struct ib_device *ibdev, } if (!out_mad->mad_hdr.status) { - smp_snoop(ibdev, port_num, in_mad); + smp_snoop(ibdev, port_num, in_mad, prev_lid); node_desc_override(ibdev, out_mad); } -- 1.5.5 From monis at Voltaire.COM Tue Jan 13 07:51:56 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Tue, 13 Jan 2009 17:51:56 +0200 Subject: [ofa-general] [PATCH] IPoIB: refresh paths that might be invalid In-Reply-To: <496CB6DE.5090100@Voltaire.COM> References: <496CB6DE.5090100@Voltaire.COM> Message-ID: <496CB89C.7020108@Voltaire.COM> If a standby SM takes over and only some of the nodes change their LID as a result, the other nodes get an IPOIB_FLUSH_LIGHT event that doesn't flush paths but only marks them as possibly invalid. Path refresh will happen only after an ARP probe, which may take some time (tens of seconds). This patch adds a task that restarts the lookup of possibly invalid paths on two occasions: when handling an IPOIB_FLUSH_LIGHT event and when a path record completion returns with a bad status.
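The refresh scheme described above can be sketched as a small user-space model. Everything here is hypothetical and for illustration only: struct path_model and the three functions are not IPoIB identifiers, there is no locking or work queue, and "restarting a lookup" is reduced to bumping a counter. The sketch assumes a path is flagged stale both on a light flush and on a failed path record query, and that restarting the lookup clears the flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a cached path: a stale flag and a lookup counter. */
struct path_model {
	int stale;    /* non-zero if the path may be invalid */
	int lookups;  /* how many times a lookup was (re)started */
};

/* Light flush: keep every path, but flag all of them for refresh. */
static void mark_paths_stale(struct path_model *paths, size_t n)
{
	for (size_t i = 0; i < n; i++)
		paths[i].stale = 1;
}

/* A path record query completed with a bad status. */
static void path_query_failed(struct path_model *path)
{
	path->stale = 1;
}

/*
 * Body of the refresh task: restart the lookup for every stale path
 * and clear its flag. Returns the number of lookups restarted.
 */
static size_t refresh_paths(struct path_model *paths, size_t n)
{
	size_t restarted = 0;

	for (size_t i = 0; i < n; i++) {
		if (!paths[i].stale)
			continue;
		paths[i].lookups++;
		paths[i].stale = 0;
		restarted++;
	}
	return restarted;
}
```

A second call to refresh_paths() right after the first one does nothing, which is the property that keeps the periodic task cheap when no path is stale.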
Signed-off-by: Moni Shoua --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +++- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 2 - drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 ++++++++++++++++++++++++++---- 3 files changed, 38 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 753a983..89f574c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -298,6 +298,7 @@ struct ipoib_dev_priv { struct work_struct flush_heavy; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct delayed_work path_refresh_task; struct ib_device *ca; u8 port; @@ -378,7 +379,7 @@ struct ipoib_path { struct rb_node rb_node; struct list_head list; - int valid; + u8 stale; }; struct ipoib_neigh { @@ -442,8 +443,9 @@ int ipoib_add_umcast_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(struct work_struct *work); +void ipoib_refresh_paths(struct work_struct *work); -void ipoib_mark_paths_invalid(struct net_device *dev); +void ipoib_mark_paths_stale(struct net_device *dev); void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index a192581..1a50076 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -962,7 +962,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, } if (level == IPOIB_FLUSH_LIGHT) { - ipoib_mark_paths_invalid(dev); + ipoib_mark_paths_stale(dev); ipoib_mcast_dev_flush(dev); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index dce0443..c4e227a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -352,7 +352,7 @@ void ipoib_path_iter_read(struct 
ipoib_path_iter *iter, #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ -void ipoib_mark_paths_invalid(struct net_device *dev) +void ipoib_mark_paths_stale(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path, *tp; @@ -360,11 +360,15 @@ void ipoib_mark_paths_invalid(struct net_device *dev) spin_lock_irq(&priv->lock); list_for_each_entry_safe(path, tp, &priv->path_list, list) { - ipoib_dbg(priv, "mark path LID 0x%04x GID %pI6 invalid\n", + ipoib_dbg(priv, "mark path LID 0x%04x GID %pI6 stale\n", be16_to_cpu(path->pathrec.dlid), path->pathrec.dgid.raw); - path->valid = 0; + path->stale = 1; } + if (!list_empty(&priv->path_list)) + queue_delayed_work(ipoib_workqueue, &priv->path_refresh_task, + round_jiffies_relative(HZ)); + spin_unlock_irq(&priv->lock); } @@ -427,6 +431,10 @@ static void path_rec_completion(int status, if (!ib_init_ah_from_path(priv->ca, priv->port, pathrec, &av)) ah = ipoib_create_ah(dev, priv->pd, &av); + } else { + path->stale = 1; + queue_delayed_work(ipoib_workqueue, &priv->path_refresh_task, + round_jiffies_relative(HZ)); } spin_lock_irqsave(&priv->lock, flags); @@ -477,7 +485,6 @@ static void path_rec_completion(int status, while ((skb = __skb_dequeue(&neigh->queue))) __skb_queue_tail(&skqueue, skb); } - path->valid = 1; } path->query = NULL; @@ -551,9 +558,29 @@ static int path_rec_start(struct net_device *dev, return path->query_id; } + path->stale = 0; return 0; } +void ipoib_refresh_paths(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, path_refresh_task.work); + struct net_device *dev = priv->dev; + struct ipoib_path *path, *tp; + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(path, tp, &priv->path_list, list) { + ipoib_dbg(priv, "restart path LID 0x%04x GID %pI6\n", + be16_to_cpu(path->pathrec.dlid), + path->pathrec.dgid.raw); + if (path->stale) + path_rec_start(dev, path); + } + + spin_unlock_irq(&priv->lock); +} + static void
neigh_add_path(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -656,7 +683,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, spin_lock_irqsave(&priv->lock, flags); path = __path_find(dev, phdr->hwaddr + 4); - if (!path || !path->valid) { + if (!path) { if (!path) path = path_rec_create(dev, phdr->hwaddr + 4); if (path) { @@ -1070,6 +1097,7 @@ static void ipoib_setup(struct net_device *dev) INIT_WORK(&priv->flush_heavy, ipoib_ib_dev_flush_heavy); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); + INIT_DELAYED_WORK(&priv->path_refresh_task, ipoib_refresh_paths); } struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) From dorfman.eli at gmail.com Tue Jan 13 07:58:28 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 17:58:28 +0200 Subject: [ofa-general] ***SPAM*** [PATCH v2 0/2] opensm: Add new partition keyword for all hca, switches and routers In-Reply-To: <20090107163521.GH11759@sashak.voltaire.com> References: <495CE3F8.9080506@gmail.com> <20090107163521.GH11759@sashak.voltaire.com> Message-ID: <496CBA24.2080902@gmail.com> Add new partition keywords for all hca, switches and routers From dorfman.eli at gmail.com Tue Jan 13 08:00:17 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 18:00:17 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v2 1/2] opensm: Add new partition keyword for all hca, switches and routers In-Reply-To: <496CBA24.2080902@gmail.com> References: <495CE3F8.9080506@gmail.com> <20090107163521.GH11759@sashak.voltaire.com> <496CBA24.2080902@gmail.com> Message-ID: <496CBA91.9000008@gmail.com> Add new partition keyword for node type groups. The following new keywords were added: 'ALL_CAS' means all CA end ports in the subnet. 
'ALL_SWITCHES' means all Switch end ports in the subnet 'ALL_ROUTERS' means all Router end ports in the subnet For example, to allow firmware upgrade within managed switches we need that all switch port 0 to have full membership. Signed-off-by: Eli Dorfman --- opensm/opensm/osm_prtn.c | 15 +++++++++------ opensm/opensm/osm_prtn_config.c | 13 +++++++++++-- 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/opensm/opensm/osm_prtn.c b/opensm/opensm/osm_prtn.c index be51410..8b9301e 100644 --- a/opensm/opensm/osm_prtn.c +++ b/opensm/opensm/osm_prtn.c @@ -135,7 +135,7 @@ ib_api_status_t osm_prtn_add_port(osm_log_t * p_log, osm_subn_t * p_subn, } ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, - osm_prtn_t * p, boolean_t full) + osm_prtn_t * p, uint8_t type, boolean_t full) { cl_qmap_t *p_port_tbl = &p_subn->port_guid_tbl; cl_map_item_t *p_item; @@ -146,10 +146,13 @@ ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, while (p_item != cl_qmap_end(p_port_tbl)) { p_port = (osm_port_t *) p_item; p_item = cl_qmap_next(p_item); - status = osm_prtn_add_port(p_log, p_subn, p, - osm_port_get_guid(p_port), full); - if (status != IB_SUCCESS) - goto _err; + if (type == 0xff || + (osm_node_get_type(p_port->p_node) == type)) { + status = osm_prtn_add_port(p_log, p_subn, p, + osm_port_get_guid(p_port), full); + if (status != IB_SUCCESS) + goto _err; + } } _err: @@ -325,7 +328,7 @@ static ib_api_status_t osm_prtn_make_default(osm_log_t * const p_log, IB_DEFAULT_PARTIAL_PKEY); if (!p) goto _err; - status = osm_prtn_add_all(p_log, p_subn, p, no_config); + status = osm_prtn_add_all(p_log, p_subn, p, 0xff, no_config); if (status != IB_SUCCESS) goto _err; cl_map_remove(&p->part_guid_tbl, p_subn->sm_port_guid); diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c index 9511608..cb67377 100644 --- a/opensm/opensm/osm_prtn_config.c +++ b/opensm/opensm/osm_prtn_config.c @@ -64,7 +64,7 @@ extern osm_prtn_t 
*osm_prtn_make_new(osm_log_t * p_log, osm_subn_t * p_subn, const char *name, uint16_t pkey); extern ib_api_status_t osm_prtn_add_all(osm_log_t * p_log, osm_subn_t * p_subn, - osm_prtn_t * p, boolean_t full); + osm_prtn_t * p, uint8_t type, boolean_t full); extern ib_api_status_t osm_prtn_add_port(osm_log_t * p_log, osm_subn_t * p_subn, osm_prtn_t * p, ib_net64_t guid, boolean_t full); @@ -212,7 +212,16 @@ static int partition_add_port(unsigned lineno, struct part_conf *conf, if (!strncmp(name, "ALL", strlen(name))) { return osm_prtn_add_all(conf->p_log, conf->p_subn, p, - full) == IB_SUCCESS ? 0 : -1; + 0xff, full) == IB_SUCCESS ? 0 : -1; + } else if (!strncmp(name, "ALL_CAS", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + IB_NODE_TYPE_CA, full) == IB_SUCCESS ? 0 : -1; + } else if (!strncmp(name, "ALL_SWITCHES", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + IB_NODE_TYPE_SWITCH, full) == IB_SUCCESS ? 0 : -1; + } else if (!strncmp(name, "ALL_ROUTERS", strlen(name))) { + return osm_prtn_add_all(conf->p_log, conf->p_subn, p, + IB_NODE_TYPE_ROUTER, full) == IB_SUCCESS ? 
0 : -1; } else if (!strncmp(name, "SELF", strlen(name))) { guid = cl_ntoh64(conf->p_subn->sm_port_guid); } else { -- 1.5.5 From dorfman.eli at gmail.com Tue Jan 13 08:01:19 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 13 Jan 2009 18:01:19 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v2 2/2] docs update documentation about new partition keywords In-Reply-To: <496CBA24.2080902@gmail.com> References: <495CE3F8.9080506@gmail.com> <20090107163521.GH11759@sashak.voltaire.com> <496CBA24.2080902@gmail.com> Message-ID: <496CBACF.7030907@gmail.com> Update documentation about the new partition keywords 'ALL_CAS', 'ALL_SWITCHES', 'ALL_ROUTERS' Signed-off-by: Eli Dorfman --- opensm/doc/partition-config.txt | 5 ++++- opensm/man/opensm.8.in | 5 ++++- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/opensm/doc/partition-config.txt b/opensm/doc/partition-config.txt index 602cc66..5fcc317 100644 --- a/opensm/doc/partition-config.txt +++ b/opensm/doc/partition-config.txt @@ -67,6 +67,9 @@ full or - indicates full or limited membership for this port. When There are two useful keywords for PortGUID definition: - 'ALL' means all end ports in this subnet. +- 'ALL_CAS' means all Channel Adapter end ports in this subnet. +- 'ALL_SWITCHES' means all Switch end ports in this subnet. +- 'ALL_ROUTERS' means all Router end ports in this subnet. - 'SELF' means subnet manager's port. Empty list means no ports in this partition. @@ -92,7 +95,7 @@ different PKey values will be generated for those definitions).
Examples: -------- -Default=0x7fff : ALL, SELF=full ; +Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ; NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ; diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index eedd317..c8f27f5 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -445,6 +445,9 @@ PortGUIDs list: There are two useful keywords for PortGUID definition: - 'ALL' means all end ports in this subnet. + - 'ALL_CAS' means all Channel Adapter end ports in this subnet. + - 'ALL_SWITCHES' means all Switch end ports in this subnet. + - 'ALL_ROUTERS' means all Router end ports in this subnet. - 'SELF' means subnet manager's port. Empty list means no ports in this partition. @@ -466,7 +469,7 @@ different PKey values will be generated for those definitions). Examples: - Default=0x7fff : ALL, SELF=full ; + Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ; NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ; -- 1.5.5 From cameron at harr.org Tue Jan 13 08:42:59 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 13 Jan 2009 09:42:59 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496C81E3.2050105@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> 
<49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> Message-ID: <496CC493.3040207@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr, on 01/13/2009 02:56 AM wrote: >> Vladislav Bolkhovitin wrote: >>>>> I think srptthread=0 performs worse in this case, because with it >>>>> part of processing done in SIRQ, but seems scheduler make it be >>>>> done on the same CPU as fct0-worker, which does the data transfer >>>>> to your SSD device job. And this thread is always consumes about >>>>> 100% CPU, so it has less CPU time, hence less overall performance. >>>>> >>>>> So, try to affine fctX-worker, SCST threads and SIRQ processing on >>>>> different CPUs and check again. You can affine threads using >>>>> utility from >>>>> http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/, >>>>> how to affine IRQ see Documentation/IRQ-affinity.txt in your >>>>> kernel tree. >> >> I ran with the two fct-worker threads pinned to cpus 7,8, the >> scsi_tgt threads pinned to cpus 4, 5 or 6 and irqbalance pinned on >> cpus 1-3. I wasn't sure if I should play with the 8 ksoftirqd procs, >> since there is one process per cpu. 
From these results, I don't see a >> big difference, > > Hmm, you sent me before the following results: > > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 > iops=54934.31 > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 > iops=50199.90 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 > iops=51510.68 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 > iops=49951.89 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 > iops=51924.17 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 > iops=49874.57 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 > iops=79680.42 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 > iops=74504.65 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 > iops=78558.77 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 > iops=75224.25 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 > iops=75411.52 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 > iops=73238.46 > > I see quite a big improvement. For instance, for drives=1 > scst_threads=1 srptthread=1 case it is 36%. Or, do you use different > hardware, so those results can't be compared? Vlad, you've got a good eye. Unfortunately, those results can't really be compared because I believe the previous results were intentionally run in a worse-case performance scenario. 
However I did run no-affinity runs before the affinity runs and would say performance increase is variable and somewhat inconclusive: type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=76724.08 type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=91318.28 type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=60374.94 type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=91618.18 type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=63076.21 type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=92251.24 type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=50539.96 type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=57884.80 type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=54502.85 type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=93230.44 type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=55941.89 type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=94480.92 > >> but would still give srpt thread=1 a slight performance advantage. > > At this level CPU caches starting playing essential role. To get the > maximum performance the commands processing of each command should use > the same CPU L2+ cache(s), i.e. be done on the same physical CPU, but > on different cores. Most likely, affinity assigned by you was worse, > than the scheduler decisions. What's your CPU configuration? Please > send me the top/vmstat output during tests from the target as well as > your dmesg from the target just after it's booted. My CPU config on the target (where I did the affinity) is 2 quad-core Xeon E5440 @ 2.83GHz. I didn't have my script configured to dump top and vmstat, so here's data from a rerun (and I have attached requested info). I'm not sure what is accounting for the spike at the beginning, but it seems consistent. 
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=104699.43 type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=133928.98 type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=82736.73 type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=82221.42 type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=70203.53 type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=85628.45 type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=75646.90 type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=87124.32 type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=74545.84 type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=88348.71 type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=71837.15 type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=84387.22 -------------- next part -------------- A non-text attachment was scrubbed... Name: dmesg.out Type: application/octet-stream Size: 34008 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: top.target.bz2 Type: application/octet-stream Size: 72060 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: vmstat.target.bz2 Type: application/octet-stream Size: 15983 bytes Desc: not available URL: From gmpc at sanger.ac.uk Tue Jan 13 09:09:40 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Tue, 13 Jan 2009 17:09:40 +0000 Subject: [ofa-general] OFED 1.4 packages for debian Message-ID: <496CCAD4.9090906@sanger.ac.uk> Hi all, I have made a set of OFED 1.4 InfiniBand packages for Debian sid. The packages contain the utilities, libraries and kernel modules not currently packaged for Debian.
ftp://ftp.sanger.ac.uk/pub/gmpc/repository/infiniband/ sources.list entry: deb ftp://ftp.sanger.ac.uk/pub/gmpc/repository/infiniband ./ Release notes are here: http://www.sanger.ac.uk/Users/gmpc/infiniband.html The packages "work for me (TM)", but I would appreciate some wider testing before I get the packages pushed upstream. Any comments, bugfixes or suggestions more than welcome. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From HNGUYEN at de.ibm.com Tue Jan 13 09:24:59 2009 From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen) Date: Tue, 13 Jan 2009 18:24:59 +0100 Subject: [ofa-general] Re: [PATCH 3/13] drivers/infiniband/hw/ehca: Remove redundant test In-Reply-To: Message-ID: Thanks Nam Roland Dreier wrote on 21.12.2008 22:27:57: > Thanks, applied for 2.6.29 (ehca guys added to cc list just to make sure > this is OK). > > - R. From rdreier at cisco.com Tue Jan 13 09:52:43 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jan 2009 09:52:43 -0800 Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <496CB7F6.7060204@Voltaire.COM> (Moni Shoua's message of "Tue, 13 Jan 2009 17:49:10 +0200") References: <496CB6DE.5090100@Voltaire.COM> <496CB7F6.7060204@Voltaire.COM> Message-ID: Does this actually fix any problems that a user sees? If not I'm inclined to wait for 2.6.30. 
From rdreier at cisco.com Tue Jan 13 09:54:29 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 13 Jan 2009 09:54:29 -0800 Subject: [ofa-general] Re: [PATCH] IPoIB: refresh paths that might be invalid In-Reply-To: <496CB89C.7020108@Voltaire.COM> (Moni Shoua's message of "Tue, 13 Jan 2009 17:51:56 +0200") References: <496CB6DE.5090100@Voltaire.COM> <496CB89C.7020108@Voltaire.COM> Message-ID: I don't like this complexity, since waiting for an ARP probe is the standard way of revalidating paths. Wouldn't it be better just to lower the ARP refresh time if this is important to someone? - R. From vst at vlnb.net Tue Jan 13 10:08:03 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Tue, 13 Jan 2009 21:08:03 +0300 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496CC493.3040207@harr.org> References: <48E386F6.5040502@fusionio.com> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> Message-ID: <496CD883.8040906@vlnb.net> Cameron Harr, on 01/13/2009 07:42 PM wrote: > Vladislav Bolkhovitin wrote: >> Cameron Harr, on 01/13/2009 02:56 AM wrote: >>> Vladislav Bolkhovitin wrote: >>>>>> I think srptthread=0 performs 
worse in this case, because with it >>>>>> part of processing done in SIRQ, but seems scheduler make it be >>>>>> done on the same CPU as fct0-worker, which does the data transfer >>>>>> to your SSD device job. And this thread is always consumes about >>>>>> 100% CPU, so it has less CPU time, hence less overall performance. >>>>>> >>>>>> So, try to affine fctX-worker, SCST threads and SIRQ processing on >>>>>> different CPUs and check again. You can affine threads using >>>>>> utility from >>>>>> http://www.kernel.org/pub/linux/kernel/people/rml/cpu-affinity/, >>>>>> how to affine IRQ see Documentation/IRQ-affinity.txt in your >>>>>> kernel tree. >>> I ran with the two fct-worker threads pinned to cpus 7,8, the >>> scsi_tgt threads pinned to cpus 4, 5 or 6 and irqbalance pinned on >>> cpus 1-3. I wasn't sure if I should play with the 8 ksoftirqd procs, >>> since there is one process per cpu. From these results, I don't see a >>> big difference, >> Hmm, you sent me before the following results: >> >> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >> iops=54934.31 >> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >> iops=50199.90 >> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >> iops=51510.68 >> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >> iops=49951.89 >> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >> iops=51924.17 >> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >> iops=49874.57 >> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >> iops=79680.42 >> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >> iops=74504.65 >> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >> iops=78558.77 >> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >> iops=75224.25 >> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >> iops=75411.52 >> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >> iops=73238.46 >> >> I see quite a big improvement. 
For instance, for drives=1 >> scst_threads=1 srptthread=1 case it is 36%. Or, do you use different >> hardware, so those results can't be compared? > Vlad, you've got a good eye. Unfortunately, those results can't really > be compared because I believe the previous results were intentionally > run in a worse-case performance scenario. However I did run no-affinity > runs before the affinity runs and would say performance increase is > variable and somewhat inconclusive: > > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=76724.08 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=91318.28 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=60374.94 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=91618.18 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=63076.21 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=92251.24 > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=50539.96 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=57884.80 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=54502.85 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=93230.44 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=55941.89 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=94480.92 For srptthread=0 case there is a consistent quite big increase. >>> but would still give srpt thread=1 a slight performance advantage. >> At this level CPU caches starting playing essential role. To get the >> maximum performance the commands processing of each command should use >> the same CPU L2+ cache(s), i.e. be done on the same physical CPU, but >> on different cores. Most likely, affinity assigned by you was worse, >> than the scheduler decisions. What's your CPU configuration? Please >> send me the top/vmstat output during tests from the target as well as >> your dmesg from the target just after it's booted. 
> My CPU config on the target (where I did the affinity) is 2 quad-core
> Xeon E5440 @ 2.83GHz. I didn't have my script configured to dump top and
> vmstat, so here's data from a rerun (and I have attached requested
> info). I'm not sure what is accounting for the spike at the beginning,
> but it seems consistent.
>
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=104699.43
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=133928.98
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=82736.73
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=82221.42
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=70203.53
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=85628.45
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=75646.90
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=87124.32
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=74545.84
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=88348.71
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=71837.15
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=84387.22

Why is there such a huge difference from the results you sent in the previous e-mail? For instance, for the case drives=1 scst_threads=1 srptthread=1: 104K vs 74K. What did you change?

What is the content of /proc/interrupts after the tests?
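
The thread pinning discussed in this exchange can also be done from inside a user-space program instead of with an external utility. The following is only a minimal sketch, assuming Linux with glibc; the helper names (first_allowed_cpu, pin_self_to_cpu, allowed_cpu_count) are invented for illustration and are not part of SCST or the SRP target driver.

```c
/* Illustrative only: helper names are hypothetical, not SCST/SRPT code. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* needed for cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    int cpu;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU; 0 on success, -1 on failure. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Count the CPUs in the calling thread's affinity mask, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

Kernel threads such as the SCST workers cannot call this themselves, so for them an external tool is still needed. IRQ affinity is set separately, as described in Documentation/IRQ-affinity.txt.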
From cameron at harr.org Tue Jan 13 10:39:28 2009 From: cameron at harr.org (Cameron Harr) Date: Tue, 13 Jan 2009 11:39:28 -0700 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496CD883.8040906@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> Message-ID: <496CDFE0.2030601@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr, on 01/13/2009 07:42 PM wrote: >> Vlad, you've got a good eye. Unfortunately, those results can't >> really be compared because I believe the previous results were >> intentionally run in a worse-case performance scenario. 
However I did >> run no-affinity runs before the affinity runs and would say >> performance increase is variable and somewhat inconclusive: >> >> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >> iops=76724.08 >> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >> iops=91318.28 >> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >> iops=60374.94 >> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >> iops=91618.18 >> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >> iops=63076.21 >> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >> iops=92251.24 >> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >> iops=50539.96 >> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >> iops=57884.80 >> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >> iops=54502.85 >> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >> iops=93230.44 >> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >> iops=55941.89 >> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >> iops=94480.92 > > For srptthread=0 case there is a consistent quite big increase. For srptthread=0, there is indeed a difference between no-affinity and affinity. However, here I meant there's not much of a difference between srptthread=[01] on a number of the points - and overall on this particular run it still seems like having the srp thread enabled still gives better performance. >> My CPU config on the target (where I did the affinity) is 2 quad-core >> Xeon E5440 @ 2.83GHz. I didn't have my script configured to dump top >> and vmstat, so here's data from a rerun (and I have attached >> requested info). I'm not sure what is accounting for the spike at the >> beginning, but it seems consistent. 
>> >> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >> iops=104699.43 >> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >> iops=133928.98 >> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >> iops=82736.73 >> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >> iops=82221.42 >> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >> iops=70203.53 >> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >> iops=85628.45 >> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >> iops=75646.90 >> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >> iops=87124.32 >> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >> iops=74545.84 >> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >> iops=88348.71 >> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >> iops=71837.15 >> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >> iops=84387.22 > > Why there is such a huge difference with the results you sent in the > previous e-mail? For instance, for case drives=1 scst_threads=1 > srptthread=1 104K vs 74K. What did you changed? That's what I meant when I wasn't sure what was causing the spike at the beginning. I didn't really do anything different other than rebooting. One factor may be due to the supply of free blocks in the flash media. As the number of free blocks decreases, garbage collection can increase to free up previously-used blocks. However, there appears to be some variance in the numbers that I can't account for. I just did another run after forcing free blocks to a critical level and got 64K IOPs. > > What is content of /proc/interrupts after the tests? You can see the HCA being interrupt-intensive, as well as iodrive, which I'm surprised to see because I locked its worker threads to cpus 6 and 7. 
          CPU0      CPU1      CPU2      CPU3      CPU4      CPU5      CPU6      CPU7
  0:  11173938         0         0         0         0         0         0         0  IO-APIC-edge   timer
  1:         9         0         0         0         0         0         0         0  IO-APIC-edge   i8042
  4:        14         0         0         0         0         0         0         0  IO-APIC-edge   serial
  8:         0         0         0         0         0         0         0         0  IO-APIC-edge   rtc
  9:         0         0         0         0         0         0         0         0  IO-APIC-level  acpi
 12:       106         0         0         0         0         0         0         0  IO-APIC-edge   i8042
 14:     35021      3114         0         0     47867     13692         0         0  IO-APIC-edge   ide0
 66:     19587       818         0         0      3743      3171         0         0  IO-APIC-level  uhci_hcd:usb3, libata
 74:      4209       159         0         0         0         0         0         0  PCI-MSI-X      mlx4_core (async)
 82: 273658543  63380024         0         0         0         0         0         0  PCI-MSI-X      mlx4_core (comp)
 90:         1         0         0         0         0         0         0         0  PCI-MSI-X      dev23945
 98:         2         0         0         0         0         0         0         0  PCI-MSI-X      dev23945 (queue 0)
106:         0         0         0         0         0         0         0         0  PCI-MSI-X      dev23945 (queue 1)
114:         0         0         0         0         0         0         0         0  PCI-MSI-X      dev23945 (queue 2)
122:         0         0         0         0         0         0         0         0  PCI-MSI-X      dev23945 (queue 3)
130:         0         0         0         0         0         0         0         0  PCI-MSI-X      eth4 (queue 4)
138:         0         0         0         0         0         0         0         0  PCI-MSI-X      eth4 (queue 5)
146:         0         0         0         0         0         0         0         0  PCI-MSI-X      eth4 (queue 6)
154:         0         0         0         0         0         0         0         0  PCI-MSI-X      eth4 (queue 7)
162:        58         0         0         0         0         0     48424     32510  PCI-MSI        eth2
169:    149935       418         0 179925438   5982077   2103882  27511574         0  IO-APIC-level  iodrive
177:     84565     10300         0  24953166   6645416    269061  38519676     28147  IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, iodrive
185:         0         0         0         0         0         0         0         0  IO-APIC-level  uhci_hcd:usb4
NMI:      4740      1805      1330      3495      2607      3838      3233      5481
LOC:  11171500  11171423  11171374  11171279  11171219  11171128  11171069  11170985
ERR:         0
MIS:         0

From PHF at zurich.ibm.com Wed Jan 14 02:21:25 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Wed, 14 Jan 2009 11:21:25 +0100
Subject: [ofa-general] iWARP: Zero STag, OFED 1.3 vs 1.4
Message-ID:

Hello,

I recently upgraded from OFED 1.3 to 1.4 and the behaviour of an STag of zero seems to have changed.
Before, the following code worked:

    /* create send work request (for synchronization) */
    sge.addr = 0;
    sge.length = 0;
    sge.lkey = 0;

    send_wr.wr_id = tx_wr_id++;
    send_wr.next = NULL;
    send_wr.sg_list = &sge;
    send_wr.num_sge = 1;
    send_wr.opcode = IBV_WR_SEND;
    send_wr.send_flags = 0;

    /* post send synchronization WR */
    ret = ibv_post_send(ctx_conn.qp, &send_wr, &bad_wr);
    if (ret) {
        msg_Err("RDMA: failed to post send");
        return -1;
    }

but now it results in the following error:

Jan 13 17:34:46 borus kernel: iwch_ev_dispatch - CQE Err qpid 0xa0 opcode 3 status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000
Jan 13 17:34:46 borus kernel: post_qp_event - AE qpid 0xa0 opcode 3 status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000

I wanted to do a zero-length send w/o a memory region to solve the issue that the MPA initiator must send the first FPDU.

Question1: Why didn't that result in an error before?
Question2: Is there a way of doing a zero-length operation w/o having to create a MR?

Many thanks for your advice,
Philip

From monis at Voltaire.COM Wed Jan 14 02:30:22 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 14 Jan 2009 12:30:22 +0200
Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE
In-Reply-To:
References: <496CB6DE.5090100@Voltaire.COM> <496CB7F6.7060204@Voltaire.COM>
Message-ID: <496DBEBE.5020206@Voltaire.COM>

Roland Dreier wrote:
> Does this actually fix any problems that a user sees? If not I'm
> inclined to wait for 2.6.30.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

Yes. This one (as its twin patch for ib_mthca) fixes a real problem.
From vlad at lists.openfabrics.org Wed Jan 14 03:15:10 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 14 Jan 2009 03:15:10 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090114-0200 daily build status Message-ID: <20090114111510.71A5AE610CB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From monis at Voltaire.COM Wed Jan 14 03:35:46 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Wed, 14 Jan 2009 13:35:46 +0200 Subject: [ofa-general] Re: [PATCH] IPoIB: refresh paths that might be invalid In-Reply-To: References: <496CB6DE.5090100@Voltaire.COM> <496CB89C.7020108@Voltaire.COM> Message-ID: <496DCE12.2070202@Voltaire.COM> Roland Dreier wrote: > I don't like this complexity, since waiting for an ARP probe is the > standard way of revalidating paths. Wouldn't it be better just to lower > the ARP refresh time if this is important to someone? > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > Lowering time between ARP probes might be a simpler solution but it has disadvantages 1. Frequent ARP probes (few seconds gap) might be disturbing during normal operation and I don't know if network managers would be happy to change it. 2. The time between probes has minimum value which might not be small enough to those who want fast recovery. 
I am resending the patch with a fix for a bug I found, plus a better change log, in case you consider applying it (I hope so).

From monis at Voltaire.COM Wed Jan 14 03:36:49 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 14 Jan 2009 13:36:49 +0200
Subject: [ofa-general] [PATCH v2] IPoIB: refresh paths that might be invalid
In-Reply-To:
References: <496CB6DE.5090100@Voltaire.COM> <496CB89C.7020108@Voltaire.COM>
Message-ID: <496DCE51.6070808@Voltaire.COM>

If a standby SM takes over and only some of the nodes change their LID as a result, the other nodes get an IPOIB_FLUSH_LIGHT event that doesn't flush paths but only marks them as probably invalid. Path refresh then happens only after an ARP probe, which may take some time (tens of seconds).

This patch adds a task that restarts the lookup of possibly invalid paths on two occasions:
- when handling an IPOIB_FLUSH_LIGHT event
- when path completion returns with a bad status.

This way paths are refreshed much faster.
Signed-off-by: Moni Shoua --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +++- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 2 - drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 ++++++++++++++++++++++++++---- 3 files changed, 38 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 753a983..89f574c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -298,6 +298,7 @@ struct ipoib_dev_priv { struct work_struct flush_heavy; struct work_struct restart_task; struct delayed_work ah_reap_task; + struct delayed_work path_refresh_task; struct ib_device *ca; u8 port; @@ -378,7 +379,7 @@ struct ipoib_path { struct rb_node rb_node; struct list_head list; - int valid; + u8 stale; }; struct ipoib_neigh { @@ -442,8 +443,9 @@ int ipoib_add_umcast_attr(struct net_device *dev); void ipoib_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_ah *address, u32 qpn); void ipoib_reap_ah(struct work_struct *work); +void ipoib_refresh_paths(struct work_struct *work); -void ipoib_mark_paths_invalid(struct net_device *dev); +void ipoib_mark_paths_stale(struct net_device *dev); void ipoib_flush_paths(struct net_device *dev); struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index a192581..1a50076 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -962,7 +962,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, } if (level == IPOIB_FLUSH_LIGHT) { - ipoib_mark_paths_invalid(dev); + ipoib_mark_paths_stale(dev); ipoib_mcast_dev_flush(dev); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index dce0443..b2ec845 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -352,7 +352,7 @@ void ipoib_path_iter_read(struct 
ipoib_path_iter *iter, #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ -void ipoib_mark_paths_invalid(struct net_device *dev) +void ipoib_mark_paths_stale(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path, *tp; @@ -360,11 +360,15 @@ void ipoib_mark_paths_invalid(struct net_device *dev) spin_lock_irq(&priv->lock); list_for_each_entry_safe(path, tp, &priv->path_list, list) { - ipoib_dbg(priv, "mark path LID 0x%04x GID %pI6 invalid\n", + ipoib_dbg(priv, "mark path LID 0x%04x GID %pI6 stale\n", be16_to_cpu(path->pathrec.dlid), path->pathrec.dgid.raw); - path->valid = 0; + path->stale = 1; } + if (!list_empty(&priv->path_list)) + queue_delayed_work(ipoib_workqueue, &priv->path_refresh_task, + round_jiffies_relative(HZ)); + spin_unlock_irq(&priv->lock); } @@ -427,6 +431,10 @@ static void path_rec_completion(int status, if (!ib_init_ah_from_path(priv->ca, priv->port, pathrec, &av)) ah = ipoib_create_ah(dev, priv->pd, &av); + } else { + path->stale = 1; + queue_delayed_work(ipoib_workqueue, &priv->path_refresh_task, + round_jiffies_relative(HZ)); } spin_lock_irqsave(&priv->lock, flags); @@ -477,7 +485,6 @@ static void path_rec_completion(int status, while ((skb = __skb_dequeue(&neigh->queue))) __skb_queue_tail(&skqueue, skb); } - path->valid = 1; } path->query = NULL; @@ -551,9 +558,29 @@ static int path_rec_start(struct net_device *dev, return path->query_id; } + path->stale = 0; return 0; } +void ipoib_refresh_paths(struct work_struct *work) +{ + struct ipoib_dev_priv *priv = + container_of(work, struct ipoib_dev_priv, path_refresh_task.work); + struct net_device *dev = priv->dev; + struct ipoib_path *path, *tp; + + spin_lock_irq(&priv->lock); + list_for_each_entry_safe(path, tp, &priv->path_list, list) { + ipoib_dbg(priv, "restart path LID 0x%04x GID %pI6\n", + be16_to_cpu(path->pathrec.dlid), + path->pathrec.dgid.raw); + if (path->stale) + path_rec_start(dev, path); + } + + spin_unlock_irq(&priv->lock); +} + static void 
neigh_add_path(struct sk_buff *skb, struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -656,7 +683,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, spin_lock_irqsave(&priv->lock, flags); path = __path_find(dev, phdr->hwaddr + 4); - if (!path || !path->valid) { + if (!path) { if (!path) path = path_rec_create(dev, phdr->hwaddr + 4); if (path) { @@ -1070,6 +1097,7 @@ static void ipoib_setup(struct net_device *dev) INIT_WORK(&priv->flush_heavy, ipoib_ib_dev_flush_heavy); INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); + INIT_DELAYED_WORK(&priv->path_refresh_task, ipoib_refresh_paths); } struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) From sashak at voltaire.com Wed Jan 14 07:42:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jan 2009 17:42:00 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c Fix memory leak for QOS string parameters. In-Reply-To: <494E2E5A.8050008@gmail.com> References: <494E2E5A.8050008@gmail.com> Message-ID: <20090114154200.GB1640@sashak.voltaire.com> Hi Eli, On 13:54 Sun 21 Dec , Eli Dorfman (Voltaire) wrote: > Fix memory leak for QOS string parameters. 
>
> Signed-off-by: Slava Strebkov
>
> ---
>  opensm/opensm/osm_subnet.c |   21 +++++++++++++++++++++
>  1 files changed, 21 insertions(+), 0 deletions(-)
>
> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c
> index 122d4dd..f8b29f8 100644
> --- a/opensm/opensm/osm_subnet.c
> +++ b/opensm/opensm/osm_subnet.c
> @@ -331,6 +331,21 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt)
>          opt->sl2vl = NULL;
>  }
>
> +static void subn_free_qos_options(IN osm_qos_options_t * opt)
> +{
> +        if ((opt->vlarb_high) && (opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH)) {
> +                free(opt->vlarb_high);
> +        }
> +
> +        if ((opt->vlarb_low) && (opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW)) {
> +                free(opt->vlarb_low);
> +        }
> +
> +        if ((opt->sl2vl) && (opt->sl2vl != OSM_DEFAULT_QOS_SL2VL)) {
> +                free(opt->sl2vl);
> +        }
> +}

With gcc-4.3.2 and the '-Wall' flag I get a warning here:

"comparison with string literal results in unspecified behavior"

The warning is justified, since the OSM_DEFAULT_QOS_* macros are defined as string literals. The code is equivalent to:

    char *p = "123456";
    ....
    if (p != "123456")
        free(p);

Gcc is smart enough to use the same string constant "123456" in both places (so your patch actually works), but I don't think that is guaranteed by the C language.

So I think we should rework this part: we can always allocate the string QoS config parameters (just as for other config options), and free them when they are not NULL.
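
The ownership rule described above (every stored string is either NULL or heap-allocated, so the free path never compares against a literal) can be pictured with a small stand-alone sketch. This is not OpenSM code: the struct, the function names and the default value are invented for illustration only.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical default, standing in for the OSM_DEFAULT_QOS_* macros. */
#define DEMO_DEFAULT_SL2VL "0,1,2,3"

struct demo_qos_options {
    char *sl2vl;    /* invariant: NULL or heap-allocated */
};

/* Portable replacement for strdup(); returns NULL on allocation failure. */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/* Store a private copy of value, or of the default when value is NULL. */
int demo_set_sl2vl(struct demo_qos_options *opt, const char *value)
{
    char *copy = dup_string(value ? value : DEMO_DEFAULT_SL2VL);

    if (!copy)
        return -1;
    free(opt->sl2vl);       /* free(NULL) is a no-op */
    opt->sl2vl = copy;
    return 0;
}

/* No comparison with a string literal is needed on the free path. */
void demo_free_qos_options(struct demo_qos_options *opt)
{
    free(opt->sl2vl);
    opt->sl2vl = NULL;
}
```

The cost is one small allocation per defaulted option; in exchange the cleanup code no longer depends on how the compiler merges string constants.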
Something like this: diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index a6db304..94b6332 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -333,13 +333,13 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) static void subn_free_qos_options(IN osm_qos_options_t * opt) { - if (opt->vlarb_high && opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) + if (opt->vlarb_high) free(opt->vlarb_high); - if (opt->vlarb_low && opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) + if (opt->vlarb_low) free(opt->vlarb_low); - if (opt->sl2vl && opt->sl2vl != OSM_DEFAULT_QOS_SL2VL) + if (opt->sl2vl) free(opt->sl2vl); } @@ -803,7 +803,7 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix, if (*vlarb == NULL) { log_report(" Invalid Cached Option: %s_vlarb_%s: " "Using Default\n", prefix, suffix); - *vlarb = dflt; + *vlarb = strdup(dflt); return; } @@ -872,7 +872,7 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt) if (*sl2vl == NULL) { log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n", prefix); - *sl2vl = dflt; + *sl2vl = strdup(dflt); return; } Thoughts? Sasha From vlad at dev.mellanox.co.il Wed Jan 14 08:03:27 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 14 Jan 2009 18:03:27 +0200 Subject: [ofa-general] installing ofed-1.4 In-Reply-To: <18790.27045.398492.560860@ifh-tuww01.ifh.uni-karlsruhe.de> References: <18790.27045.398492.560860@ifh-tuww01.ifh.uni-karlsruhe.de> Message-ID: <496E0CCF.9080903@dev.mellanox.co.il> Markus Uhlmann wrote: > hello, > > i am installing ofed-1.4 (as of 8. jan. 2009) on a 64 bit debian > system. 
> > after installing a number of rpms without problems, the installation > discontinues with the following error (from the log file): > > -------------------------------------------- > Processing files: kernel-ib-1.4-2.6.18_6_amd64 > error: File not found: /var/tmp/OFED/lib/modules/2.6.18-6-amd64/updates/kernel/drivers/net/cxgb3 > error: File not found: /var/tmp/OFED/lib/modules/2.6.18-6-amd64/updates/kernel/drivers/net/mlx4 > -------------------------------------------- > > these directories were apparently not created during preceeding > installation steps. however, the modules have indeed been built > correctly, here: > > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/2.6.18-6-amd64/extra/drivers/net/cxgb3/ > > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/2.6.18/extra/drivers/net/mlx4/ > > (containing kernel modules "cxgb3.ko" and "mlx4_core.ko", "mlx4_en.ko") > > there must be some confuson in the install script, or wherever those > paths come from. > > is there a way to adjust that? > > thanks, > > mu > > ps: this is near the top of the log-file: > > Installing /root/non-free/infiniband/OFED-1.4-20090108-0600/SRPMS/ofa_kernel-1.4-ofed20090108.src.rpm > > Hello Markus, Please send me the full rpmbuild log file (can be found under /tmp/OFED.xxxxxx). From the log above, I see that the kernel version is *2.6.18-6-amd64*, while required modules installed under /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/*2.6.18*/extra. Do you have KVERSION or MODULES_DIR environment variables defined? Regards, Vladimir From rdreier at cisco.com Wed Jan 14 08:08:50 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Jan 2009 08:08:50 -0800 Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <496DBEBE.5020206@Voltaire.COM> (Moni Shoua's message of "Wed, 14 Jan 2009 12:30:22 +0200") References: <496CB6DE.5090100@Voltaire.COM> <496CB7F6.7060204@Voltaire.COM> <496DBEBE.5020206@Voltaire.COM> Message-ID: > Yes. 
This one (as its twin patch for ib_mthca) fixes a real problem. What is the real problem? - R. From monis at Voltaire.COM Wed Jan 14 08:20:49 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Wed, 14 Jan 2009 18:20:49 +0200 Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: References: <496CB6DE.5090100@Voltaire.COM> <496CB7F6.7060204@Voltaire.COM> <496DBEBE.5020206@Voltaire.COM> Message-ID: <496E10E1.9010806@Voltaire.COM> Roland Dreier wrote: > > Yes. This one (as its twin patch for ib_mthca) fixes a real problem. > > What is the real problem? > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > Voltaire SM, for example, sends one MAD when it takes over another SM. This one MAD tells the client to re-register and if necessary, changes the client LID. Old code won't generate LID change event and the IPoIB would only do a "light flush" of paths. This causes a communication breakdown for a very long time. From swise at opengridcomputing.com Wed Jan 14 08:24:36 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 Jan 2009 10:24:36 -0600 Subject: [ofa-general] iWARP: Zero STag, OFED 1.3 vs 1.4 In-Reply-To: References: Message-ID: <496E11C4.4020905@opengridcomputing.com> Is this with the Chelsio RNIC? Philip Frey1 wrote: > > Hello, > > I recently upgraded from OFED 1.3 to 1.4 and the behaviour of an STag > of zero seems to have changed. 
> > Before, the following code worked: > > /* create send work request (for synchronization)*/ > sge.addr = 0; > sge.length = 0; > sge.lkey = 0; > > send_wr.wr_id = tx_wr_id++; > send_wr.next = NULL; > send_wr.sg_list = &sge; > send_wr.num_sge = 1; > send_wr.opcode = IBV_WR_SEND; > send_wr.send_flags = 0; > > /* post send synchronization WR */ > ret = ibv_post_send(ctx_conn.qp, &send_wr, &bad_wr); > if (ret) { > msg_Err("RDMA: failed to post send"); > return -1; > } > > but now it results in the following error: > > Jan 13 17:34:46 borus kernel: iwch_ev_dispatch - CQE Err qpid 0xa0 > opcode 3 status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 > Jan 13 17:34:46 borus kernel: post_qp_event - AE qpid 0xa0 opcode 3 > status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 > > I wanted to do a zero-length send w/o a memory region to solve the > issue that the MPA initiator must send the first FPDU. > > Question1: Why didn't that result in an error before? > Question2: Is there a way of doing a zero-length operation w/o having > to create a MR? > > Many thanks for you advice, > Philip > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From jackm at mellanox.co.il Wed Jan 14 08:37:45 2009 From: jackm at mellanox.co.il (Jack Morgenstein) Date: Wed, 14 Jan 2009 18:37:45 +0200 Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <496E10E1.9010806@Voltaire.COM> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0178B788@mtlexch01.mtl.com> If we are generating a LID CHANGE event, why do we need the CLIENT REREGISTER at all? If both are generated, you will do BOTH light and heavy sweeps -- isn't this a waste? 
Preferable to check for LID CHANGE first (and generate the event if needed). Only if there is no LID CHANGE event generated, check for CLIENT REREG. Thoughts? --Original Message----- > From: Moni Shoua [mailto:monis at Voltaire.COM] > Sent: Wednesday, January 14, 2009 6:21 PM > To: Roland Dreier > Cc: Jack Morgenstein; Olga Stern; Yossi Etigin; OpenFabrics General > Subject: Re: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of > IB_EVENT_LID_CHANGE > > > Roland Dreier wrote: > > > Yes. This one (as its twin patch for ib_mthca) fixes a real > > problem. > > > > What is the real problem? > > > > - R. > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > Voltaire SM, for example, sends one MAD when it takes over > another SM. This one MAD tells the client to re-register and > if necessary, changes the client LID. Old code won't generate > LID change event and the IPoIB would only do a "light flush" > of paths. > This causes a communication breakdown for a very long time. > From gmpc at sanger.ac.uk Wed Jan 14 08:54:53 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed, 14 Jan 2009 16:54:53 +0000 Subject: [ofa-general] installing ofed-1.4 In-Reply-To: <496E0CCF.9080903@dev.mellanox.co.il> References: <18790.27045.398492.560860@ifh-tuww01.ifh.uni-karlsruhe.de> <496E0CCF.9080903@dev.mellanox.co.il> Message-ID: <496E18DD.4060105@sanger.ac.uk> >> >> these directories were apparently not created during preceeding >> installation steps. 
however, the modules have indeed been built >> correctly, here: >> >> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/2.6.18-6-amd64/extra/drivers/net/cxgb3/ >> >> >> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/2.6.18/extra/drivers/net/mlx4/ >> >> >> (containing kernel modules "cxgb3.ko" and "mlx4_core.ko", "mlx4_en.ko") >> >> there must be some confuson in the install script, or wherever those >> paths come from. >> >> is there a way to adjust that? >> If you extract the kernel SRPM into a tarball with alien, you can run the kernel module build manually. alien -t ofa_kernel-1.4-ofed1.4.src.rpm (extract the resulting tarball) Note that you have to run the ofed_patch.sh script first, as extra minor versions numbers in the kernel confuses the patch script and it will try and apply the SLES kernel patches rather than the vanilla kernel ones. ./ofed_scripts/ofed_patch.sh --kernel-version=2.6.18 ./configure --kernel-sources=/usr/src/linux-headers-2.6.18-6 \ --kernel-version=2.6.18-6-amd64 \ --with-core-mod --with-ipoib-mod --with-ipoib-cm --with-sdp-mod \ --with-srp-mod --with-srp-target-mod \ --with-user_mad-mod --with-user_access-mod \ --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-mlx4_core-mod \ --with-mlx4_en-mod --with-mlx4_inf-mod --with-rds-mod --with-madeye-mod \ --with-qlgc_vnic-mod --with-qlgc_vnic_stats-mod --with-cxgb3-mod \ --with-nes-mod make && make install If that is too much like hard work, there is a minimally tested deb at: ftp://ftp.sanger.ac.uk/pub/gmpc/repository/etch/ofa-kernel-source_1.4-1_all.deb You can then build the modules with module-assistant dpkg -i ofa-kernel-source_1.4_all.deb module-assistant prepare module-assistant -t build ofa-kernel Cheers, Guy -- Dr. 
Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From jackm at dev.mellanox.co.il Wed Jan 14 09:01:44 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Wed, 14 Jan 2009 19:01:44 +0200 Subject: [ofa-general] iWARP: Zero STag, OFED 1.3 vs 1.4 In-Reply-To: References: Message-ID: <200901141901.44994.jackm@dev.mellanox.co.il> On Wednesday 14 January 2009 12:21, Philip Frey1 wrote: > Hello, > > I recently upgraded from OFED 1.3 to 1.4 and the behaviour of an STag of > zero seems to have changed. > Did you try sending with send_wr.sg_list = NULL; send_wr.num_sge = 0; ? (if this works, it should result in a send with no data) - Jack > Before, the following code worked: > > /* create send work request (for synchronization)*/ > sge.addr = 0; > sge.length = 0; > sge.lkey = 0; > > send_wr.wr_id = tx_wr_id++; > send_wr.next = NULL; > send_wr.sg_list = &sge; > send_wr.num_sge = 1; > send_wr.opcode = IBV_WR_SEND; > send_wr.send_flags = 0; > > /* post send synchronization WR */ > ret = ibv_post_send(ctx_conn.qp, &send_wr, &bad_wr); > if (ret) { > msg_Err("RDMA: failed to post send"); > return -1; > } > > but now it results in the following error: > > Jan 13 17:34:46 borus kernel: iwch_ev_dispatch - CQE Err qpid 0xa0 opcode > 3 status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 > Jan 13 17:34:46 borus kernel: post_qp_event - AE qpid 0xa0 opcode 3 status > 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 > > I wanted to do a zero-length send w/o a memory region to solve the issue > that the MPA initiator must send the first FPDU. > > Question1: Why didn't that result in an error before? 
> Question2: Is there a way of doing a zero-length operation w/o having to > create a MR? > > Many thanks for you advice, > Philip From swise at opengridcomputing.com Wed Jan 14 09:33:29 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 14 Jan 2009 11:33:29 -0600 Subject: [ofa-general] iWARP: Zero STag, OFED 1.3 vs 1.4 In-Reply-To: <496E11C4.4020905@opengridcomputing.com> References: <496E11C4.4020905@opengridcomputing.com> Message-ID: <496E21E9.7030301@opengridcomputing.com> Actually I see that this is cxgb3 based on your log information. Try setting num_sge to 0. Steve Wise wrote: > Is this with the Chelsio RNIC? > > > > > Philip Frey1 wrote: >> >> Hello, >> >> I recently upgraded from OFED 1.3 to 1.4 and the behaviour of an STag >> of zero seems to have changed. >> >> Before, the following code worked: >> >> /* create send work request (for synchronization)*/ >> sge.addr = 0; >> sge.length = 0; >> sge.lkey = 0; >> >> send_wr.wr_id = tx_wr_id++; >> send_wr.next = NULL; >> send_wr.sg_list = &sge; >> send_wr.num_sge = 1; >> send_wr.opcode = IBV_WR_SEND; >> send_wr.send_flags = 0; >> >> /* post send synchronization WR */ >> ret = ibv_post_send(ctx_conn.qp, &send_wr, &bad_wr); >> if (ret) { >> msg_Err("RDMA: failed to post send"); >> return -1; >> } >> >> but now it results in the following error: >> >> Jan 13 17:34:46 borus kernel: iwch_ev_dispatch - CQE Err qpid 0xa0 >> opcode 3 status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 >> Jan 13 17:34:46 borus kernel: post_qp_event - AE qpid 0xa0 opcode 3 >> status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 >> >> I wanted to do a zero-length send w/o a memory region to solve the >> issue that the MPA initiator must send the first FPDU. >> >> Question1: Why didn't that result in an error before? >> Question2: Is there a way of doing a zero-length operation w/o having >> to create a MR? 
>>
>> Many thanks for your advice,
>> Philip
>> ------------------------------------------------------------------------

From PHF at zurich.ibm.com Wed Jan 14 09:37:31 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Wed, 14 Jan 2009 18:37:31 +0100
Subject: [ofa-general] iWARP: Zero STag, OFED 1.3 vs 1.4
In-Reply-To: <496E11C4.4020905@opengridcomputing.com>
References: <496E11C4.4020905@opengridcomputing.com>
Message-ID: 

Yes, it is.

From: Steve Wise
To: Philip Frey1
Cc: Andreas Hasler1/Zurich/IBM at IBMCH, general at lists.openfabrics.org
Date: 01/14/2009 05:25 PM
Subject: Re: [ofa-general] iWARP: Zero STag, OFED 1.3 vs 1.4

Is this with the Chelsio RNIC?

Philip Frey1 wrote:
>
> Hello,
>
> I recently upgraded from OFED 1.3 to 1.4 and the behaviour of an STag
> of zero seems to have changed.
> > Before, the following code worked: > > /* create send work request (for synchronization)*/ > sge.addr = 0; > sge.length = 0; > sge.lkey = 0; > > send_wr.wr_id = tx_wr_id++; > send_wr.next = NULL; > send_wr.sg_list = &sge; > send_wr.num_sge = 1; > send_wr.opcode = IBV_WR_SEND; > send_wr.send_flags = 0; > > /* post send synchronization WR */ > ret = ibv_post_send(ctx_conn.qp, &send_wr, &bad_wr); > if (ret) { > msg_Err("RDMA: failed to post send"); > return -1; > } > > but now it results in the following error: > > Jan 13 17:34:46 borus kernel: iwch_ev_dispatch - CQE Err qpid 0xa0 > opcode 3 status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 > Jan 13 17:34:46 borus kernel: post_qp_event - AE qpid 0xa0 opcode 3 > status 0x1 type 1 wrid.hi 0x0 wrid.lo 0x80000000 > > I wanted to do a zero-length send w/o a memory region to solve the > issue that the MPA initiator must send the first FPDU. > > Question1: Why didn't that result in an error before? > Question2: Is there a way of doing a zero-length operation w/o having > to create a MR? > > Many thanks for you advice, > Philip > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Wed Jan 14 10:04:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jan 2009 20:04:40 +0200 Subject: [ofa-general] [PATCH v2 1/2] libibmad add PortXmitWait and CounterSelect2 to fields. 
In-Reply-To: <496CA5CE.3040804@gmail.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> <496CA5CE.3040804@gmail.com> Message-ID: <20090114180440.GI1441@sashak.voltaire.com> On 16:31 Tue 13 Jan , Eli Dorfman (Voltaire) wrote: > add PortXmitWait counter to fields. > add CounterSelect2 to fields to allow reset of PortXmitWait(MgtWg comment#4527) > > Signed-off-by: Eli Dorfman Applied. Thanks. Sasha From michael.heinz at qlogic.com Wed Jan 14 10:10:47 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Wed, 14 Jan 2009 12:10:47 -0600 Subject: [ofa-general] (no subject) Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74684092@MNEXMB1.qlogic.org> We've repeatedly run into a problem where mstvpd can hang on certain HCA models, and on HCAs that have failed. This is an issue for us, because mstvpd is one of the tools we use to automatically capture information about a system that's experiencing problems. I previously opened PR 1440 on the problem, but it doesn't appear to have been investigated. For this reason, I'm proposing the attached patch. Basically, it adds a configurable time out and it terminates the attempt to read the VPD area if it fails to retrieve data before the time out expires. The default is 30 seconds. It uses a stupid busy-loop to check for time out because that's what the existing code does. Other changes were also made to support this change - I changed how command line options are processed and extended the usage() function. --- vpd.c.orig 2009-01-08 16:56:12.000000000 -0500 +++ vpd.c 2009-01-08 17:44:01.000000000 -0500 @@ -44,6 +44,13 @@ #include #include #include +#include + +/* pread is non-blocking, so we loop until we find data. Unfortunately, + * we can loop forever if the HCA is crashed or if the wrong device is + * specified as an argument. So, we set time outs. 
+ */ +static clock_t ticks_per_sec, start_t, curr_t, timeout_t = 30; struct vpd_cap { unsigned char id; @@ -168,7 +175,13 @@ if (ret != sizeof addr_flag) return ret; + start_t = times(NULL); while((addr_flag[1] & VPD_FLAG) != VPD_FLAG_READ_READY) { + curr_t = times(NULL); + if ((curr_t - start_t) / ticks_per_sec > timeout_t) { + return -EIO; + } + ret = pread(device, addr_flag, sizeof addr_flag, vpd_cap_offset + VPD_ADDR_OFFSET); if (ret != sizeof addr_flag) @@ -437,24 +450,34 @@ rc = 1; goto usage; } - if (argc == 3) { - if (!strcmp("-m", argv[1])) { - argv++; - argc--; - m = 1; - } else if (!strcmp("-n", argv[1])) { - argv++; - argc--; - n = 1; - } else { - rc = 2; - goto usage; + + ticks_per_sec = sysconf(_SC_CLK_TCK); + + do + { + i=getopt(argc, argv, "mnt:"); + if (i<0) { + break; } - } - name = argv[1]; - argv++; - argc--; + switch (i) { + case 'm': + m=1; + break; + case 'n': + n=1; + break; + case 't': + timeout_t = strtol(optarg, NULL, 0); + break; + default: + goto usage; + } + } while (1 == 1); + + name = argv[optind]; + argc -= optind; + argv += optind; if (!strcmp("-", name)) { if (fread(d, VPD_MAX_SIZE, 1, stdin) != 1) @@ -486,6 +509,14 @@ return 0; usage: - fprintf(stderr, "Usage: %s [-m|-n] [-- keyword ...]\n", argv[0]); + fprintf(stderr, "Usage: %s [-m|-n] [-t ##] [-- keyword ...]\n", argv[0]); + fprintf(stderr, "-m\tDump raw VPD data to stdout.\n"); + fprintf(stderr, "-n\tDo not validate check sum.\n"); + fprintf(stderr, "-t ##\tTime out after ## seconds. (Default is 30.)\n\n"); + fprintf(stderr, "file\tThe PCI id number of the HCA (for example, \"2:00.0\"),\n"); + fprintf(stderr, "\tthe device name (such as \"mlx4_0\")\n"); + fprintf(stderr, "\tthe absolute path to the device (\"/sys/class/infiniband/mlx4_0/device\")\n"); + fprintf(stderr, "\tor '-' to read VPD data from the standard input.\n\n"); + fprintf(stderr, "keyword(s): Only display the requested information. 
(ID, PN, EC, SN, etc...)\n"); return rc; } -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mstvpd.patch Type: application/octet-stream Size: 2501 bytes Desc: mstvpd.patch URL: From markus.uhlmann at ifh.uka.de Wed Jan 14 10:16:29 2009 From: markus.uhlmann at ifh.uka.de (Markus Uhlmann) Date: Wed, 14 Jan 2009 19:16:29 +0100 Subject: [ofa-general] installing ofed-1.4 Message-ID: <18798.11261.165834.519441@ifh-tuww01.ifh.uni-karlsruhe.de> Thanks Vladimir & Guy. Since there was no echo on this list for a few days, I went ahead and installed a 2.6.24 kernel (with debian patches). Installing ofed-1.4 (not latest) then went more or less okay (small problem: when building "mpitest" for mvapich-1, the path in the script "mpicc" was wrong: it contained the temporary build directory's path, but the files were already installed in the final install directory; this could be fixed manually). That installation seems to work (the speed is not right, though; e.g. the OSU benchmarks seem somewhat off). Another small thing i noticed: is it correct that I manually need to remove the infiniband-related modules already loaded from the "stock" kernel(i.e. "rmmod" on ib_mthca,ib_mad,ib_core,mlx4_core) before running "/etc/init.d/openibd start" ??? >Hello Markus, >Please send me the full rpmbuild log file (can be found under >/tmp/OFED.xxxxxx). Sorry, apparently I did not conserve the original log files under "/tmp". I should have saved them. Actually, I tried to submit to bugzilla - as indicated - but it would have been one long copy-paste, since there was no way to attach a file! > From the log above, I see that the kernel version is *2.6.18-6-amd64*, >while required >modules installed under >/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/lib/modules/*2.6.18*/extra. 
>Do you have KVERSION or MODULES_DIR environment variables defined? No, these variables are not defined. Anyway, I appreciate your effort in the debian direction. Cheers, MU From vlad at dev.mellanox.co.il Wed Jan 14 10:28:05 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 14 Jan 2009 20:28:05 +0200 Subject: [ofa-general] installing ofed-1.4 In-Reply-To: <18798.11261.165834.519441@ifh-tuww01.ifh.uni-karlsruhe.de> References: <18798.11261.165834.519441@ifh-tuww01.ifh.uni-karlsruhe.de> Message-ID: <496E2EB5.2070805@dev.mellanox.co.il> Markus Uhlmann wrote: > > Another small thing i noticed: is it correct that I manually need to > remove the infiniband-related modules already loaded from the "stock" > kernel(i.e. "rmmod" on ib_mthca,ib_mad,ib_core,mlx4_core) before > running "/etc/init.d/openibd start" ??? > Alternatively, you can use "/etc/init.d/openibd restart". Regards, Vladimir From michael.heinz at qlogic.com Wed Jan 14 10:41:48 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Wed, 14 Jan 2009 12:41:48 -0600 Subject: [ofa-general] [PATCH] mstvpd (resend) Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E7468409D@MNEXMB1.qlogic.org> We've repeatedly run into a problem where mstvpd can hang on certain HCA models, and on HCAs that have failed. This is an issue for us, because mstvpd is one of the tools we use to automatically capture information about a system that's experiencing problems. I previously opened PR 1440 on the problem, but it doesn't appear to have been investigated. For this reason, I'm proposing the attached patch. Basically, it adds a configurable time out and it terminates the attempt to read the VPD area if it fails to retrieve data before the time out expires. The default is 30 seconds. It uses a stupid busy-loop to check for time out because that's what the existing code does. Other changes were also made to support this change - I changed how command line options are processed and extended the usage() function. 
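A minimal sketch of the bounded-polling idea described above, for
illustration only and not part of the attached patch. It uses
clock_gettime(CLOCK_MONOTONIC) rather than times(). The names
poll_until_ready, ready_fn and ready_after_n are hypothetical, and the
callback merely stands in for re-reading the VPD flag with pread().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical callback type: returns true once the device is "ready".
 * In vpd.c this role is played by re-reading the flag with pread(). */
typedef bool (*ready_fn)(void *ctx);

/* Poll ready() until it succeeds or timeout_sec whole seconds have elapsed.
 * Returns 0 on success, -ETIMEDOUT on timeout, -errno if the clock fails. */
static int poll_until_ready(ready_fn ready, void *ctx, long timeout_sec)
{
	struct timespec start, now;

	assert(ready != NULL);

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return -errno;

	while (!ready(ctx)) {
		if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
			return -errno;

		/* Whole seconds elapsed since start, rounded down. */
		time_t elapsed = now.tv_sec - start.tv_sec;
		if (now.tv_nsec < start.tv_nsec)
			elapsed--;

		if (elapsed >= timeout_sec)
			return -ETIMEDOUT;
	}
	return 0;
}

/* Hypothetical stand-in for the hardware: reports "not ready" for the
 * first *ctx calls, then "ready" on every call after that. */
static bool ready_after_n(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

A timeout of 0 gives the callback exactly one chance. Like the patch, this
loop still spins; sleeping briefly between attempts (for example with
nanosleep()) would reduce CPU use at the cost of some latency.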
--- vpd.c.orig 2009-01-08 16:56:12.000000000 -0500 +++ vpd.c 2009-01-08 17:44:01.000000000 -0500 @@ -44,6 +44,13 @@ #include #include #include +#include + +/* pread is non-blocking, so we loop until we find data. Unfortunately, + * we can loop forever if the HCA is crashed or if the wrong device is + * specified as an argument. So, we set time outs. + */ +static clock_t ticks_per_sec, start_t, curr_t, timeout_t = 30; struct vpd_cap { unsigned char id; @@ -168,7 +175,13 @@ if (ret != sizeof addr_flag) return ret; + start_t = times(NULL); while((addr_flag[1] & VPD_FLAG) != VPD_FLAG_READ_READY) { + curr_t = times(NULL); + if ((curr_t - start_t) / ticks_per_sec > timeout_t) { + return -EIO; + } + ret = pread(device, addr_flag, sizeof addr_flag, vpd_cap_offset + VPD_ADDR_OFFSET); if (ret != sizeof addr_flag) @@ -437,24 +450,34 @@ rc = 1; goto usage; } - if (argc == 3) { - if (!strcmp("-m", argv[1])) { - argv++; - argc--; - m = 1; - } else if (!strcmp("-n", argv[1])) { - argv++; - argc--; - n = 1; - } else { - rc = 2; - goto usage; + + ticks_per_sec = sysconf(_SC_CLK_TCK); + + do + { + i=getopt(argc, argv, "mnt:"); + if (i<0) { + break; } - } - name = argv[1]; - argv++; - argc--; + switch (i) { + case 'm': + m=1; + break; + case 'n': + n=1; + break; + case 't': + timeout_t = strtol(optarg, NULL, 0); + break; + default: + goto usage; + } + } while (1 == 1); + + name = argv[optind]; + argc -= optind; + argv += optind; if (!strcmp("-", name)) { if (fread(d, VPD_MAX_SIZE, 1, stdin) != 1) @@ -486,6 +509,14 @@ return 0; usage: - fprintf(stderr, "Usage: %s [-m|-n] [-- keyword ...]\n", argv[0]); + fprintf(stderr, "Usage: %s [-m|-n] [-t ##] [-- keyword ...]\n", argv[0]); + fprintf(stderr, "-m\tDump raw VPD data to stdout.\n"); + fprintf(stderr, "-n\tDo not validate check sum.\n"); + fprintf(stderr, "-t ##\tTime out after ## seconds. 
(Default is 30.)\n\n"); + fprintf(stderr, "file\tThe PCI id number of the HCA (for example, \"2:00.0\"),\n"); + fprintf(stderr, "\tthe device name (such as \"mlx4_0\")\n"); + fprintf(stderr, "\tthe absolute path to the device (\"/sys/class/infiniband/mlx4_0/device\")\n"); + fprintf(stderr, "\tor '-' to read VPD data from the standard input.\n\n"); + fprintf(stderr, "keyword(s): Only display the requested information. (ID, PN, EC, SN, etc...)\n"); return rc; } -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mstvpd.patch Type: application/octet-stream Size: 2501 bytes Desc: mstvpd.patch URL: From sashak at voltaire.com Wed Jan 14 10:54:46 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jan 2009 20:54:46 +0200 Subject: [ofa-general] [PATCH v3 2/2] infiniband-diags support PortXmitWait get and set In-Reply-To: <496CABB9.4030402@gmail.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> <496CA602.1060600@gmail.com> <496CABB9.4030402@gmail.com> Message-ID: <20090114185446.GJ1441@sashak.voltaire.com> Hi Eli, On 16:56 Tue 13 Jan , Eli Dorfman (Voltaire) wrote: > support PortXmitWait get and set > fix syntax error > > Signed-off-by: Eli Dorfman > --- > infiniband-diags/src/perfquery.c | 14 +++++++++++++- > libibmad/src/gs.c | 2 ++ > 2 files changed, 15 insertions(+), 1 deletions(-) > > diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c > index 41a8b74..5e5b3ed 100644 > --- a/infiniband-diags/src/perfquery.c > +++ b/infiniband-diags/src/perfquery.c > @@ -67,6 +67,7 @@ struct perf_count { > uint32_t rcvdata; > uint32_t xmtpkts; > uint32_t rcvpkts; > + uint32_t xmtwait; > }; > > struct perf_count_ext { > @@ -209,6 +210,8 @@ static void aggregate_perfcounters(void) > aggregate_32bit(&perf_count.xmtpkts, val); 
> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); > aggregate_32bit(&perf_count.rcvpkts, val); > + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); Please use tab character for indentation. > + aggregate_32bit(&perf_count.xmtwait, val); > } > > static void output_aggregate_perfcounters(ib_portid_t *portid) > @@ -235,6 +238,7 @@ static void output_aggregate_perfcounters(ib_portid_t *portid) > mad_encode_field(pc, IB_PC_RCV_BYTES_F, &perf_count.rcvdata); > mad_encode_field(pc, IB_PC_XMT_PKTS_F, &perf_count.xmtpkts); > mad_encode_field(pc, IB_PC_RCV_PKTS_F, &perf_count.rcvpkts); > + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); > > mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > > @@ -298,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p > if (extended != 1) { > if (!port_performance_query(pc, portid, port, timeout)) > IBERROR("perfquery"); > + if (!(cap_mask & 0x1000)) { > + /* if PortCounters:PortXmitWait not suppported clear this counter */ > + perf_count.xmtwait = 0; > + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); > + } Is this a good thing to hide the reported value? We could to not show XmitWait at all in case when it is not supported, or to show it as was reported by port and not "to lie" about zero value. > if (aggregate) > aggregate_perfcounters(); > - else > + else > mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > } else { > if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ > @@ -500,6 +509,9 @@ main(int argc, char **argv) > > do_reset: > > + if (!extended && (cap_mask & 0x1000)) > + mask |= (1<<16); /* reset portxmitwait */ The counter mask can be specified by user in command line to reset only certain counters. The code above will add XmitWait unconditionally regardless to user wishes. So shouldn't it be something like: if (argc <= 2 && !extended && (cap_mask & 0x1000)) mask |= (1<<16); /* reset portxmitwait */ ? 
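To make the question concrete, here is a small standalone sketch of the
mask handling under discussion. It is not perfquery or libibmad code:
build_reset_mask, split_reset_mask and the user_gave_mask flag (standing in
for the argc check) are hypothetical names. Only the 0x1000 capability bit,
bit 16 for PortXmitWait, and the ">> 16" shift are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_XMIT_WAIT 0x1000u     /* cap_mask bit tested in the patch */
#define SEL_XMIT_WAIT (1u << 16)  /* reset bit the patch uses for PortXmitWait */

/* Add the PortXmitWait reset bit only when the user did not supply a mask,
 * non-extended counters are in use, and the port advertises the counter. */
static uint32_t build_reset_mask(uint32_t mask, bool user_gave_mask,
				 bool extended, uint16_t cap_mask)
{
	if (!user_gave_mask && !extended && (cap_mask & CAP_XMIT_WAIT))
		mask |= SEL_XMIT_WAIT;
	return mask;
}

/* Split the combined mask the way the gs.c hunk does: the low 16 bits go
 * to CounterSelect, whatever is above them goes to CounterSelect2. */
static void split_reset_mask(uint32_t mask, unsigned *select, unsigned *select2)
{
	assert(select != NULL && select2 != NULL);
	*select = mask & 0xffffu;
	*select2 = mask >> 16;
}
```

With this shape, a mask given on the command line is passed through
untouched, so a user who asks to reset only certain counters does not have
XmitWait reset as a side effect.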
Sasha > + > if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { > for (i = start_port; i <= num_ports; i++) > reset_counters(extended, timeout, mask, &portid, i); > diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > index d350c0d..30f00fb 100644 > --- a/libibmad/src/gs.c > +++ b/libibmad/src/gs.c > @@ -142,6 +142,8 @@ performance_reset_via(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, > /* Same for attribute IDs */ > mad_set_field(rcvbuf, 0, IB_PC_PORT_SELECT_F, port); > mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT_F, mask); > + mask = mask >> 16; > + mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT2_F, mask); > rpc.attr.mod = 0; > rpc.timeout = timeout; > rpc.datasz = IB_PC_DATA_SZ; > -- > 1.5.5 > From yosefe at Voltaire.COM Wed Jan 14 10:51:40 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Wed, 14 Jan 2009 20:51:40 +0200 Subject: [ofa-general] Re: [PATCH v2] IPoIB: refresh paths that might be invalid In-Reply-To: <496DCE51.6070808@Voltaire.COM> References: <496CB6DE.5090100@Voltaire.COM> <496CB89C.7020108@Voltaire.COM> <496DCE51.6070808@Voltaire.COM> Message-ID: <496E343C.3080903@Voltaire.COM> How about marking neighbours invalid (instead of paths) then you could mark all neighbours invalid on flush events, and in ipoib_start_xmit() add neigh->valid to the gid memcmp condition, so that if neigh->valid is 0 ipoib will free it, and call ipoib_path_loookup()? Moni Shoua wrote: > If a standby SM takes over and if only some of the nodes change their LID as a > result, the other nodes get an IPOIB_FLUSH_LIGHT event that doesn't cause > flushing of paths but only marks them as probably invalid. Path refresh > will happen only after an ARP probe which may take some time to happen (tens of seconds). > > This patch adds a task that is responsible to restart the lookup of possibly > invalid paths in 2 occasions: > - handling of IPOIB_FLUSH_LIGHT event > - when path completion returns with bad status. 
> > This way paths are being refreshed much faster. > > Signed-off-by: Moni Shoua > --- > drivers/infiniband/ulp/ipoib/ipoib.h | 6 +++- > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 2 - > drivers/infiniband/ulp/ipoib/ipoib_main.c | 38 ++++++++++++++++++++++++++---- > 3 files changed, 38 insertions(+), 8 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h > index 753a983..89f574c 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib.h > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h > @@ -298,6 +298,7 @@ struct ipoib_dev_priv { > struct work_struct flush_heavy; > struct work_struct restart_task; > struct delayed_work ah_reap_task; > + struct delayed_work path_refresh_task; > > struct ib_device *ca; > u8 port; > @@ -378,7 +379,7 @@ struct ipoib_path { > > struct rb_node rb_node; > struct list_head list; > - int valid; > + u8 stale; > }; > > struct ipoib_neigh { > @@ -442,8 +443,9 @@ int ipoib_add_umcast_attr(struct net_device *dev); > void ipoib_send(struct net_device *dev, struct sk_buff *skb, > struct ipoib_ah *address, u32 qpn); > void ipoib_reap_ah(struct work_struct *work); > +void ipoib_refresh_paths(struct work_struct *work); > > -void ipoib_mark_paths_invalid(struct net_device *dev); > +void ipoib_mark_paths_stale(struct net_device *dev); > void ipoib_flush_paths(struct net_device *dev); > struct ipoib_dev_priv *ipoib_intf_alloc(const char *format); > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index a192581..1a50076 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -962,7 +962,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, > } > > if (level == IPOIB_FLUSH_LIGHT) { > - ipoib_mark_paths_invalid(dev); > + ipoib_mark_paths_stale(dev); > ipoib_mcast_dev_flush(dev); > } > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index 
dce0443..b2ec845 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -352,7 +352,7 @@ void ipoib_path_iter_read(struct ipoib_path_iter *iter, > > #endif /* CONFIG_INFINIBAND_IPOIB_DEBUG */ > > -void ipoib_mark_paths_invalid(struct net_device *dev) > +void ipoib_mark_paths_stale(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ipoib_path *path, *tp; > @@ -360,11 +360,15 @@ void ipoib_mark_paths_invalid(struct net_device *dev) > spin_lock_irq(&priv->lock); > > list_for_each_entry_safe(path, tp, &priv->path_list, list) { > - ipoib_dbg(priv, "mark path LID 0x%04x GID %pI6 invalid\n", > + ipoib_dbg(priv, "mark path LID 0x%04x GID %pI6 stale\n", > be16_to_cpu(path->pathrec.dlid), > path->pathrec.dgid.raw); > - path->valid = 0; > + path->stale = 1; > } > + if (!list_empty(&priv->path_list)) > + queue_delayed_work(ipoib_workqueue, &priv->path_refresh_task, > + round_jiffies_relative(HZ)); > + > > spin_unlock_irq(&priv->lock); > } > @@ -427,6 +431,10 @@ static void path_rec_completion(int status, > > if (!ib_init_ah_from_path(priv->ca, priv->port, pathrec, &av)) > ah = ipoib_create_ah(dev, priv->pd, &av); > + } else { > + path->stale = 1; > + queue_delayed_work(ipoib_workqueue, &priv->path_refresh_task, > + round_jiffies_relative(HZ)); > } > > spin_lock_irqsave(&priv->lock, flags); > @@ -477,7 +485,6 @@ static void path_rec_completion(int status, > while ((skb = __skb_dequeue(&neigh->queue))) > __skb_queue_tail(&skqueue, skb); > } > - path->valid = 1; > } > > path->query = NULL; > @@ -551,9 +558,29 @@ static int path_rec_start(struct net_device *dev, > return path->query_id; > } > > + path->stale = 0; > return 0; > } > > +void ipoib_refresh_paths(struct work_struct *work) > +{ > + struct ipoib_dev_priv *priv = > + container_of(work, struct ipoib_dev_priv, path_refresh_task.work); > + struct net_device *dev = priv->dev; > + struct ipoib_path *path, *tp; > + > + 
spin_lock_irq(&priv->lock); > + list_for_each_entry_safe(path, tp, &priv->path_list, list) { > + ipoib_dbg(priv, "restart path LID 0x%04x GID %pI6\n", > + be16_to_cpu(path->pathrec.dlid), > + path->pathrec.dgid.raw); > + if (path->stale) > + path_rec_start(dev, path); > + } > + > + spin_unlock_irq(&priv->lock); > +} > + > static void neigh_add_path(struct sk_buff *skb, struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > @@ -656,7 +683,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev, > spin_lock_irqsave(&priv->lock, flags); > > path = __path_find(dev, phdr->hwaddr + 4); > - if (!path || !path->valid) { > + if (!path) { > if (!path) > path = path_rec_create(dev, phdr->hwaddr + 4); > if (path) { > @@ -1070,6 +1097,7 @@ static void ipoib_setup(struct net_device *dev) > INIT_WORK(&priv->flush_heavy, ipoib_ib_dev_flush_heavy); > INIT_WORK(&priv->restart_task, ipoib_mcast_restart_task); > INIT_DELAYED_WORK(&priv->ah_reap_task, ipoib_reap_ah); > + INIT_DELAYED_WORK(&priv->path_refresh_task, ipoib_refresh_paths); > } > > struct ipoib_dev_priv *ipoib_intf_alloc(const char *name) > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- --Yossi From yosefe at Voltaire.COM Wed Jan 14 11:04:39 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Wed, 14 Jan 2009 21:04:39 +0200 Subject: [ofa-general] Re: [PATCH v2] ipoib: fix a deadlock between ipoib start/stop and child interface create/delete In-Reply-To: References: <495B6B8B.50803@Voltaire.COM> Message-ID: <496E3747.7040507@Voltaire.COM> Roland Dreier wrote: > I think this almost works, but: > > > + list_for_each_entry(cpriv, &priv->child_intfs, list) { > > + flags = cpriv->dev->flags; > > + new_flags = (flags & ~IFF_UP) | iffup_value; > > + if (flags != new_flags) { 
> > +			rtnl_lock();
> > +			dev_change_flags(cpriv->dev, new_flags);
> > +			rtnl_unlock();
> > +		}
> > +	}
>
> taking flags outside of the rtnl lock looks dubious to me, since it
> could change before we get to the dev_change_flags() call.

Yes, you are right. All bitwise stuff should be inside the lock.

>
> Looking at all this old code, I have to wonder whether anyone is
> depending on bringing up the main interface also bringing up all the
> subinterfaces ... the simplest solution would be to let the
> subinterfaces be independent. Is there anything wrong with just
> deleting the code to bring subinterfaces up/down?
>

Deleting the code would be the simplest thing to do, but on the other hand
the fact that the subinterfaces are up when the master is brought up is a
behaviour that the users/userspace scripts got used to, so I didn't want
to change that.
Vlad, what do you think? How will openibd script work with this?

From sashak at voltaire.com Wed Jan 14 11:21:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 Jan 2009 21:21:17 +0200
Subject: [ofa-general] [PATCH v2 1/2] opensm: Add new partition keyword for all hca, switches and routers
In-Reply-To: <496CBA91.9000008@gmail.com>
References: <495CE3F8.9080506@gmail.com> <20090107163521.GH11759@sashak.voltaire.com> <496CBA24.2080902@gmail.com> <496CBA91.9000008@gmail.com>
Message-ID: <20090114192117.GK1441@sashak.voltaire.com>

On 18:00 Tue 13 Jan , Eli Dorfman (Voltaire) wrote:
> Add new partition keyword for node type groups.
> The following new keywords were added:
> 'ALL_CAS' means all CA end ports in the subnet.
> 'ALL_SWITCHES' means all Switch end ports in the subnet
> 'ALL_ROUTERS' means all Router end ports in the subnet
> For example, to allow firmware upgrade within managed switches we
> need that all switch port 0 to have full membership.
>
> Signed-off-by: Eli Dorfman

Applied. Thanks.
I changed this slightly - node type '0' will mean 'all nodes', so I changed
the 'uint8_t type' parameter of osm_prtn_add_all() to 'unsigned type'.

Sasha

From sashak at voltaire.com Wed Jan 14 11:28:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 14 Jan 2009 21:28:56 +0200
Subject: [ofa-general] [PATCH v2 2/2] docs update documenatation about new partition keywords
In-Reply-To: <496CBACF.7030907@gmail.com>
References: <495CE3F8.9080506@gmail.com> <20090107163521.GH11759@sashak.voltaire.com> <496CBA24.2080902@gmail.com> <496CBACF.7030907@gmail.com>
Message-ID: <20090114192848.GL1441@sashak.voltaire.com>

On 18:01 Tue 13 Jan , Eli Dorfman (Voltaire) wrote:
> update documenatation about new partition keywords
> 'ALL_CAS', 'ALL_SWITCHES', 'ALL_ROUTERS'
>
> Signed-off-by: Eli Dorfman

Applied. Thanks. Also with a small change - I added examples with the
ALL_SWITCHES keyword instead of replacing the original lines.

Sasha

From yosefe at Voltaire.COM Wed Jan 14 11:31:19 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Wed, 14 Jan 2009 21:31:19 +0200
Subject: [ofa-general] Re: About create_qp_flags merging
In-Reply-To:
References:
Message-ID: <496E3D87.1020301@Voltaire.COM>

We want to expose to userspace (and hook up in libibverbs) the ability to
block multicast loopback for a QP. This will enable userspace applications
to get better performance.
Ron's patchset does it by exposing it in ib_uverbs, then posting the command
from libibverbs, and going through libmlx4. Is it possible to do it in a
better way?

What is the status of the XRC patches? AFAIK they were submitted at least six
months ago, but are still not merged. Is uverbs/libibverbs/libmlx4 frozen
until they are merged?

--Yossi

Roland Dreier wrote:
> > You've told me that you're not going to merge the create_qp_flags
> > patches because they are stuck behind XRC.
> >
> > However, I reposted new patches in December that don't rely on the XRC
> > patches.
> > a) your patches look pointless, because they add a new userspace > interface and then don't actually hook anything in libibverbs up to > the new interface. > > b) is there any reason why your patches are so important that I should > skip merging the XRC work that was submitted before yours and jump to > apply yours first? > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- --Yossi From sashak at voltaire.com Wed Jan 14 11:36:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 14 Jan 2009 21:36:12 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update LFTs when entering master state In-Reply-To: <496A67D8.9070308@dev.mellanox.co.il> References: <20090111204927.GG1441@sashak.voltaire.com> <496A67D8.9070308@dev.mellanox.co.il> Message-ID: <20090114193612.GM1441@sashak.voltaire.com> Hi Yevgeny, On 23:42 Sun 11 Jan , Yevgeny Kliteynik wrote: > Hi Sasha, > > Sasha Khapyorsky wrote: >> When we are going to setup LFTs we need to ignore its previous images if >> OpenSM enters master after standby, so need to check for subnet >> need_update flag too. > > Nice catch. I think there will be a similar problem with > cached routing too - need to invalidate the cache when SM > enters master state. Yes, obviously. 
Maybe something like this will do:


diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 625e026..fc7ceb9 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1089,7 +1089,7 @@ static void do_sweep(osm_sm_t * sm)
 	 */
 	if (sm->p_subn->opt.use_ucast_cache &&
 	    (sm->p_subn->subnet_initialization_error ||
-	     sm->p_subn->force_reroute))
+	     sm->p_subn->force_reroute || sm->p_subn->coming_out_of_standby))
 		osm_ucast_cache_invalidate(&sm->ucast_mgr);
 
 	/*


(and basically I think that all this "flags flow" mess requires cleanup
already :)).

Sasha

From yosefe at Voltaire.COM Wed Jan 14 11:53:58 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Wed, 14 Jan 2009 21:53:58 +0200
Subject: [ofa-general] Re: [PATCH] ipoib: don't enable napi when it's already enabled
In-Reply-To:
References: <48FB32CF.6060202@Voltaire.COM>
Message-ID: <496E42D6.3020107@Voltaire.COM>

Well, now that I sent and you applied a patch that fixes the issue by
moving napi_enable() after ipoib_pkey_dev_delay_open(), it seems to be broken.
If the pkey is never found, ipoib_open() will not be called again and napi will
never be enabled. Then napi_disable() gets stuck in an infinite loop
(polling for the NAPI_STATE_SCHED bit).
I think the fix (to the fix) should be like the one in the patch below.

Roland Dreier wrote:
> > -	napi_enable(&priv->napi);
> > -	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
> > +	if (!test_and_set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> > +		napi_enable(&priv->napi);
> >
> >  	if (ipoib_pkey_dev_delay_open(dev))
> >  		return 0;
>
> Does it work just to move the napi_enable() to after the
> ipoib_pkey_dev_delay_open() test?
>
> - R.
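
The enable/disable pairing problem described in this message can be illustrated outside the kernel. What follows is only a minimal user-space sketch, not the IPoIB code: struct fake_dev, FLAG_ADMIN_UP and the fake_open()/fake_stop() helpers are hypothetical names, and C11 atomic_fetch_or()/atomic_fetch_and() stand in for the kernel's test_and_set_bit()/test_and_clear_bit(). The point is that the enable step runs only on the first transition to "up", and the disable step runs only if a matching enable happened.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for driver state; these are not IPoIB structures. */
#define FLAG_ADMIN_UP 0x1u

struct fake_dev {
	atomic_uint flags;      /* bit flags, loosely analogous to priv->flags */
	int poll_enable_count;  /* how many enables are currently outstanding */
};

/* Atomically set a flag and report whether it was already set. */
static bool test_and_set_flag(atomic_uint *flags, unsigned int bit)
{
	return (atomic_fetch_or(flags, bit) & bit) != 0;
}

/* Atomically clear a flag and report whether it was set before. */
static bool test_and_clear_flag(atomic_uint *flags, unsigned int bit)
{
	return (atomic_fetch_and(flags, ~bit) & bit) != 0;
}

/* Enable polling only on the transition from "down" to "up". */
static void fake_open(struct fake_dev *dev)
{
	if (!test_and_set_flag(&dev->flags, FLAG_ADMIN_UP))
		dev->poll_enable_count++;
}

/* Disable polling only if a matching enable happened earlier. */
static void fake_stop(struct fake_dev *dev)
{
	if (test_and_clear_flag(&dev->flags, FLAG_ADMIN_UP)) {
		assert(dev->poll_enable_count > 0);
		dev->poll_enable_count--;
	}
}
```

With this shape, calling fake_open() twice in a row leaves exactly one enable outstanding, and a later fake_stop() always has something to undo, which is the property the quoted test_and_set_bit() change is after.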
From rdreier at cisco.com Wed Jan 14 12:18:23 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 Jan 2009 12:18:23 -0800
Subject: [ofa-general] Re: About create_qp_flags merging
In-Reply-To: <496E3D87.1020301@Voltaire.COM> (Yossi Etigin's message of "Wed, 14 Jan 2009 21:31:19 +0200")
References: <496E3D87.1020301@Voltaire.COM>
Message-ID:

 > What is the status of XRC patches? AFAIK they were submitted at least six months
 > ago, but still not merged. Is uverbs/libibverbs/libmlx4 freezed until
 > they will be merged?

I hope to get time to finish fixing up and reviewing the XRC changes for
2.6.30. I don't want to merge anything else that requires libibverbs
ABI changes until that's done.

 - R.

From rdreier at cisco.com Wed Jan 14 12:19:28 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 Jan 2009 12:19:28 -0800
Subject: [ofa-general] Re: [PATCH] ipoib: don't enable napi when it's already enabled
In-Reply-To: <496E42D6.3020107@Voltaire.COM> (Yossi Etigin's message of "Wed, 14 Jan 2009 21:53:58 +0200")
References: <48FB32CF.6060202@Voltaire.COM> <496E42D6.3020107@Voltaire.COM>
Message-ID:

 > I think the fix (to the fix) should be like the one in the patch below.

I don't see a patch below?

From yosefe at Voltaire.COM Wed Jan 14 13:16:54 2009
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Wed, 14 Jan 2009 23:16:54 +0200
Subject: [ofa-general] Re: [PATCH] ipoib: don't enable napi when it's already enabled
In-Reply-To:
References: <48FB32CF.6060202@Voltaire.COM> <496E42D6.3020107@Voltaire.COM>
Message-ID: <496E5646.1080006@Voltaire.COM>

Roland Dreier wrote:
> > I think the fix (to the fix) should be like the one in the patch below.
>
> I don't see a patch below?
I meant the quoted one (of course it should be rebased on the new code):

Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-19 14:12:55.000000000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-19 14:16:16.000000000 +0200
@@ -106,8 +106,8 @@ int ipoib_open(struct net_device *dev)
 
 	ipoib_dbg(priv, "bringing up interface\n");
 
-	napi_enable(&priv->napi);
-	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
+	if (!test_and_set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		napi_enable(&priv->napi);
 
 	if (ipoib_pkey_dev_delay_open(dev))
 		return 0;

--

From kliteyn at dev.mellanox.co.il Wed Jan 14 14:04:18 2009
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 15 Jan 2009 00:04:18 +0200
Subject: [ofa-general] Re: [PATCH] opensm: update LFTs when entering master state
In-Reply-To: <20090114193612.GM1441@sashak.voltaire.com>
References: <20090111204927.GG1441@sashak.voltaire.com> <496A67D8.9070308@dev.mellanox.co.il> <20090114193612.GM1441@sashak.voltaire.com>
Message-ID: <496E6162.3060208@dev.mellanox.co.il>

Hi Sasha,

Sasha Khapyorsky wrote:
> Hi Yevgeny,
>
> On 23:42 Sun 11 Jan , Yevgeny Kliteynik wrote:
>> Hi Sasha,
>>
>> Sasha Khapyorsky wrote:
>>> When we are going to setup LFTs we need to ignore its previous images if
>>> OpenSM enters master after standby, so need to check for subnet
>>> need_update flag too.
>> Nice catch. I think there will be a similar problem with
>> cached routing too - need to invalidate the cache when SM
>> enters master state.
>
> Yes, obviously.
Maybe something like this will do: > > > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c > index 625e026..fc7ceb9 100644 > --- a/opensm/opensm/osm_state_mgr.c > +++ b/opensm/opensm/osm_state_mgr.c > @@ -1089,7 +1089,7 @@ static void do_sweep(osm_sm_t * sm) > */ > if (sm->p_subn->opt.use_ucast_cache && > (sm->p_subn->subnet_initialization_error || > - sm->p_subn->force_reroute)) > + sm->p_subn->force_reroute || sm->p_subn->coming_out_of_standby)) Sure, looks like it would do the job. > osm_ucast_cache_invalidate(&sm->ucast_mgr); > > /* > > > (and basically I think tah all those "flags flow" mess requires cleanup > already :)). Oh yes... -- Yevgeny > Sasha > From rdreier at cisco.com Wed Jan 14 14:56:15 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 14 Jan 2009 14:56:15 -0800 Subject: [ofa-general] Re: [PATCH] ipoib: don't enable napi when it's already enabled In-Reply-To: <496E42D6.3020107@Voltaire.COM> (Yossi Etigin's message of "Wed, 14 Jan 2009 21:53:58 +0200") References: <48FB32CF.6060202@Voltaire.COM> <496E42D6.3020107@Voltaire.COM> Message-ID: Oh, I see, thanks. I think you're right about how to fix this, so I'll queue up the following: After commit fe25c561 ("IPoIB: Don't enable NAPI when it's already enabled"), if an interface is brought up but the corresponding P_Key never appears, then ipoib_stop() will hang in napi_disable(), because ipoib_open() returns before it does napi_enable(). Fix this by changing ipoib_open() to call napi_enable() even if the P_Key isn't present. 
Reported-by: Yossi Etigin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 27 +++++++++++++++------------ 1 files changed, 15 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index dce0443..0bd2a4f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -106,23 +106,17 @@ int ipoib_open(struct net_device *dev) ipoib_dbg(priv, "bringing up interface\n"); - set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + if (!test_and_set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + napi_enable(&priv->napi); if (ipoib_pkey_dev_delay_open(dev)) return 0; - napi_enable(&priv->napi); + if (ipoib_ib_dev_open(dev)) + goto err_disable; - if (ipoib_ib_dev_open(dev)) { - napi_disable(&priv->napi); - return -EINVAL; - } - - if (ipoib_ib_dev_up(dev)) { - ipoib_ib_dev_stop(dev, 1); - napi_disable(&priv->napi); - return -EINVAL; - } + if (ipoib_ib_dev_up(dev)) + goto err_stop; if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) { struct ipoib_dev_priv *cpriv; @@ -144,6 +138,15 @@ int ipoib_open(struct net_device *dev) netif_start_queue(dev); return 0; + +err_stop: + ipoib_ib_dev_stop(dev, 1); + +err_disable: + napi_disable(&priv->napi); + clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + + return -EINVAL; } static int ipoib_stop(struct net_device *dev) -- 1.6.0.4 From fengfeixuewu at 163.com Wed Jan 14 19:02:51 2009 From: fengfeixuewu at 163.com (songzhonglei) Date: Thu, 15 Jan 2009 11:02:51 +0800 Subject: [ofa-general] install ofed on xen Message-ID: <200901151102470151942@163.com> hi, i am trying to install ofed on debian with the kernel > of 2.6.18-6-xen-686,but meet such errors>>>>>>>>>>> > > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18/include/linux/mm.h: > In function 'is_vmalloc_addr': > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18/include/linux/mm.h:21: > error: 'PKMAP_BASE' 
undeclared (first use in this function)
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18/include/linux/mm.h:21:
> error: (Each undeclared identifier is reported only once
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18/include/linux/mm.h:21:
> error: for each function it appears in.)

What's the problem? Thank you in advance.

2009-01-15
songzhonglei

From rdreier at cisco.com Wed Jan 14 21:49:22 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 14 Jan 2009 21:49:22 -0800
Subject: [ofa-general] Re: [PATCH v2] ipoib: fix a deadlock between ipoib start/stop and child interface create/delete
In-Reply-To: <495B6B8B.50803@Voltaire.COM> (Yossi Etigin's message of "Wed, 31 Dec 2008 14:54:35 +0200")
References: <495B6B8B.50803@Voltaire.COM>
Message-ID:

I thought about this some more, and I realize we can just always take
the vlan_mutex after the RTNL rather than having to add another delayed
work item or anything like that. So I queued up the patch below -- the
issue you found causes a clear lockdep warning, and this patch fixes
that for me, so I'm confident this fixes the problem.

---

IPoIB: Fix deadlock between ipoib_open() and child interface create

Fix a deadlock between child interface creation/deletion and ipoib
start/stop. The former takes vlan_mutex, and then might take RTNL via
register_netdev()/unregister_netdev(). The latter is executed with
RTNL held, and tries to take vlan_mutex, which can lead to an AB-BA
deadlock.

Fix this by having the child interface creation/deletion code take the
RTNL first so vlan_mutex always nests inside RTNL. We can use
register_netdevice() for child interfaces because we form the
interface name from the parent interface and hence don't need the '%'
expansion of register_netdev().
Reported-by: Yossi Etigin Signed-off-by: Roland Dreier --- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 2cf1a40..5a76a55 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -61,6 +61,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) ppriv = netdev_priv(pdev); + rtnl_lock(); mutex_lock(&ppriv->vlan_mutex); /* @@ -111,7 +112,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) goto device_init_failed; } - result = register_netdev(priv->dev); + result = register_netdevice(priv->dev); if (result) { ipoib_warn(priv, "failed to initialize; error %i", result); goto register_failed; @@ -134,12 +135,13 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey) list_add_tail(&priv->list, &ppriv->child_intfs); mutex_unlock(&ppriv->vlan_mutex); + rtnl_unlock(); return 0; sysfs_failed: ipoib_delete_debug_files(priv->dev); - unregister_netdev(priv->dev); + unregister_netdevice(priv->dev); register_failed: ipoib_dev_cleanup(priv->dev); @@ -149,6 +151,7 @@ device_init_failed: err: mutex_unlock(&ppriv->vlan_mutex); + rtnl_unlock(); return result; } @@ -162,10 +165,11 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) ppriv = netdev_priv(pdev); + rtnl_lock(); mutex_lock(&ppriv->vlan_mutex); list_for_each_entry_safe(priv, tpriv, &ppriv->child_intfs, list) { if (priv->pkey == pkey) { - unregister_netdev(priv->dev); + unregister_netdevice(priv->dev); ipoib_dev_cleanup(priv->dev); list_del(&priv->list); free_netdev(priv->dev); @@ -175,6 +179,7 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey) } } mutex_unlock(&ppriv->vlan_mutex); + rtnl_unlock(); return ret; } -- 1.6.0.4 From dorfman.eli at gmail.com Wed Jan 14 22:43:23 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) 
Date: Thu, 15 Jan 2009 08:43:23 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c Fix memory leak for QOS string parameters. In-Reply-To: <20090114154200.GB1640@sashak.voltaire.com> References: <494E2E5A.8050008@gmail.com> <20090114154200.GB1640@sashak.voltaire.com> Message-ID: <496EDB0B.2090207@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 13:54 Sun 21 Dec , Eli Dorfman (Voltaire) wrote: >> Fix memory leak for QOS string parameters. >> >> Signed-off-by: Slava Strebkov >> >> --- >> opensm/opensm/osm_subnet.c | 21 +++++++++++++++++++++ >> 1 files changed, 21 insertions(+), 0 deletions(-) >> >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c >> index 122d4dd..f8b29f8 100644 >> --- a/opensm/opensm/osm_subnet.c >> +++ b/opensm/opensm/osm_subnet.c >> @@ -331,6 +331,21 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) >> opt->sl2vl = NULL; >> } >> >> +static void subn_free_qos_options(IN osm_qos_options_t * opt) >> +{ >> + if ((opt->vlarb_high) && (opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH)) { >> + free(opt->vlarb_high); >> + } >> + >> + if ((opt->vlarb_low) && (opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW)) { >> + free(opt->vlarb_low); >> + } >> + >> + if ((opt->sl2vl) && (opt->sl2vl != OSM_DEFAULT_QOS_SL2VL)) { >> + free(opt->sl2vl); >> + } >> +} > > With gcc-4.3.2 using '-Wall' flag I get warning here: > > "comparison with string literal results in unspecified behavior" > > It is actually true since used OSM_DEFAULT_QOS_* macros are defined as > strings - it is something equal to: > > char *p = "123456"; > > .... > > if (p != "123456") > free(p1); > > Gcc is smart enough and uses the same string constant "123456" in both > cases (so your patch actually works), but don't think that it is > guaranteed in C language. > > So I think to rework this part - we can always allocate string qos config > parameters (just similar to other config), and free it when it is not > NULL. 
Something like this: > > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index a6db304..94b6332 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -333,13 +333,13 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) > > static void subn_free_qos_options(IN osm_qos_options_t * opt) > { > - if (opt->vlarb_high && opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) > + if (opt->vlarb_high) > free(opt->vlarb_high); > > - if (opt->vlarb_low && opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) > + if (opt->vlarb_low) > free(opt->vlarb_low); > > - if (opt->sl2vl && opt->sl2vl != OSM_DEFAULT_QOS_SL2VL) > + if (opt->sl2vl) > free(opt->sl2vl); > } > > @@ -803,7 +803,7 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix, > if (*vlarb == NULL) { > log_report(" Invalid Cached Option: %s_vlarb_%s: " > "Using Default\n", prefix, suffix); > - *vlarb = dflt; > + *vlarb = strdup(dflt); > return; > } > > @@ -872,7 +872,7 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt) > if (*sl2vl == NULL) { > log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n", > prefix); > - *sl2vl = dflt; > + *sl2vl = strdup(dflt); > return; > } > > > Thoughts? looks cleaner. will you commit this? 
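
The concern raised in this thread about comparing a pointer against a string literal can be shown in isolation. This is only a minimal sketch under assumed names: struct qos_opts, dup_string(), the qos_opts_*() helpers and the default strings are all hypothetical and are not OpenSM's. The idea it illustrates is the one proposed above: if defaults are always stored as heap copies, the free path no longer needs to compare pointers against literals at all.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical default values; the real OpenSM defaults are different. */
#define DEFAULT_VLARB_HIGH "example-high-default"
#define DEFAULT_SL2VL      "example-sl2vl-default"

/* Hypothetical option block holding owned, heap-allocated strings. */
struct qos_opts {
	char *vlarb_high;
	char *sl2vl;
};

/* Portable replacement for strdup(); returns NULL on allocation failure. */
static char *dup_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/* Fill in unset options with heap copies of the defaults. */
static void qos_opts_set_defaults(struct qos_opts *opt)
{
	assert(opt != NULL);
	if (!opt->vlarb_high)
		opt->vlarb_high = dup_string(DEFAULT_VLARB_HIGH);
	if (!opt->sl2vl)
		opt->sl2vl = dup_string(DEFAULT_SL2VL);
}

/* Every non-NULL field is heap memory, so it can be freed unconditionally. */
static void qos_opts_free(struct qos_opts *opt)
{
	assert(opt != NULL);
	free(opt->vlarb_high);	/* free(NULL) is a no-op */
	free(opt->sl2vl);
	opt->vlarb_high = NULL;
	opt->sl2vl = NULL;
}
```

The cost is one small allocation per defaulted option; in exchange, ownership is uniform and no code depends on whether the compiler merges identical string literals.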
Thanks, Eli From dorfman.eli at gmail.com Wed Jan 14 22:53:10 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Thu, 15 Jan 2009 08:53:10 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v3 2/2] infiniband-diags support PortXmitWait get and set In-Reply-To: <20090114185446.GJ1441@sashak.voltaire.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> <496CA602.1060600@gmail.com> <496CABB9.4030402@gmail.com> <20090114185446.GJ1441@sashak.voltaire.com> Message-ID: <496EDD56.3070309@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 16:56 Tue 13 Jan , Eli Dorfman (Voltaire) wrote: >> support PortXmitWait get and set >> fix syntax error >> >> Signed-off-by: Eli Dorfman >> --- >> infiniband-diags/src/perfquery.c | 14 +++++++++++++- >> libibmad/src/gs.c | 2 ++ >> 2 files changed, 15 insertions(+), 1 deletions(-) >> >> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c >> index 41a8b74..5e5b3ed 100644 >> --- a/infiniband-diags/src/perfquery.c >> +++ b/infiniband-diags/src/perfquery.c >> @@ -67,6 +67,7 @@ struct perf_count { >> uint32_t rcvdata; >> uint32_t xmtpkts; >> uint32_t rcvpkts; >> + uint32_t xmtwait; >> }; >> >> struct perf_count_ext { >> @@ -209,6 +210,8 @@ static void aggregate_perfcounters(void) >> aggregate_32bit(&perf_count.xmtpkts, val); >> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); >> aggregate_32bit(&perf_count.rcvpkts, val); >> + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); > > Please use tab character for indentation. 
> >> + aggregate_32bit(&perf_count.xmtwait, val); >> } >> >> static void output_aggregate_perfcounters(ib_portid_t *portid) >> @@ -235,6 +238,7 @@ static void output_aggregate_perfcounters(ib_portid_t *portid) >> mad_encode_field(pc, IB_PC_RCV_BYTES_F, &perf_count.rcvdata); >> mad_encode_field(pc, IB_PC_XMT_PKTS_F, &perf_count.xmtpkts); >> mad_encode_field(pc, IB_PC_RCV_PKTS_F, &perf_count.rcvpkts); >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >> >> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); >> >> @@ -298,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p >> if (extended != 1) { >> if (!port_performance_query(pc, portid, port, timeout)) >> IBERROR("perfquery"); >> + if (!(cap_mask & 0x1000)) { >> + /* if PortCounters:PortXmitWait not suppported clear this counter */ >> + perf_count.xmtwait = 0; >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >> + } > > Is this a good thing to hide the reported value? We could to not show > XmitWait at all in case when it is not supported, or to show it as was > reported by port and not "to lie" about zero value. This is a good idea, but it requires to pass the mask to the mad_dump_perfcounters which is a generic function. I preferred to leave it like this than making a special case for perfcounters dump. > >> if (aggregate) >> aggregate_perfcounters(); >> - else >> + else >> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); >> } else { >> if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ >> @@ -500,6 +509,9 @@ main(int argc, char **argv) >> >> do_reset: >> >> + if (!extended && (cap_mask & 0x1000)) >> + mask |= (1<<16); /* reset portxmitwait */ > > The counter mask can be specified by user in command line to reset only > certain counters. The code above will add XmitWait unconditionally > regardless to user wishes. 
So shouldn't it be something like: > > if (argc <= 2 && !extended && (cap_mask & 0x1000)) > mask |= (1<<16); /* reset portxmitwait */ > i agree. > > Sasha > >> + >> if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { >> for (i = start_port; i <= num_ports; i++) >> reset_counters(extended, timeout, mask, &portid, i); >> diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c >> index d350c0d..30f00fb 100644 >> --- a/libibmad/src/gs.c >> +++ b/libibmad/src/gs.c >> @@ -142,6 +142,8 @@ performance_reset_via(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, >> /* Same for attribute IDs */ >> mad_set_field(rcvbuf, 0, IB_PC_PORT_SELECT_F, port); >> mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT_F, mask); >> + mask = mask >> 16; >> + mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT2_F, mask); >> rpc.attr.mod = 0; >> rpc.timeout = timeout; >> rpc.datasz = IB_PC_DATA_SZ; >> -- >> 1.5.5 >> From monis at Voltaire.COM Wed Jan 14 23:53:55 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 15 Jan 2009 09:53:55 +0200 Subject: [ofa-general] Re: [PATCH v2] IPoIB: refresh paths that might be invalid In-Reply-To: <496E343C.3080903@Voltaire.COM> References: <496CB6DE.5090100@Voltaire.COM> <496CB89C.7020108@Voltaire.COM> <496DCE51.6070808@Voltaire.COM> <496E343C.3080903@Voltaire.COM> Message-ID: <496EEB93.5040201@Voltaire.COM> Yossi Etigin wrote: > How about marking neighbours invalid (instead of paths) > then you could mark all neighbours invalid on flush events, > and in ipoib_start_xmit() add neigh->valid to the gid memcmp > condition, so that if neigh->valid is 0 ipoib will free it, > and call ipoib_path_loookup()? 
> Destroying the neigh destroys the ah and if we are going this way we lose the benefits of FLUSH_LIGHT event which is keeping the ah with the assumption that it is still valid (which is the case most of the times) From monis at Voltaire.COM Wed Jan 14 23:57:38 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 15 Jan 2009 09:57:38 +0200 Subject: [ofa-general] [PATCH] mlx4_ib: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD0178B788@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD0178B788@mtlexch01.mtl.com> Message-ID: <496EEC72.7030004@Voltaire.COM> Jack Morgenstein wrote: > If we are generating a LID CHANGE event, why do we need the CLIENT > REREGISTER at all? > If both are generated, you will do BOTH light and heavy sweeps -- isn't > this a waste? > Preferable to check for LID CHANGE first (and generate the event if > needed). > Only if there is no LID CHANGE event generated, check for CLIENT REREG. > > Thoughts? > It looks OK to me. >From the point of view of IPoIB LID change event with a client re-register is pointless but is it the case for all others (current and future) that listen to the IB event channel? From gmpc at sanger.ac.uk Thu Jan 15 01:29:30 2009 From: gmpc at sanger.ac.uk (Guy Coates) Date: Thu, 15 Jan 2009 09:29:30 +0000 Subject: [ofa-general] installing ofed-1.4 In-Reply-To: <18798.11261.165834.519441@ifh-tuww01.ifh.uni-karlsruhe.de> References: <18798.11261.165834.519441@ifh-tuww01.ifh.uni-karlsruhe.de> Message-ID: <496F01FA.6060003@sanger.ac.uk> Markus Uhlmann wrote: > Thanks Vladimir & Guy. > > Another small thing i noticed: is it correct that I manually need to > remove the infiniband-related modules already loaded from the "stock" > kernel(i.e. "rmmod" on ib_mthca,ib_mad,ib_core,mlx4_core) before > running "/etc/init.d/openibd start" ??? Yes, you need to rmmod the modules, but you do not have to remove them from /lib/modules/XXX. 
The new modules will automatically get picked up out of /lib/modules/XXXX/extra (assuming that depmod has been run). Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From vlad at lists.openfabrics.org Thu Jan 15 03:13:53 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 15 Jan 2009 03:13:53 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090115-0200 daily build status Message-ID: <20090115111353.5B505E60F1D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 
Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Thu Jan 15 05:54:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 Jan 2009 15:54:45 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_subnet.c Fix memory leak for QOS string parameters. In-Reply-To: <496EDB0B.2090207@gmail.com> References: <494E2E5A.8050008@gmail.com> <20090114154200.GB1640@sashak.voltaire.com> <496EDB0B.2090207@gmail.com> Message-ID: <20090115135445.GC1640@sashak.voltaire.com> On 08:43 Thu 15 Jan , Eli Dorfman (Voltaire) wrote: > Sasha Khapyorsky wrote: > > Hi Eli, > > > > On 13:54 Sun 21 Dec , Eli Dorfman (Voltaire) wrote: > >> Fix memory leak for QOS string parameters. 
> >> > >> Signed-off-by: Slava Strebkov > >> > >> --- > >> opensm/opensm/osm_subnet.c | 21 +++++++++++++++++++++ > >> 1 files changed, 21 insertions(+), 0 deletions(-) > >> > >> diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > >> index 122d4dd..f8b29f8 100644 > >> --- a/opensm/opensm/osm_subnet.c > >> +++ b/opensm/opensm/osm_subnet.c > >> @@ -331,6 +331,21 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) > >> opt->sl2vl = NULL; > >> } > >> > >> +static void subn_free_qos_options(IN osm_qos_options_t * opt) > >> +{ > >> + if ((opt->vlarb_high) && (opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH)) { > >> + free(opt->vlarb_high); > >> + } > >> + > >> + if ((opt->vlarb_low) && (opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW)) { > >> + free(opt->vlarb_low); > >> + } > >> + > >> + if ((opt->sl2vl) && (opt->sl2vl != OSM_DEFAULT_QOS_SL2VL)) { > >> + free(opt->sl2vl); > >> + } > >> +} > > > > With gcc-4.3.2 using '-Wall' flag I get warning here: > > > > "comparison with string literal results in unspecified behavior" > > > > It is actually true since used OSM_DEFAULT_QOS_* macros are defined as > > strings - it is something equal to: > > > > char *p = "123456"; > > > > .... > > > > if (p != "123456") > > free(p1); > > > > Gcc is smart enough and uses the same string constant "123456" in both > > cases (so your patch actually works), but don't think that it is > > guaranteed in C language. > > > > So I think to rework this part - we can always allocate string qos config > > parameters (just similar to other config), and free it when it is not > > NULL. 
Something like this: > > > > > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > > index a6db304..94b6332 100644 > > --- a/opensm/opensm/osm_subnet.c > > +++ b/opensm/opensm/osm_subnet.c > > @@ -333,13 +333,13 @@ static void subn_init_qos_options(IN osm_qos_options_t * opt) > > > > static void subn_free_qos_options(IN osm_qos_options_t * opt) > > { > > - if (opt->vlarb_high && opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) > > + if (opt->vlarb_high) > > free(opt->vlarb_high); > > > > - if (opt->vlarb_low && opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) > > + if (opt->vlarb_low) > > free(opt->vlarb_low); > > > > - if (opt->sl2vl && opt->sl2vl != OSM_DEFAULT_QOS_SL2VL) > > + if (opt->sl2vl) > > free(opt->sl2vl); > > } > > > > @@ -803,7 +803,7 @@ static void subn_verify_vlarb(char **vlarb, const char *prefix, > > if (*vlarb == NULL) { > > log_report(" Invalid Cached Option: %s_vlarb_%s: " > > "Using Default\n", prefix, suffix); > > - *vlarb = dflt; > > + *vlarb = strdup(dflt); > > return; > > } > > > > @@ -872,7 +872,7 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt) > > if (*sl2vl == NULL) { > > log_report(" Invalid Cached Option: %s_sl2vl: Using Default\n", > > prefix); > > - *sl2vl = dflt; > > + *sl2vl = strdup(dflt); > > return; > > } > > > > > > Thoughts? > > looks cleaner. will you commit this? I will. 
Sasha From sashak at voltaire.com Thu Jan 15 05:58:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 15 Jan 2009 15:58:24 +0200 Subject: [ofa-general] [PATCH v3 2/2] infiniband-diags support PortXmitWait get and set In-Reply-To: <496EDD56.3070309@gmail.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> <496CA602.1060600@gmail.com> <496CABB9.4030402@gmail.com> <20090114185446.GJ1441@sashak.voltaire.com> <496EDD56.3070309@gmail.com> Message-ID: <20090115135814.GD1640@sashak.voltaire.com> On 08:53 Thu 15 Jan , Eli Dorfman (Voltaire) wrote: > Sasha Khapyorsky wrote: > > Hi Eli, > > > > On 16:56 Tue 13 Jan , Eli Dorfman (Voltaire) wrote: > >> support PortXmitWait get and set > >> fix syntax error > >> > >> Signed-off-by: Eli Dorfman > >> --- > >> infiniband-diags/src/perfquery.c | 14 +++++++++++++- > >> libibmad/src/gs.c | 2 ++ > >> 2 files changed, 15 insertions(+), 1 deletions(-) > >> > >> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c > >> index 41a8b74..5e5b3ed 100644 > >> --- a/infiniband-diags/src/perfquery.c > >> +++ b/infiniband-diags/src/perfquery.c > >> @@ -67,6 +67,7 @@ struct perf_count { > >> uint32_t rcvdata; > >> uint32_t xmtpkts; > >> uint32_t rcvpkts; > >> + uint32_t xmtwait; > >> }; > >> > >> struct perf_count_ext { > >> @@ -209,6 +210,8 @@ static void aggregate_perfcounters(void) > >> aggregate_32bit(&perf_count.xmtpkts, val); > >> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); > >> aggregate_32bit(&perf_count.rcvpkts, val); > >> + mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); > > > > Please use tab character for indentation. 
> > > >> + aggregate_32bit(&perf_count.xmtwait, val); > >> } > >> > >> static void output_aggregate_perfcounters(ib_portid_t *portid) > >> @@ -235,6 +238,7 @@ static void output_aggregate_perfcounters(ib_portid_t *portid) > >> mad_encode_field(pc, IB_PC_RCV_BYTES_F, &perf_count.rcvdata); > >> mad_encode_field(pc, IB_PC_XMT_PKTS_F, &perf_count.xmtpkts); > >> mad_encode_field(pc, IB_PC_RCV_PKTS_F, &perf_count.rcvpkts); > >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); > >> > >> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > >> > >> @@ -298,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p > >> if (extended != 1) { > >> if (!port_performance_query(pc, portid, port, timeout)) > >> IBERROR("perfquery"); > >> + if (!(cap_mask & 0x1000)) { > >> + /* if PortCounters:PortXmitWait not suppported clear this counter */ > >> + perf_count.xmtwait = 0; > >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); > >> + } > > > > Is this a good thing to hide the reported value? We could to not show > > XmitWait at all in case when it is not supported, or to show it as was > > reported by port and not "to lie" about zero value. > > This is a good idea, but it requires to pass the mask to the mad_dump_perfcounters > which is a generic function. > I preferred to leave it like this than making a special case for perfcounters dump. So I'm removing this if ()...? Sasha > > > > >> if (aggregate) > >> aggregate_perfcounters(); > >> - else > >> + else > >> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); > >> } else { > >> if (!(cap_mask & 0x200)) /* 1.2 errata: bit 9 is extended counter support */ > >> @@ -500,6 +509,9 @@ main(int argc, char **argv) > >> > >> do_reset: > >> > >> + if (!extended && (cap_mask & 0x1000)) > >> + mask |= (1<<16); /* reset portxmitwait */ > > > > The counter mask can be specified by user in command line to reset only > > certain counters. 
The code above will add XmitWait unconditionally > > regardless to user wishes. So shouldn't it be something like: > > > > if (argc <= 2 && !extended && (cap_mask & 0x1000)) > > mask |= (1<<16); /* reset portxmitwait */ > > > > i agree. > > > > > > Sasha > > > >> + > >> if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) { > >> for (i = start_port; i <= num_ports; i++) > >> reset_counters(extended, timeout, mask, &portid, i); > >> diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c > >> index d350c0d..30f00fb 100644 > >> --- a/libibmad/src/gs.c > >> +++ b/libibmad/src/gs.c > >> @@ -142,6 +142,8 @@ performance_reset_via(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, > >> /* Same for attribute IDs */ > >> mad_set_field(rcvbuf, 0, IB_PC_PORT_SELECT_F, port); > >> mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT_F, mask); > >> + mask = mask >> 16; > >> + mad_set_field(rcvbuf, 0, IB_PC_COUNTER_SELECT2_F, mask); > >> rpc.attr.mod = 0; > >> rpc.timeout = timeout; > >> rpc.datasz = IB_PC_DATA_SZ; > >> -- > >> 1.5.5 > >> > From hal.rosenstock at gmail.com Thu Jan 15 06:40:40 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 15 Jan 2009 09:40:40 -0500 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] opensm/osm_port_info_rcv.c: don't clear sw->need_update if port 0 is active In-Reply-To: <496A691C.2000809@dev.mellanox.co.il> References: <495C8576.9060004@dev.mellanox.co.il> <20090107162507.GG11759@sashak.voltaire.com> <496A691C.2000809@dev.mellanox.co.il> Message-ID: On Sun, Jan 11, 2009 at 4:48 PM, Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: >> >> On 10:57 Thu 01 Jan , Yevgeny Kliteynik wrote: >>> >>> When switch is coming up after reset, port 0 always reports >>> logical state ACTIVE. >>> OpenSM shouldn't clear sw->need_update flag because of port 0. >>> >>> Signed-off-by: Yevgeny Kliteynik >> >> Applied. Thanks. 
>> >> BTW is it legal from IBA point of view for switch to setup logical state >> of port 0 without SM intervention after reset? > > Good question. I don't remember any special port 0 treatment in the spec > WRT the port state... For base port 0 (BSP0), port state is not used. For enhanced port 0 (ESP0), it behaves like a TCA. I would expect that on reset, the port would be INIT and require the SM to walk it up to ACTIVE but once there it would likely remain there. I don't know if this impacts the patch incorporated or not or whether this discussion is an aside to that. -- Hal > -- Yevgeny > >> Sasha From chien.tin.tung at intel.com Thu Jan 15 07:38:12 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Thu, 15 Jan 2009 09:38:12 -0600 Subject: [ofa-general] [PATCH] RDMA/nes: update copyright to new legal entity and year Message-ID: <20090115153812.GA2212@ctung-MOBL> Update copyright to the new legal entity, Intel-NE, Inc. an Intel company. Update copyright for the new year. Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes.c | 2 +- drivers/infiniband/hw/nes/nes.h | 2 +- drivers/infiniband/hw/nes/nes_cm.c | 2 +- drivers/infiniband/hw/nes/nes_cm.h | 2 +- drivers/infiniband/hw/nes/nes_context.h | 2 +- drivers/infiniband/hw/nes/nes_hw.c | 2 +- drivers/infiniband/hw/nes/nes_hw.h | 2 +- drivers/infiniband/hw/nes/nes_nic.c | 2 +- drivers/infiniband/hw/nes/nes_user.h | 2 +- drivers/infiniband/hw/nes/nes_utils.c | 2 +- drivers/infiniband/hw/nes/nes_verbs.c | 2 +- drivers/infiniband/hw/nes/nes_verbs.h | 2 +- 12 files changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c index b9611ad..ca59976 100644 --- a/drivers/infiniband/hw/nes/nes.c +++ b/drivers/infiniband/hw/nes/nes.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. 
* * This software is available to you under a choice of one of two diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h index 13a5bb1..04b12ad 100644 --- a/drivers/infiniband/hw/nes/nes.h +++ b/drivers/infiniband/hw/nes/nes.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. * * This software is available to you under a choice of one of two diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index ca9ef3f..63b1c34 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h index fafa350..4ab2beb 100644 --- a/drivers/infiniband/hw/nes/nes_cm.h +++ b/drivers/infiniband/hw/nes/nes_cm.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_context.h b/drivers/infiniband/hw/nes/nes_context.h index da9daba..0fb8d81 100644 --- a/drivers/infiniband/hw/nes/nes_context.h +++ b/drivers/infiniband/hw/nes/nes_context.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 5d139db..cb4a5f3 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h index bc0b4de..6f8712d 100644 --- a/drivers/infiniband/hw/nes/nes_hw.h +++ b/drivers/infiniband/hw/nes/nes_hw.h @@ -1,5 +1,5 @@ /* -* Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. +* Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c index 57a47cf..8e1d073 100644 --- a/drivers/infiniband/hw/nes/nes_nic.c +++ b/drivers/infiniband/hw/nes/nes_nic.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_user.h b/drivers/infiniband/hw/nes/nes_user.h index e64306b..cc90c14 100644 --- a/drivers/infiniband/hw/nes/nes_user.h +++ b/drivers/infiniband/hw/nes/nes_user.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * Copyright (c) 2005 Topspin Communications. All rights reserved. * Copyright (c) 2005 Cisco Systems. All rights reserved. 
* Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. diff --git a/drivers/infiniband/hw/nes/nes_utils.c b/drivers/infiniband/hw/nes/nes_utils.c index 6f3bc1b..a282031 100644 --- a/drivers/infiniband/hw/nes/nes_utils.c +++ b/drivers/infiniband/hw/nes/nes_utils.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 4fdb724..4cfb4d9 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU diff --git a/drivers/infiniband/hw/nes/nes_verbs.h b/drivers/infiniband/hw/nes/nes_verbs.h index 6c6b4da..da3c368 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.h +++ b/drivers/infiniband/hw/nes/nes_verbs.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2006 - 2008 NetEffect, Inc. All rights reserved. + * Copyright (c) 2006 - 2009 Intel-NE, Inc. All rights reserved. * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. * * This software is available to you under a choice of one of two -- 1.5.3.3 From dledford at redhat.com Thu Jan 15 07:52:00 2009 From: dledford at redhat.com (Doug Ledford) Date: Thu, 15 Jan 2009 10:52:00 -0500 Subject: [ofa-general] mlx4_en and ppc64, is it supposed to work? Message-ID: <1232034720.32405.1515.camel@firewall.xsintricity.com> It doesn't compile here, so I was wondering if it's even intended to. 
-- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From David.Chevalier at ge.com Thu Jan 15 07:10:27 2009 From: David.Chevalier at ge.com (Chevalier, David (GE Healthcare)) Date: Thu, 15 Jan 2009 10:10:27 -0500 Subject: [ofa-general] opensm master determination Message-ID: <68D58DEFB8673048A64DE1FBE56BEE180AB15AB8@CINMLVEM11.e2k.ad.ge.com> Hi - What is the algorithm that opensm uses to determine who is master among multiple opensm's? My scenario is that I have 2 nodes, each is running opensmd. In some test hardware, node A is always the master no matter which one starts opensmd first. In other test hardware, node B is always the master, again regardless of opensmd start order. I'm running OFED 1.3 with mthca driver on MT25208 based HCA. Thanks, Dave From YJia at tmriusa.com Thu Jan 15 09:47:14 2009 From: YJia at tmriusa.com (Yicheng Jia) Date: Thu, 15 Jan 2009 11:47:14 -0600 Subject: [ofa-general] local DMA transfer? Message-ID: Hi Folks, Is it possible to do local DMA transfer using QPs in a single port HCA? Thanks! Best, Yicheng Jia From PHF at zurich.ibm.com Thu Jan 15 10:37:05 2009 From: PHF at zurich.ibm.com (Philip Frey1) Date: Thu, 15 Jan 2009 19:37:05 +0100 Subject: [ofa-general] QP needs a lot of memory Message-ID: Hello, I am running OFED 1.4 on a Chelsio T3 RNIC.
When I was trying to connect a large number of clients (several hundred), I noticed that the server was running out of memory. Whenever I do an rdma_create_qp(), I lose about 7MB of main memory. This severely limits the scalability of my application. Is there a reason for that? Many thanks and best regards, Philip From sean.hefty at intel.com Thu Jan 15 20:12:40 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 15 Jan 2009 20:12:40 -0800 Subject: [ofa-general] QP needs a lot of memory In-Reply-To: References: Message-ID: <000001c97790$acad6310$82cd180a@amr.corp.intel.com> >I am running OFED 1.4 on a Chelsio T3 RNIC. >When I was trying to connect a large number of clients (several hundred), >I noticed that the server was running out of memory. Whenever I do an >rdma_create_qp(), I lose about 7MB of main memory. >This severely limits the scalability of my application. > >Is there a reason for that? The amount of memory needed per QP is based on the send and receive queue sizes, plus the number of SGEs. I don't know enough specific details about the Chelsio adapter itself to say whether 7MB is high or not. - Sean From rdreier at cisco.com Thu Jan 15 21:00:37 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 15 Jan 2009 21:00:37 -0800 Subject: [ofa-general] mlx4_en and ppc64, is it supposed to work? In-Reply-To: <1232034720.32405.1515.camel@firewall.xsintricity.com> (Doug Ledford's message of "Thu, 15 Jan 2009 10:52:00 -0500") References: <1232034720.32405.1515.camel@firewall.xsintricity.com> Message-ID: > It doesn't compile here, so I was wondering if it's even intended to. I'm able to build mlx4_en on powerpc here with Linus's latest tree with no problem. What problems do you see? - R.
From nicolas.morey-chaisemartin at ext.bull.net Thu Jan 15 23:09:53 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Fri, 16 Jan 2009 08:09:53 +0100 Subject: [ofa-general] opensm master determination In-Reply-To: <68D58DEFB8673048A64DE1FBE56BEE180AB15AB8@CINMLVEM11.e2k.ad.ge.com> References: <68D58DEFB8673048A64DE1FBE56BEE180AB15AB8@CINMLVEM11.e2k.ad.ge.com> Message-ID: <497032C1.3090008@ext.bull.net> For the same priority level, the SM with the lowest GUID takes control. From osm_sminfo_rcv.c: /********************************************************************** Return TRUE if the remote sm given (by ib_sm_info_t) is higher, return FALSE otherwise. By higher - we mean: SM with higher priority or with same priority and lower GUID. **********************************************************************/ static inline boolean_t __osm_sminfo_rcv_remote_sm_is_higher(IN osm_sm_t * sm, IN const ib_sm_info_t * p_remote_smi) Chevalier, David (GE Healthcare) wrote: > Hi - > What is the algorithm that opensm uses to determine who is master among > multiple opensm's? > > My scenario is that I have 2 nodes, each is running opensmd. > In some test hardware, node A is always the master no matter which one > starts opensmd first. > In other test hardware, node B is always the master, again regardless of > opensmd start order. > > I'm running OFED 1.3 with mthca driver on MT25208 based HCA.
> > Thanks, > Dave > > From dorfman.eli at gmail.com Thu Jan 15 23:37:55 2009 From: dorfman.eli at gmail.com (Eli Dorfman) Date: Fri, 16 Jan 2009 09:37:55 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v3 2/2] infiniband-diags support PortXmitWait get and set In-Reply-To: <20090115135814.GD1640@sashak.voltaire.com> References: <4950E07F.6090104@gmail.com> <496CA539.1040204@gmail.com> <496CA602.1060600@gmail.com> <496CABB9.4030402@gmail.com> <20090114185446.GJ1441@sashak.voltaire.com> <496EDD56.3070309@gmail.com> <20090115135814.GD1640@sashak.voltaire.com> Message-ID: <694d48600901152337l43ca3913i2c7bee511ab10c35@mail.gmail.com> On Thu, Jan 15, 2009 at 3:58 PM, Sasha Khapyorsky wrote: > On 08:53 Thu 15 Jan , Eli Dorfman (Voltaire) wrote: >> Sasha Khapyorsky wrote: >> > Hi Eli, >> > >> > On 16:56 Tue 13 Jan , Eli Dorfman (Voltaire) wrote: >> >> support PortXmitWait get and set >> >> fix syntax error >> >> >> >> Signed-off-by: Eli Dorfman >> >> --- >> >> infiniband-diags/src/perfquery.c | 14 +++++++++++++- >> >> libibmad/src/gs.c | 2 ++ >> >> 2 files changed, 15 insertions(+), 1 deletions(-) >> >> >> >> diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c >> >> index 41a8b74..5e5b3ed 100644 >> >> --- a/infiniband-diags/src/perfquery.c >> >> +++ b/infiniband-diags/src/perfquery.c >> >> @@ -67,6 +67,7 @@ struct perf_count { >> >> uint32_t rcvdata; >> >> uint32_t xmtpkts; >> >> uint32_t rcvpkts; >> >> + uint32_t xmtwait; >> >> }; >> >> >> >> struct perf_count_ext { >> >> @@ -209,6 +210,8 @@ static void aggregate_perfcounters(void) >> >> aggregate_32bit(&perf_count.xmtpkts, val); >> >> mad_decode_field(pc, IB_PC_RCV_PKTS_F, &val); >> >> aggregate_32bit(&perf_count.rcvpkts, val); >> >> 
+ mad_decode_field(pc, IB_PC_XMT_WAIT_F, &val); >> > >> > Please use tab character for indentation. >> > >> >> + aggregate_32bit(&perf_count.xmtwait, val); >> >> } >> >> >> >> static void output_aggregate_perfcounters(ib_portid_t *portid) >> >> @@ -235,6 +238,7 @@ static void output_aggregate_perfcounters(ib_portid_t *portid) >> >> mad_encode_field(pc, IB_PC_RCV_BYTES_F, &perf_count.rcvdata); >> >> mad_encode_field(pc, IB_PC_XMT_PKTS_F, &perf_count.xmtpkts); >> >> mad_encode_field(pc, IB_PC_RCV_PKTS_F, &perf_count.rcvpkts); >> >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >> >> >> >> mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc); >> >> >> >> @@ -298,9 +302,14 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask, ib_p >> >> if (extended != 1) { >> >> if (!port_performance_query(pc, portid, port, timeout)) >> >> IBERROR("perfquery"); >> >> + if (!(cap_mask & 0x1000)) { >> >> + /* if PortCounters:PortXmitWait not suppported clear this counter */ >> >> + perf_count.xmtwait = 0; >> >> + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &perf_count.xmtwait); >> >> + } >> > >> > Is this a good thing to hide the reported value? We could to not show >> > XmitWait at all in case when it is not supported, or to show it as was >> > reported by port and not "to lie" about zero value. >> >> This is a good idea, but it requires to pass the mask to the mad_dump_perfcounters >> which is a generic function. >> I preferred to leave it like this than making a special case for perfcounters dump. > > So I'm removing this if ()...? Without changing the perfcounter dump a device that does not support XmitWait might return some garbage value. I preferred to show 0 instead. 
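The two review points in this thread (do not report a made-up zero, and do not add XmitWait to a reset mask the user chose explicitly) can be pictured with a small standalone sketch. It does not use libibmad and is not the perfquery code: demo_dump_xmit_wait, demo_reset_mask, DEMO_CAP_XMIT_WAIT, the "XmtWait" label and the user_gave_mask flag are all hypothetical names chosen for illustration. Only the 0x1000 capability bit and the (1<<16) select bit are taken from the patch quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Capability bit tested in the patch for PortXmitWait support. */
#define DEMO_CAP_XMIT_WAIT 0x1000

/* Hypothetical helper: emit the counter line only when the port advertises
 * support. Returns the number of characters written, or 0 if omitted. */
static int demo_dump_xmit_wait(char *buf, size_t len, uint16_t cap_mask,
			       uint32_t xmtwait)
{
	assert(buf != NULL && len > 0);
	if (!(cap_mask & DEMO_CAP_XMIT_WAIT)) {
		buf[0] = '\0'; /* omit the line instead of printing a fake 0 */
		return 0;
	}
	return snprintf(buf, len, "XmtWait:%u\n", (unsigned int)xmtwait);
}

/* Hypothetical helper for the reset path: add the XmitWait select bit
 * (bit 16) only when the user did not pass an explicit counter mask. */
static unsigned int demo_reset_mask(unsigned int mask, int user_gave_mask,
				    int extended, uint16_t cap_mask)
{
	if (!user_gave_mask && !extended && (cap_mask & DEMO_CAP_XMIT_WAIT))
		mask |= 1u << 16;
	return mask;
}
```

Whether omitting the line or printing zero is preferable is exactly the open question in the thread; the sketch only shows that the first option does not require changing a generic dump routine if the check is done by the caller.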
From jean-vincent.ficet at bull.net Fri Jan 16 01:58:31 2009 From: jean-vincent.ficet at bull.net (Vincent Ficet) Date: Fri, 16 Jan 2009 10:58:31 +0100 Subject: [ofa-general] difference between flint and mstflint Message-ID: <49705A47.8050606@bull.net> Hello, When I use mstflint to read the characteristics of an IS4 switch, it fails as follows: [user at host ~ ] mstflint -d "lid-11" q Unable to parse device name lid-11 *** ERROR *** Can not open lid-11: Invalid argument MFE_CR_ERROR However, when I use flint, it works fine: [user at host ~ ] flint -d "lid-11" q Image type: FS2 FW Version: 7.1.0 Device ID: 48438 Chip Revision: A0 Description: Node Sys image GUIDs: 0002c90200404798 0002c9020040479b Board ID: (MT_0C20110003) VSD: PSID: MT_0C20110003 Looking at the flint code (MFT 2.5) and the mstflint code (git head), there is no difference apart from the version ID for the following files: - flint.cpp - mflash.c - mflash.h Digging deeper into the issue, I found that the difference is: - mstflint uses the mopen() function implemented in mtcr.h. This function does not support the 'lid-' syntax - flint uses the mopend() function implemented in a separate library (libmtcr.a) and supports the 'lid-' syntax. This library is built using the files mtcr.c mtcr_ib.c mtcr_i2c.c usbioctl.c located in /usr/mst/src. Why doesn't mstflint use the same library as flint (hence providing better functionality for inband analysis), given that both flint and mstflint already share a fair amount of common code? Is there any plan to push the library and its corresponding source files into the mstflint git repository?
Thanks for your help, Vincent From vlad at lists.openfabrics.org Fri Jan 16 03:14:31 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 16 Jan 2009 03:14:31 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090116-0200 daily build status Message-ID: <20090116111431.63A2FE61112@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From nicolas.morey-chaisemartin at ext.bull.net Fri Jan 16 04:46:06 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Fri, 16 Jan 2009 13:46:06 +0100 Subject: [ofa-general] Question about perfmgr on OpenSM Message-ID: <4970818E.8070602@ext.bull.net> Hi, I'm working on a perf manager plugin for OpenSM. It will have quite advanced features which will do more than simply processing received events and extracting some information from OpenSM. I have one question though: how are perf manager plugins managed when there are multiple OpenSM instances running on the same subnet? - Is only the one on the MASTER SM started? - Are they all started, but only the one on the MASTER SM receives events (counters/traps)? - Do they all start and receive events, but coming from different parts of the subnet? Due to the data I'm going to use, I need to have only one thread doing its job at a time. If there is only one plugin running on the subnet it'll be fine; if there are several, I'll have to monitor the SM status to ensure only one of my threads is active. Regards Nicolas From dledford at redhat.com Fri Jan 16 05:26:17 2009 From: dledford at redhat.com (Doug Ledford) Date: Fri, 16 Jan 2009 08:26:17 -0500 Subject: [ofa-general] mlx4_en and ppc64, is it supposed to work? In-Reply-To: References: <1232034720.32405.1515.camel@firewall.xsintricity.com> Message-ID: <1232112377.32405.1551.camel@firewall.xsintricity.com> On Thu, 2009-01-15 at 21:00 -0800, Roland Dreier wrote: > > It doesn't compile here, so I was wondering if it's even intended to. > > I'm able to build mlx4_en on powerpc here with Linus's latest tree with > no problem. What problems do you see?
Sorry, it turned out to be a red herring. The build failure was on ppc64iseries...I missed the all important iseries in that string and just saw the ppc64. The iseries doesn't support __raw_readl/__raw_writel and so we don't even compile the IB stack on iseries normally, but since the mlx4_en driver was new to the kernel, I had to go into the iseries config file and disable it specifically there or the fact that it was enabled in our generic config meant that it got merged into the iseries config as enabled. That was the source of the problem. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband From arlin.r.davis at intel.com Fri Jan 16 10:40:32 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 16 Jan 2009 10:40:32 -0800 Subject: [ofa-general] [PATCH] dapl common: add debug output during thread create failure Message-ID: <000301c97809$e9f95480$ce97070a@amr.corp.intel.com> dapl common: add debug output during thread create failure. Need strerror messages to help isolate create thread errors.
Signed-off-by: Arlin Davis --- dapl/udapl/linux/dapl_osd.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/dapl/udapl/linux/dapl_osd.c b/dapl/udapl/linux/dapl_osd.c index 26c778c..b828b91 100644 --- a/dapl/udapl/linux/dapl_osd.c +++ b/dapl/udapl/linux/dapl_osd.c @@ -562,6 +562,9 @@ dapl_os_thread_create ( status = pthread_attr_init ( &thread_attr ); if ( status != 0 ) { + dapl_log(DAPL_DBG_TYPE_ERR, + " uDAPL: pthread_attr_init: ERR=%s\n", + strerror(errno)); return DAT_ERROR (DAT_INTERNAL_ERROR,0); } @@ -570,6 +573,9 @@ dapl_os_thread_create ( PTHREAD_CREATE_DETACHED ); if ( status != 0 ) { + dapl_log(DAPL_DBG_TYPE_ERR, + " uDAPL: pthread_attr_setdetachstate: ERR=%s\n", + strerror(errno)); return DAT_ERROR (DAT_INTERNAL_ERROR,0); } @@ -592,6 +598,9 @@ dapl_os_thread_create ( if ( status != 0 ) { + dapl_log(DAPL_DBG_TYPE_ERR, + " uDAPL: pthread_create: ERR=%s\n", + strerror(errno)); return DAT_ERROR (DAT_INSUFFICIENT_RESOURCES,0); } -- 1.5.2.5 From arlin.r.davis at intel.com Fri Jan 16 10:44:06 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Fri, 16 Jan 2009 10:44:06 -0800 Subject: [ofa-general] [PATCH] dapl scm: remove unecessary thread when using direct objects Message-ID: <000401c9780a$6968ed20$ce97070a@amr.corp.intel.com> dapl scm: remove unecessary thread when using direct objects A thread is created for processing events on devices without direct CQ event object support. Since all openfabrics devices support direct events there is no need to start a thread in the provider. Move this under #ifndef CQ_WAIT_OBJECT. 
Signed-off-by: Arlin Davis --- dapl/openib_scm/dapl_ib_util.c | 7 +++++-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index f1f6103..ca7746e 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -238,6 +238,7 @@ found: hca_ptr->ib_trans.mtu = dapl_ib_mtu(dapl_os_get_env_val("DAPL_IB_MTU", SCM_IB_MTU)); +#ifndef CQ_WAIT_OBJECT /* initialize cq_lock */ dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); if (dat_status != DAT_SUCCESS) { @@ -272,7 +273,7 @@ found: ibv_get_device_name(hca_ptr->ib_trans.ib_dev)); goto bail; } - +#endif /* initialize cr_list lock */ dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); if (dat_status != DAT_SUCCESS) { @@ -348,14 +349,16 @@ DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) { dapl_dbg_log (DAPL_DBG_TYPE_UTIL," close_hca: %p\n",hca_ptr); +#ifndef CQ_WAIT_OBJECT dapli_cq_thread_destroy(hca_ptr); + dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); +#endif if (hca_ptr->ib_hca_handle != IB_INVALID_HANDLE) { if (ibv_close_device(hca_ptr->ib_hca_handle)) return(dapl_convert_errno(errno,"ib_close_device")); hca_ptr->ib_hca_handle = IB_INVALID_HANDLE; } - dapl_os_lock_destroy(&hca_ptr->ib_trans.cq_lock); /* destroy cr_thread and lock */ hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; -- 1.5.2.5 From rdreier at cisco.com Fri Jan 16 10:47:02 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 10:47:02 -0800 Subject: [ofa-general] [PATCH] RDMA/nes: update copyright to new legal entity and year In-Reply-To: <20090115153812.GA2212@ctung-MOBL> (Chien Tung's message of "Thu, 15 Jan 2009 09:38:12 -0600") References: <20090115153812.GA2212@ctung-MOBL> Message-ID: thanks, applied. 
From rdreier at cisco.com Fri Jan 16 12:02:47 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 12:02:47 -0800 Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing) In-Reply-To: (Roland Dreier's message of "Tue, 30 Dec 2008 15:00:02 -0800") References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il> Message-ID: OK, I think I'm going to merge my version of the patch. If there really is a performance penalty I'd rather move the mlx transport stuff out-of-line first rather than make the code too unreadable with gotos and duplication etc. - R. From rdreier at cisco.com Fri Jan 16 12:10:30 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 12:10:30 -0800 Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing) In-Reply-To: <200812301920.50336.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 30 Dec 2008 19:20:49 +0200") References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il> Message-ID: > > Also is your code missing a memory barrier between the set_data_seg() > > loop and the lso_wqe assignment? It seems that an out-of-order CPU > > could make the lso_wqe visible before all the data segments are visible, > > so the bug could show up there anyway. > The memory barrier is unnecessary -- since the lso segment (without stamping) > is written first, then all the data segments are written (wmb() called for each segment), > then finally the lso stamping dword. Thus, this will always follow a wmb(). > (no one would send an LSO without any data). By the way, I think this is not quite right... set_data_seg() has a wmb() but before it writes the byte_count field.
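The ordering concern raised in the quoted exchange above (the header word must not become visible before the data segments) can be sketched in user space with C11 atomics. This is only an analogy, under loud assumptions: wmb() is a kernel barrier, the release fence below orders stores as seen by another CPU thread, and it says nothing about how an HCA fetches or prefetches a WQE. The names demo_seg, demo_wqe, demo_post and demo_poll are hypothetical and are not mlx4 code.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_MAX_SEGS 4

/* Hypothetical data segment, loosely modeled on a scatter/gather entry. */
struct demo_seg {
    uint32_t byte_count;
    uint32_t lkey;
    uint64_t addr;
};

/* Hypothetical work-queue entry: storing the header word publishes it. */
struct demo_wqe {
    _Atomic uint32_t hdr;
    struct demo_seg segs[DEMO_MAX_SEGS];
};

/*
 * Fill the data segments last-to-first, then publish the header.
 * The release fence keeps the segment stores from becoming visible
 * after the header store to a thread that reads the header with
 * acquire semantics.
 */
void demo_post(struct demo_wqe *wqe, const struct demo_seg *src,
               int nsegs, uint32_t hdr)
{
    for (int i = nsegs - 1; i >= 0; --i)
        wqe->segs[i] = src[i];

    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&wqe->hdr, hdr, memory_order_relaxed);
}

/* Consumer side: returns the header word, or 0 if nothing is published. */
uint32_t demo_poll(struct demo_wqe *wqe)
{
    return atomic_load_explicit(&wqe->hdr, memory_order_acquire);
}
```

Dropping the fence in demo_post() would let a weakly ordered CPU expose the header before the segments, which is the user-space counterpart of the hazard being debated.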
As I see it, a CPU could make lso_hdr_sz visible before byte_count if we just do for (i = wr->num_sge - 1; i >= 0; --i, --dseg) set_data_seg(dseg, wr->sg_list + i); *lso_wqe = lso_hdr_sz; and if lso_hdr_sz overwrites the stamping in the cacheline that the byte_count field is in, then the HCA could prefetch that cacheline with the wrong byte_count value in it and mess things up. So I'll merge the patch with the wmb() there, and you can convince me to get rid of it later if my reasoning is wrong. - R. From vst at vlnb.net Fri Jan 16 12:37:40 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Fri, 16 Jan 2009 23:37:40 +0300 Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <496CDFE0.2030601@harr.org> References: <48E386F6.5040502@fusionio.com> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> Message-ID: <4970F014.2030101@vlnb.net> Cameron Harr, on 01/13/2009 09:39 PM wrote: > Vladislav Bolkhovitin wrote: >> Cameron Harr, on 01/13/2009 07:42 PM wrote: >>> Vlad, you've got a good eye. 
Unfortunately, those results can't >>> really be compared because I believe the previous results were >>> intentionally run in a worse-case performance scenario. However I did >>> run no-affinity runs before the affinity runs and would say >>> performance increase is variable and somewhat inconclusive: >>> >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >>> iops=76724.08 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >>> iops=91318.28 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >>> iops=60374.94 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >>> iops=91618.18 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >>> iops=63076.21 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >>> iops=92251.24 >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >>> iops=50539.96 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >>> iops=57884.80 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >>> iops=54502.85 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >>> iops=93230.44 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >>> iops=55941.89 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >>> iops=94480.92 >> For srptthread=0 case there is a consistent quite big increase. > For srptthread=0, there is indeed a difference between no-affinity and > affinity. However, here I meant there's not much of a difference between > srptthread=[01] on a number of the points - and overall on this > particular run it still seems like having the srp thread enabled still > gives better performance. We made a progress, but the affinity you set wasn't optimal, hence the results. We need to find out the optimal affinity layout. With it results for both srptthread=[01] should be at least the same. >>> My CPU config on the target (where I did the affinity) is 2 quad-core >>> Xeon E5440 @ 2.83GHz. 
I didn't have my script configured to dump top >>> and vmstat, so here's data from a rerun (and I have attached >>> requested info). I'm not sure what is accounting for the spike at the >>> beginning, but it seems consistent. >>> >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >>> iops=104699.43 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >>> iops=133928.98 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >>> iops=82736.73 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >>> iops=82221.42 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >>> iops=70203.53 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >>> iops=85628.45 >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >>> iops=75646.90 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >>> iops=87124.32 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >>> iops=74545.84 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >>> iops=88348.71 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >>> iops=71837.15 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >>> iops=84387.22 >> Why there is such a huge difference with the results you sent in the >> previous e-mail? For instance, for case drives=1 scst_threads=1 >> srptthread=1 104K vs 74K. What did you changed? > That's what I meant when I wasn't sure what was causing the spike at the > beginning. I didn't really do anything different other than rebooting. > One factor may be due to the supply of free blocks in the flash media. > As the number of free blocks decreases, garbage collection can increase > to free up previously-used blocks. However, there appears to be some > variance in the numbers that I can't account for. I just did another run > after forcing free blocks to a critical level and got 64K IOPs. >> What is content of /proc/interrupts after the tests? 
> You can see the HCA being interrupt-intensive, as well as iodrive, which > I'm surprised to see because I locked its worker threads to cpus 6 and 7. Try the following variants: 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 to CPU5, fct2-worker to CPU7 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for other processes. 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to all CPUs, except CPU0 and CPU4, no affinity for other processes. Or other similar variants you'd like (even CPUs relate to physical CPU0, odd CPUs relate to physical CPU1). For instance, you can try to affine IRQs 169 and 177 to CPU1. There is no point in running with srptthread=1, since it just produces a baseline with no affinity at all. Please do each run several times and write down the average result across runs and the approximate variation between them in %. Otherwise we can't draw any reliable conclusions. Thanks, Vlad From rdreier at cisco.com Fri Jan 16 13:43:08 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 13:43:08 -0800 Subject: [ofa-general] Re: [PATCH 4/4] ipoib: do not print error messages for multicast join retries In-Reply-To: <4946A322.9020507@Voltaire.COM> (Yossi Etigin's message of "Mon, 15 Dec 2008 20:34:10 +0200") References: <49469C1E.8010307@Voltaire.COM> <4946A322.9020507@Voltaire.COM> Message-ID: thanks, applied.
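For the IRQ affinity variants Vlad lists above: on Linux, IRQ affinity is normally set by writing a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity. The sketch below only builds that mask string for a single CPU. The helper name demo_irq_mask is hypothetical, it handles CPUs 0 to 31 only, and it deliberately does not write to /proc.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the hex bitmask that selects a single CPU,
 * in the form accepted by /proc/irq/<N>/smp_affinity (CPU4 gives "10").
 * Returns 0 on success, -1 if cpu is out of range or buf is too small.
 */
int demo_irq_mask(unsigned int cpu, char *buf, size_t len)
{
    int n;

    if (cpu >= 32)
        return -1;
    n = snprintf(buf, len, "%x", 1u << cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Under these assumptions, variant 2 would use mask 1 (CPU0) for IRQ 82 and mask 10 (CPU4) for IRQs 169 and 177; pinning the scsi_tgt and fct worker threads is a separate step (for example with taskset) and is not covered here.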
From rdreier at cisco.com Fri Jan 16 14:36:27 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 14:36:27 -0800 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: use consistent type In-Reply-To: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Wed, 31 Dec 2008 14:14:53 +1100") References: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> Message-ID: I'm going to apply this for 2.6.29, since the change to the u64 type went upstream, so ehca spews a bunch of warnings now. I'll also add the following patch to fix all the printk format warnings: IB/ehca: Fix printk format warnings from u64 type change Commit fe333321 ("powerpc: Change u64/s64 to a long long integer type") changed u64 from unsigned long to unsigned long long, which means that printk formats for printing u64 values should use "ll" instead of "l" to avoid warnings. Fix all the places affected by this in ehca. Signed-off-by: Roland Dreier --- drivers/infiniband/hw/ehca/ehca_cq.c | 16 ++-- drivers/infiniband/hw/ehca/ehca_hca.c | 2 +- drivers/infiniband/hw/ehca/ehca_irq.c | 18 ++-- drivers/infiniband/hw/ehca/ehca_main.c | 6 +- drivers/infiniband/hw/ehca/ehca_mcast.c | 4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 144 +++++++++++++++--------------- drivers/infiniband/hw/ehca/ehca_qp.c | 32 ++++---- drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +- drivers/infiniband/hw/ehca/ehca_sqp.c | 2 +- drivers/infiniband/hw/ehca/ehca_tools.h | 2 +- drivers/infiniband/hw/ehca/ehca_uverbs.c | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 30 +++--- 12 files changed, 130 insertions(+), 130 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_cq.c b/drivers/infiniband/hw/ehca/ehca_cq.c index 2f4c28a..97e4b23 100644 --- a/drivers/infiniband/hw/ehca/ehca_cq.c +++ b/drivers/infiniband/hw/ehca/ehca_cq.c @@ -196,7 +196,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret != H_SUCCESS) { ehca_err(device, "hipz_h_alloc_resource_cq() failed " - 
"h_ret=%li device=%p", h_ret, device); + "h_ret=%lli device=%p", h_ret, device); cq = ERR_PTR(ehca2ib_return_code(h_ret)); goto create_cq_exit2; } @@ -232,7 +232,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if (h_ret < H_SUCCESS) { ehca_err(device, "hipz_h_register_rpage_cq() failed " - "ehca_cq=%p cq_num=%x h_ret=%li counter=%i " + "ehca_cq=%p cq_num=%x h_ret=%lli counter=%i " "act_pages=%i", my_cq, my_cq->cq_number, h_ret, counter, param.act_pages); cq = ERR_PTR(-EINVAL); @@ -244,7 +244,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, if ((h_ret != H_SUCCESS) || vpage) { ehca_err(device, "Registration of pages not " "complete ehca_cq=%p cq_num=%x " - "h_ret=%li", my_cq, my_cq->cq_number, + "h_ret=%lli", my_cq, my_cq->cq_number, h_ret); cq = ERR_PTR(-EAGAIN); goto create_cq_exit4; @@ -252,7 +252,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(device, "Registration of page failed " - "ehca_cq=%p cq_num=%x h_ret=%li " + "ehca_cq=%p cq_num=%x h_ret=%lli " "counter=%i act_pages=%i", my_cq, my_cq->cq_number, h_ret, counter, param.act_pages); @@ -266,7 +266,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device, int cqe, int comp_vector, gal = my_cq->galpas.kernel; cqx_fec = hipz_galpa_load(gal, CQTEMM_OFFSET(cqx_fec)); - ehca_dbg(device, "ehca_cq=%p cq_num=%x CQX_FEC=%lx", + ehca_dbg(device, "ehca_cq=%p cq_num=%x CQX_FEC=%llx", my_cq, my_cq->cq_number, cqx_fec); my_cq->ib_cq.cqe = my_cq->nr_of_entries = @@ -307,7 +307,7 @@ create_cq_exit3: h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 1); if (h_ret != H_SUCCESS) ehca_err(device, "hipz_h_destroy_cq() failed ehca_cq=%p " - "cq_num=%x h_ret=%li", my_cq, my_cq->cq_number, h_ret); + "cq_num=%x h_ret=%lli", my_cq, my_cq->cq_number, h_ret); create_cq_exit2: write_lock_irqsave(&ehca_cq_idr_lock, flags); @@ -355,7 +355,7 @@ int ehca_destroy_cq(struct ib_cq 
*cq) h_ret = hipz_h_destroy_cq(adapter_handle, my_cq, 0); if (h_ret == H_R_STATE) { /* cq in err: read err data and destroy it forcibly */ - ehca_dbg(device, "ehca_cq=%p cq_num=%x ressource=%lx in err " + ehca_dbg(device, "ehca_cq=%p cq_num=%x resource=%llx in err " "state. Try to delete it forcibly.", my_cq, cq_num, my_cq->ipz_cq_handle.handle); ehca_error_data(shca, my_cq, my_cq->ipz_cq_handle.handle); @@ -365,7 +365,7 @@ int ehca_destroy_cq(struct ib_cq *cq) cq_num); } if (h_ret != H_SUCCESS) { - ehca_err(device, "hipz_h_destroy_cq() failed h_ret=%li " + ehca_err(device, "hipz_h_destroy_cq() failed h_ret=%lli " "ehca_cq=%p cq_num=%x", h_ret, my_cq, cq_num); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 4628822..9209c53 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -393,7 +393,7 @@ int ehca_modify_port(struct ib_device *ibdev, hret = hipz_h_modify_port(shca->ipz_hca_handle, port, cap, props->init_type, port_modify_mask); if (hret != H_SUCCESS) { - ehca_err(&shca->ib_device, "Modify port failed h_ret=%li", + ehca_err(&shca->ib_device, "Modify port failed h_ret=%lli", hret); ret = -EINVAL; } diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c b/drivers/infiniband/hw/ehca/ehca_irq.c index 3128a50..99bcbd7 100644 --- a/drivers/infiniband/hw/ehca/ehca_irq.c +++ b/drivers/infiniband/hw/ehca/ehca_irq.c @@ -99,7 +99,7 @@ static void print_error_data(struct ehca_shca *shca, void *data, return; ehca_err(&shca->ib_device, - "QP 0x%x (resource=%lx) has errors.", + "QP 0x%x (resource=%llx) has errors.", qp->ib_qp.qp_num, resource); break; } @@ -108,21 +108,21 @@ static void print_error_data(struct ehca_shca *shca, void *data, struct ehca_cq *cq = (struct ehca_cq *)data; ehca_err(&shca->ib_device, - "CQ 0x%x (resource=%lx) has errors.", + "CQ 0x%x (resource=%llx) has errors.", cq->cq_number, resource); break; } default: 
ehca_err(&shca->ib_device, - "Unknown error type: %lx on %s.", + "Unknown error type: %llx on %s.", type, shca->ib_device.name); break; } - ehca_err(&shca->ib_device, "Error data is available: %lx.", resource); + ehca_err(&shca->ib_device, "Error data is available: %llx.", resource); ehca_err(&shca->ib_device, "EHCA ----- error data begin " "---------------------------------------------------"); - ehca_dmp(rblock, length, "resource=%lx", resource); + ehca_dmp(rblock, length, "resource=%llx", resource); ehca_err(&shca->ib_device, "EHCA ----- error data end " "----------------------------------------------------"); @@ -152,7 +152,7 @@ int ehca_error_data(struct ehca_shca *shca, void *data, if (ret == H_R_STATE) ehca_err(&shca->ib_device, - "No error data is available: %lx.", resource); + "No error data is available: %llx.", resource); else if (ret == H_SUCCESS) { int length; @@ -164,7 +164,7 @@ int ehca_error_data(struct ehca_shca *shca, void *data, print_error_data(shca, data, rblock, length); } else ehca_err(&shca->ib_device, - "Error data could not be fetched: %lx", resource); + "Error data could not be fetched: %llx", resource); ehca_free_fw_ctrlblock(rblock); @@ -514,7 +514,7 @@ static inline void process_eqe(struct ehca_shca *shca, struct ehca_eqe *eqe) struct ehca_cq *cq; eqe_value = eqe->entry; - ehca_dbg(&shca->ib_device, "eqe_value=%lx", eqe_value); + ehca_dbg(&shca->ib_device, "eqe_value=%llx", eqe_value); if (EHCA_BMASK_GET(EQE_COMPLETION_EVENT, eqe_value)) { ehca_dbg(&shca->ib_device, "Got completion event"); token = EHCA_BMASK_GET(EQE_CQ_TOKEN, eqe_value); @@ -603,7 +603,7 @@ void ehca_process_eq(struct ehca_shca *shca, int is_irq) ret = hipz_h_eoi(eq->ist); if (ret != H_SUCCESS) ehca_err(&shca->ib_device, - "bad return code EOI -rc = %ld\n", ret); + "bad return code EOI -rc = %lld\n", ret); ehca_dbg(&shca->ib_device, "deadman found %x eqe", eqe_cnt); } if (unlikely(eqe_cnt == EHCA_EQE_CACHE_SIZE)) diff --git a/drivers/infiniband/hw/ehca/ehca_main.c 
b/drivers/infiniband/hw/ehca/ehca_main.c index c7b8a50..368311c 100644 --- a/drivers/infiniband/hw/ehca/ehca_main.c +++ b/drivers/infiniband/hw/ehca/ehca_main.c @@ -304,7 +304,7 @@ static int ehca_sense_attributes(struct ehca_shca *shca) h_ret = hipz_h_query_hca(shca->ipz_hca_handle, rblock); if (h_ret != H_SUCCESS) { - ehca_gen_err("Cannot query device properties. h_ret=%li", + ehca_gen_err("Cannot query device properties. h_ret=%lli", h_ret); ret = -EPERM; goto sense_attributes1; @@ -391,7 +391,7 @@ static int ehca_sense_attributes(struct ehca_shca *shca) port = (struct hipz_query_port *)rblock; h_ret = hipz_h_query_port(shca->ipz_hca_handle, 1, port); if (h_ret != H_SUCCESS) { - ehca_gen_err("Cannot query port properties. h_ret=%li", + ehca_gen_err("Cannot query port properties. h_ret=%lli", h_ret); ret = -EPERM; goto sense_attributes1; @@ -682,7 +682,7 @@ static ssize_t ehca_show_adapter_handle(struct device *dev, { struct ehca_shca *shca = dev->driver_data; - return sprintf(buf, "%lx\n", shca->ipz_hca_handle.handle); + return sprintf(buf, "%llx\n", shca->ipz_hca_handle.handle); } static DEVICE_ATTR(adapter_handle, S_IRUGO, ehca_show_adapter_handle, NULL); diff --git a/drivers/infiniband/hw/ehca/ehca_mcast.c b/drivers/infiniband/hw/ehca/ehca_mcast.c index e3ef026..120aedf 100644 --- a/drivers/infiniband/hw/ehca/ehca_mcast.c +++ b/drivers/infiniband/hw/ehca/ehca_mcast.c @@ -88,7 +88,7 @@ int ehca_attach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) if (h_ret != H_SUCCESS) ehca_err(ibqp->device, "ehca_qp=%p qp_num=%x hipz_h_attach_mcqp() failed " - "h_ret=%li", my_qp, ibqp->qp_num, h_ret); + "h_ret=%lli", my_qp, ibqp->qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -125,7 +125,7 @@ int ehca_detach_mcast(struct ib_qp *ibqp, union ib_gid *gid, u16 lid) if (h_ret != H_SUCCESS) ehca_err(ibqp->device, "ehca_qp=%p qp_num=%x hipz_h_detach_mcqp() failed " - "h_ret=%li", my_qp, ibqp->qp_num, h_ret); + "h_ret=%lli", my_qp, ibqp->qp_num, h_ret); return 
ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index f974367..72f83f7 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -204,7 +204,7 @@ struct ib_mr *ehca_reg_phys_mr(struct ib_pd *pd, } if ((size == 0) || (((u64)iova_start + size) < (u64)iova_start)) { - ehca_err(pd->device, "bad input values: size=%lx iova_start=%p", + ehca_err(pd->device, "bad input values: size=%llx iova_start=%p", size, iova_start); ib_mr = ERR_PTR(-EINVAL); goto reg_phys_mr_exit0; @@ -309,8 +309,8 @@ struct ib_mr *ehca_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, } if (length == 0 || virt + length < virt) { - ehca_err(pd->device, "bad input values: length=%lx " - "virt_base=%lx", length, virt); + ehca_err(pd->device, "bad input values: length=%llx " + "virt_base=%llx", length, virt); ib_mr = ERR_PTR(-EINVAL); goto reg_user_mr_exit0; } @@ -373,7 +373,7 @@ reg_user_mr_fallback: &e_mr->ib.ib_mr.rkey); if (ret == -EINVAL && pginfo.hwpage_size > PAGE_SIZE) { ehca_warn(pd->device, "failed to register mr " - "with hwpage_size=%lx", hwpage_size); + "with hwpage_size=%llx", hwpage_size); ehca_info(pd->device, "try to register mr with " "kpage_size=%lx", PAGE_SIZE); /* @@ -509,7 +509,7 @@ int ehca_rereg_phys_mr(struct ib_mr *mr, goto rereg_phys_mr_exit1; if ((new_size == 0) || (((u64)iova_start + new_size) < (u64)iova_start)) { - ehca_err(mr->device, "bad input values: new_size=%lx " + ehca_err(mr->device, "bad input values: new_size=%llx " "iova_start=%p", new_size, iova_start); ret = -EINVAL; goto rereg_phys_mr_exit1; @@ -580,8 +580,8 @@ int ehca_query_mr(struct ib_mr *mr, struct ib_mr_attr *mr_attr) h_ret = hipz_h_query_mr(shca->ipz_hca_handle, e_mr, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(mr->device, "hipz_mr_query failed, h_ret=%li mr=%p " - "hca_hndl=%lx mr_hndl=%lx lkey=%x", + ehca_err(mr->device, "hipz_mr_query failed, h_ret=%lli mr=%p " + "hca_hndl=%llx 
mr_hndl=%llx lkey=%x", h_ret, mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); ret = ehca2ib_return_code(h_ret); @@ -630,8 +630,8 @@ int ehca_dereg_mr(struct ib_mr *mr) /* TODO: BUSY: MR still has bound window(s) */ h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { - ehca_err(mr->device, "hipz_free_mr failed, h_ret=%li shca=%p " - "e_mr=%p hca_hndl=%lx mr_hndl=%lx mr->lkey=%x", + ehca_err(mr->device, "hipz_free_mr failed, h_ret=%lli shca=%p " + "e_mr=%p hca_hndl=%llx mr_hndl=%llx mr->lkey=%x", h_ret, shca, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, mr->lkey); ret = ehca2ib_return_code(h_ret); @@ -671,8 +671,8 @@ struct ib_mw *ehca_alloc_mw(struct ib_pd *pd) h_ret = hipz_h_alloc_resource_mw(shca->ipz_hca_handle, e_mw, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%li " - "shca=%p hca_hndl=%lx mw=%p", + ehca_err(pd->device, "hipz_mw_allocate failed, h_ret=%lli " + "shca=%p hca_hndl=%llx mw=%p", h_ret, shca, shca->ipz_hca_handle.handle, e_mw); ib_mw = ERR_PTR(ehca2ib_return_code(h_ret)); goto alloc_mw_exit1; @@ -713,8 +713,8 @@ int ehca_dealloc_mw(struct ib_mw *mw) h_ret = hipz_h_free_resource_mw(shca->ipz_hca_handle, e_mw); if (h_ret != H_SUCCESS) { - ehca_err(mw->device, "hipz_free_mw failed, h_ret=%li shca=%p " - "mw=%p rkey=%x hca_hndl=%lx mw_hndl=%lx", + ehca_err(mw->device, "hipz_free_mw failed, h_ret=%lli shca=%p " + "mw=%p rkey=%x hca_hndl=%llx mw_hndl=%llx", h_ret, shca, mw, mw->rkey, shca->ipz_hca_handle.handle, e_mw->ipz_mw_handle.handle); return ehca2ib_return_code(h_ret); @@ -840,7 +840,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, goto map_phys_fmr_exit0; if (iova % e_fmr->fmr_page_size) { /* only whole-numbered pages */ - ehca_err(fmr->device, "bad iova, iova=%lx fmr_page_size=%x", + ehca_err(fmr->device, "bad iova, iova=%llx fmr_page_size=%x", iova, e_fmr->fmr_page_size); ret = -EINVAL; goto map_phys_fmr_exit0; @@ 
-878,7 +878,7 @@ int ehca_map_phys_fmr(struct ib_fmr *fmr, map_phys_fmr_exit0: if (ret) ehca_err(fmr->device, "ret=%i fmr=%p page_list=%p list_len=%x " - "iova=%lx", ret, fmr, page_list, list_len, iova); + "iova=%llx", ret, fmr, page_list, list_len, iova); return ret; } /* end ehca_map_phys_fmr() */ @@ -964,8 +964,8 @@ int ehca_dealloc_fmr(struct ib_fmr *fmr) h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); if (h_ret != H_SUCCESS) { - ehca_err(fmr->device, "hipz_free_mr failed, h_ret=%li e_fmr=%p " - "hca_hndl=%lx fmr_hndl=%lx fmr->lkey=%x", + ehca_err(fmr->device, "hipz_free_mr failed, h_ret=%lli e_fmr=%p " + "hca_hndl=%llx fmr_hndl=%llx fmr->lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, fmr->lkey); ret = ehca2ib_return_code(h_ret); @@ -1007,8 +1007,8 @@ int ehca_reg_mr(struct ehca_shca *shca, (u64)iova_start, size, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%li " - "hca_hndl=%lx", h_ret, shca->ipz_hca_handle.handle); + ehca_err(&shca->ib_device, "hipz_alloc_mr failed, h_ret=%lli " + "hca_hndl=%llx", h_ret, shca->ipz_hca_handle.handle); ret = ehca2ib_return_code(h_ret); goto ehca_reg_mr_exit0; } @@ -1033,9 +1033,9 @@ int ehca_reg_mr(struct ehca_shca *shca, ehca_reg_mr_exit1: h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "h_ret=%li shca=%p e_mr=%p " - "iova_start=%p size=%lx acl=%x e_pd=%p lkey=%x " - "pginfo=%p num_kpages=%lx num_hwpages=%lx ret=%i", + ehca_err(&shca->ib_device, "h_ret=%lli shca=%p e_mr=%p " + "iova_start=%p size=%llx acl=%x e_pd=%p lkey=%x " + "pginfo=%p num_kpages=%llx num_hwpages=%llx ret=%i", h_ret, shca, e_mr, iova_start, size, acl, e_pd, hipzout.lkey, pginfo, pginfo->num_kpages, pginfo->num_hwpages, ret); @@ -1045,8 +1045,8 @@ ehca_reg_mr_exit1: ehca_reg_mr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p " - "iova_start=%p size=%lx 
acl=%x e_pd=%p pginfo=%p " - "num_kpages=%lx num_hwpages=%lx", + "iova_start=%p size=%llx acl=%x e_pd=%p pginfo=%p " + "num_kpages=%llx num_hwpages=%llx", ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; @@ -1116,8 +1116,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, */ if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "last " - "hipz_reg_rpage_mr failed, h_ret=%li " - "e_mr=%p i=%x hca_hndl=%lx mr_hndl=%lx" + "hipz_reg_rpage_mr failed, h_ret=%lli " + "e_mr=%p i=%x hca_hndl=%llx mr_hndl=%llx" " lkey=%x", h_ret, e_mr, i, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, @@ -1128,8 +1128,8 @@ int ehca_reg_mr_rpages(struct ehca_shca *shca, ret = 0; } else if (h_ret != H_PAGE_REGISTERED) { ehca_err(&shca->ib_device, "hipz_reg_rpage_mr failed, " - "h_ret=%li e_mr=%p i=%x lkey=%x hca_hndl=%lx " - "mr_hndl=%lx", h_ret, e_mr, i, + "h_ret=%lli e_mr=%p i=%x lkey=%x hca_hndl=%llx " + "mr_hndl=%llx", h_ret, e_mr, i, e_mr->ib.ib_mr.lkey, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle); @@ -1145,7 +1145,7 @@ ehca_reg_mr_rpages_exit1: ehca_reg_mr_rpages_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p pginfo=%p " - "num_kpages=%lx num_hwpages=%lx", ret, shca, e_mr, + "num_kpages=%llx num_hwpages=%llx", ret, shca, e_mr, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; } /* end ehca_reg_mr_rpages() */ @@ -1184,7 +1184,7 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, ret = ehca_set_pagebuf(pginfo, pginfo->num_hwpages, kpage); if (ret) { ehca_err(&shca->ib_device, "set pagebuf failed, e_mr=%p " - "pginfo=%p type=%x num_kpages=%lx num_hwpages=%lx " + "pginfo=%p type=%x num_kpages=%llx num_hwpages=%llx " "kpage=%p", e_mr, pginfo, pginfo->type, pginfo->num_kpages, pginfo->num_hwpages, kpage); goto ehca_rereg_mr_rereg1_exit1; @@ -1205,13 +1205,13 @@ inline int ehca_rereg_mr_rereg1(struct ehca_shca *shca, * (MW bound or MR is shared) */ ehca_warn(&shca->ib_device, 
"hipz_h_reregister_pmr failed " - "(Rereg1), h_ret=%li e_mr=%p", h_ret, e_mr); + "(Rereg1), h_ret=%lli e_mr=%p", h_ret, e_mr); *pginfo = pginfo_save; ret = -EAGAIN; } else if ((u64 *)hipzout.vaddr != iova_start) { ehca_err(&shca->ib_device, "PHYP changed iova_start in " - "rereg_pmr, iova_start=%p iova_start_out=%lx e_mr=%p " - "mr_handle=%lx lkey=%x lkey_out=%x", iova_start, + "rereg_pmr, iova_start=%p iova_start_out=%llx e_mr=%p " + "mr_handle=%llx lkey=%x lkey_out=%x", iova_start, hipzout.vaddr, e_mr, e_mr->ipz_mr_handle.handle, e_mr->ib.ib_mr.lkey, hipzout.lkey); ret = -EFAULT; @@ -1235,7 +1235,7 @@ ehca_rereg_mr_rereg1_exit1: ehca_rereg_mr_rereg1_exit0: if ( ret && (ret != -EAGAIN) ) ehca_err(&shca->ib_device, "ret=%i lkey=%x rkey=%x " - "pginfo=%p num_kpages=%lx num_hwpages=%lx", + "pginfo=%p num_kpages=%llx num_hwpages=%llx", ret, *lkey, *rkey, pginfo, pginfo->num_kpages, pginfo->num_hwpages); return ret; @@ -1263,7 +1263,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, (e_mr->num_hwpages > MAX_RPAGES) || (pginfo->num_hwpages > e_mr->num_hwpages)) { ehca_dbg(&shca->ib_device, "Rereg3 case, " - "pginfo->num_hwpages=%lx e_mr->num_hwpages=%x", + "pginfo->num_hwpages=%llx e_mr->num_hwpages=%x", pginfo->num_hwpages, e_mr->num_hwpages); rereg_1_hcall = 0; rereg_3_hcall = 1; @@ -1295,7 +1295,7 @@ int ehca_rereg_mr(struct ehca_shca *shca, h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_mr); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_free_mr failed, " - "h_ret=%li e_mr=%p hca_hndl=%lx mr_hndl=%lx " + "h_ret=%lli e_mr=%p hca_hndl=%llx mr_hndl=%llx " "mr->lkey=%x", h_ret, e_mr, shca->ipz_hca_handle.handle, e_mr->ipz_mr_handle.handle, @@ -1328,8 +1328,8 @@ int ehca_rereg_mr(struct ehca_shca *shca, ehca_rereg_mr_exit0: if (ret) ehca_err(&shca->ib_device, "ret=%i shca=%p e_mr=%p " - "iova_start=%p size=%lx acl=%x e_pd=%p pginfo=%p " - "num_kpages=%lx lkey=%x rkey=%x rereg_1_hcall=%x " + "iova_start=%p size=%llx acl=%x e_pd=%p pginfo=%p " + 
"num_kpages=%llx lkey=%x rkey=%x rereg_1_hcall=%x " "rereg_3_hcall=%x", ret, shca, e_mr, iova_start, size, acl, e_pd, pginfo, pginfo->num_kpages, *lkey, *rkey, rereg_1_hcall, rereg_3_hcall); @@ -1371,8 +1371,8 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, * FMRs are not shared and no MW bound to FMRs */ ehca_err(&shca->ib_device, "hipz_reregister_pmr failed " - "(Rereg1), h_ret=%li e_fmr=%p hca_hndl=%lx " - "mr_hndl=%lx lkey=%x lkey_out=%x", + "(Rereg1), h_ret=%lli e_fmr=%p hca_hndl=%llx " + "mr_hndl=%llx lkey=%x lkey_out=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, e_fmr->ib.ib_fmr.lkey, hipzout.lkey); @@ -1383,7 +1383,7 @@ int ehca_unmap_one_fmr(struct ehca_shca *shca, h_ret = hipz_h_free_resource_mr(shca->ipz_hca_handle, e_fmr); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_free_mr failed, " - "h_ret=%li e_fmr=%p hca_hndl=%lx mr_hndl=%lx " + "h_ret=%lli e_fmr=%p hca_hndl=%llx mr_hndl=%llx " "lkey=%x", h_ret, e_fmr, shca->ipz_hca_handle.handle, e_fmr->ipz_mr_handle.handle, @@ -1447,9 +1447,9 @@ int ehca_reg_smr(struct ehca_shca *shca, (u64)iova_start, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%li " + ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%lli " "shca=%p e_origmr=%p e_newmr=%p iova_start=%p acl=%x " - "e_pd=%p hca_hndl=%lx mr_hndl=%lx lkey=%x", + "e_pd=%p hca_hndl=%llx mr_hndl=%llx lkey=%x", h_ret, shca, e_origmr, e_newmr, iova_start, acl, e_pd, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, @@ -1527,7 +1527,7 @@ int ehca_reg_internal_maxmr( &e_mr->ib.ib_mr.rkey); if (ret) { ehca_err(&shca->ib_device, "reg of internal max MR failed, " - "e_mr=%p iova_start=%p size_maxmr=%lx num_kpages=%x " + "e_mr=%p iova_start=%p size_maxmr=%llx num_kpages=%x " "num_hwpages=%x", e_mr, iova_start, size_maxmr, num_kpages, num_hwpages); goto ehca_reg_internal_maxmr_exit1; @@ -1573,8 +1573,8 @@ int ehca_reg_maxmr(struct ehca_shca 
*shca, (u64)iova_start, hipz_acl, e_pd->fw_pd, &hipzout); if (h_ret != H_SUCCESS) { - ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%li " - "e_origmr=%p hca_hndl=%lx mr_hndl=%lx lkey=%x", + ehca_err(&shca->ib_device, "hipz_reg_smr failed, h_ret=%lli " + "e_origmr=%p hca_hndl=%llx mr_hndl=%llx lkey=%x", h_ret, e_origmr, shca->ipz_hca_handle.handle, e_origmr->ipz_mr_handle.handle, e_origmr->ib.ib_mr.lkey); @@ -1651,28 +1651,28 @@ int ehca_mr_chk_buf_and_calc_size(struct ib_phys_buf *phys_buf_array, /* check first buffer */ if (((u64)iova_start & ~PAGE_MASK) != (pbuf->addr & ~PAGE_MASK)) { ehca_gen_err("iova_start/addr mismatch, iova_start=%p " - "pbuf->addr=%lx pbuf->size=%lx", + "pbuf->addr=%llx pbuf->size=%llx", iova_start, pbuf->addr, pbuf->size); return -EINVAL; } if (((pbuf->addr + pbuf->size) % PAGE_SIZE) && (num_phys_buf > 1)) { - ehca_gen_err("addr/size mismatch in 1st buf, pbuf->addr=%lx " - "pbuf->size=%lx", pbuf->addr, pbuf->size); + ehca_gen_err("addr/size mismatch in 1st buf, pbuf->addr=%llx " + "pbuf->size=%llx", pbuf->addr, pbuf->size); return -EINVAL; } for (i = 0; i < num_phys_buf; i++) { if ((i > 0) && (pbuf->addr % PAGE_SIZE)) { - ehca_gen_err("bad address, i=%x pbuf->addr=%lx " - "pbuf->size=%lx", + ehca_gen_err("bad address, i=%x pbuf->addr=%llx " + "pbuf->size=%llx", i, pbuf->addr, pbuf->size); return -EINVAL; } if (((i > 0) && /* not 1st */ (i < (num_phys_buf - 1)) && /* not last */ (pbuf->size % PAGE_SIZE)) || (pbuf->size == 0)) { - ehca_gen_err("bad size, i=%x pbuf->size=%lx", + ehca_gen_err("bad size, i=%x pbuf->size=%llx", i, pbuf->size); return -EINVAL; } @@ -1705,7 +1705,7 @@ int ehca_fmr_check_page_list(struct ehca_mr *e_fmr, page = page_list; for (i = 0; i < list_len; i++) { if (*page % e_fmr->fmr_page_size) { - ehca_gen_err("bad page, i=%x *page=%lx page=%p fmr=%p " + ehca_gen_err("bad page, i=%x *page=%llx page=%p fmr=%p " "fmr_page_size=%x", i, *page, page, e_fmr, e_fmr->fmr_page_size); return -EINVAL; @@ -1743,9 +1743,9 @@ 
static int ehca_set_pagebuf_user1(struct ehca_mr_pginfo *pginfo, (pginfo->next_hwpage * pginfo->hwpage_size)); if ( !(*kpage) ) { - ehca_gen_err("pgaddr=%lx " - "chunk->page_list[i]=%lx " - "i=%x next_hwpage=%lx", + ehca_gen_err("pgaddr=%llx " + "chunk->page_list[i]=%llx " + "i=%x next_hwpage=%llx", pgaddr, (u64)sg_dma_address( &chunk->page_list[i]), i, pginfo->next_hwpage); @@ -1795,11 +1795,11 @@ static int ehca_check_kpages_per_ate(struct scatterlist *page_list, for (t = start_idx; t <= end_idx; t++) { u64 pgaddr = page_to_pfn(sg_page(&page_list[t])) << PAGE_SHIFT; if (ehca_debug_level >= 3) - ehca_gen_dbg("chunk_page=%lx value=%016lx", pgaddr, + ehca_gen_dbg("chunk_page=%llx value=%016llx", pgaddr, *(u64 *)abs_to_virt(phys_to_abs(pgaddr))); if (pgaddr - PAGE_SIZE != *prev_pgaddr) { - ehca_gen_err("uncontiguous page found pgaddr=%lx " - "prev_pgaddr=%lx page_list_i=%x", + ehca_gen_err("uncontiguous page found pgaddr=%llx " + "prev_pgaddr=%llx page_list_i=%x", pgaddr, *prev_pgaddr, t); return -EINVAL; } @@ -1833,7 +1833,7 @@ static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, << PAGE_SHIFT ); *kpage = phys_to_abs(pgaddr); if ( !(*kpage) ) { - ehca_gen_err("pgaddr=%lx i=%x", + ehca_gen_err("pgaddr=%llx i=%x", pgaddr, i); ret = -EFAULT; return ret; @@ -1846,8 +1846,8 @@ static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, if (pginfo->hwpage_cnt) { ehca_gen_err( "invalid alignment " - "pgaddr=%lx i=%x " - "mr_pgsize=%lx", + "pgaddr=%llx i=%x " + "mr_pgsize=%llx", pgaddr, i, pginfo->hwpage_size); ret = -EFAULT; @@ -1866,8 +1866,8 @@ static int ehca_set_pagebuf_user2(struct ehca_mr_pginfo *pginfo, if (ehca_debug_level >= 3) { u64 val = *(u64 *)abs_to_virt( phys_to_abs(pgaddr)); - ehca_gen_dbg("kpage=%lx chunk_page=%lx " - "value=%016lx", + ehca_gen_dbg("kpage=%llx chunk_page=%llx " + "value=%016llx", *kpage, pgaddr, val); } prev_pgaddr = pgaddr; @@ -1944,9 +1944,9 @@ static int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, if 
((pginfo->kpage_cnt >= pginfo->num_kpages) || (pginfo->hwpage_cnt >= pginfo->num_hwpages)) { ehca_gen_err("kpage_cnt >= num_kpages, " - "kpage_cnt=%lx num_kpages=%lx " - "hwpage_cnt=%lx " - "num_hwpages=%lx i=%x", + "kpage_cnt=%llx num_kpages=%llx " + "hwpage_cnt=%llx " + "num_hwpages=%llx i=%x", pginfo->kpage_cnt, pginfo->num_kpages, pginfo->hwpage_cnt, @@ -1957,8 +1957,8 @@ static int ehca_set_pagebuf_phys(struct ehca_mr_pginfo *pginfo, (pbuf->addr & ~(pginfo->hwpage_size - 1)) + (pginfo->next_hwpage * pginfo->hwpage_size)); if ( !(*kpage) && pbuf->addr ) { - ehca_gen_err("pbuf->addr=%lx pbuf->size=%lx " - "next_hwpage=%lx", pbuf->addr, + ehca_gen_err("pbuf->addr=%llx pbuf->size=%llx " + "next_hwpage=%llx", pbuf->addr, pbuf->size, pginfo->next_hwpage); return -EFAULT; } @@ -1996,8 +1996,8 @@ static int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, *kpage = phys_to_abs((*fmrlist & ~(pginfo->hwpage_size - 1)) + pginfo->next_hwpage * pginfo->hwpage_size); if ( !(*kpage) ) { - ehca_gen_err("*fmrlist=%lx fmrlist=%p " - "next_listelem=%lx next_hwpage=%lx", + ehca_gen_err("*fmrlist=%llx fmrlist=%p " + "next_listelem=%llx next_hwpage=%llx", *fmrlist, fmrlist, pginfo->u.fmr.next_listelem, pginfo->next_hwpage); @@ -2025,7 +2025,7 @@ static int ehca_set_pagebuf_fmr(struct ehca_mr_pginfo *pginfo, ~(pginfo->hwpage_size - 1)); if (prev + pginfo->u.fmr.fmr_pgsize != p) { ehca_gen_err("uncontiguous fmr pages " - "found prev=%lx p=%lx " + "found prev=%llx p=%llx " "idx=%x", prev, p, i + j); return -EINVAL; } diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index f161cf1..00c1081 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -331,7 +331,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, if (cnt == (nr_q_pages - 1)) { /* last page! 
*/ if (h_ret != expected_hret) { ehca_err(ib_dev, "hipz_qp_register_rpage() " - "h_ret=%li", h_ret); + "h_ret=%lli", h_ret); ret = ehca2ib_return_code(h_ret); goto init_qp_queue1; } @@ -345,7 +345,7 @@ static inline int init_qp_queue(struct ehca_shca *shca, } else { if (h_ret != H_PAGE_REGISTERED) { ehca_err(ib_dev, "hipz_qp_register_rpage() " - "h_ret=%li", h_ret); + "h_ret=%lli", h_ret); ret = ehca2ib_return_code(h_ret); goto init_qp_queue1; } @@ -709,7 +709,7 @@ static struct ehca_qp *internal_create_qp( h_ret = hipz_h_alloc_resource_qp(shca->ipz_hca_handle, &parms); if (h_ret != H_SUCCESS) { - ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%li", + ehca_err(pd->device, "h_alloc_resource_qp() failed h_ret=%lli", h_ret); ret = ehca2ib_return_code(h_ret); goto create_qp_exit1; @@ -1010,7 +1010,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not modify SRQ to INIT " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -1024,7 +1024,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not enable SRQ " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -1038,7 +1038,7 @@ struct ib_srq *ehca_create_srq(struct ib_pd *pd, mqpcb, my_qp->galpas.kernel); if (hret != H_SUCCESS) { ehca_err(pd->device, "Could not modify SRQ to RTR " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, hret); goto create_srq2; } @@ -1078,7 +1078,7 @@ static int prepare_sqe_rts(struct ehca_qp *my_qp, struct ehca_shca *shca, &bad_send_wqe_p, NULL, 2); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "hipz_h_disable_and_get_wqe() failed" - " ehca_qp=%p qp_num=%x h_ret=%li", + " ehca_qp=%p qp_num=%x h_ret=%lli", 
my_qp, qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -1134,7 +1134,7 @@ static int calc_left_cqes(u64 wqe_p, struct ipz_queue *ipz_queue, if (ipz_queue_abs_to_offset(ipz_queue, wqe_p, &q_ofs)) { ehca_gen_err("Invalid offset for calculating left cqes " - "wqe_p=%#lx wqe_v=%p\n", wqe_p, wqe_v); + "wqe_p=%#llx wqe_v=%p\n", wqe_p, wqe_v); return -EFAULT; } @@ -1168,7 +1168,7 @@ static int check_for_left_cqes(struct ehca_qp *my_qp, struct ehca_shca *shca) &send_wqe_p, &recv_wqe_p, 4); if (h_ret != H_SUCCESS) { ehca_err(&shca->ib_device, "disable_and_get_wqe() " - "failed ehca_qp=%p qp_num=%x h_ret=%li", + "failed ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, qp_num, h_ret); return ehca2ib_return_code(h_ret); } @@ -1261,7 +1261,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, mqpcb, my_qp->galpas.kernel); if (h_ret != H_SUCCESS) { ehca_err(ibqp->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, ibqp->qp_num, h_ret); ret = ehca2ib_return_code(h_ret); goto modify_qp_exit1; @@ -1690,7 +1690,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(ibqp->device, "hipz_h_modify_qp() failed h_ret=%li " + ehca_err(ibqp->device, "hipz_h_modify_qp() failed h_ret=%lli " "ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1723,7 +1723,7 @@ static int internal_modify_qp(struct ib_qp *ibqp, ret = ehca2ib_return_code(h_ret); ehca_err(ibqp->device, "ENABLE in context of " "RESET_2_INIT failed! 
Maybe you didn't get " - "a LID h_ret=%li ehca_qp=%p qp_num=%x", + "a LID h_ret=%lli ehca_qp=%p qp_num=%x", h_ret, my_qp, ibqp->qp_num); goto modify_qp_exit2; } @@ -1909,7 +1909,7 @@ int ehca_query_qp(struct ib_qp *qp, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(qp->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, qp->qp_num, h_ret); goto query_qp_exit1; } @@ -2074,7 +2074,7 @@ int ehca_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); - ehca_err(ibsrq->device, "hipz_h_modify_qp() failed h_ret=%li " + ehca_err(ibsrq->device, "hipz_h_modify_qp() failed h_ret=%lli " "ehca_qp=%p qp_num=%x", h_ret, my_qp, my_qp->real_qp_num); } @@ -2108,7 +2108,7 @@ int ehca_query_srq(struct ib_srq *srq, struct ib_srq_attr *srq_attr) if (h_ret != H_SUCCESS) { ret = ehca2ib_return_code(h_ret); ehca_err(srq->device, "hipz_h_query_qp() failed " - "ehca_qp=%p qp_num=%x h_ret=%li", + "ehca_qp=%p qp_num=%x h_ret=%lli", my_qp, my_qp->real_qp_num, h_ret); goto query_srq_exit1; } @@ -2179,7 +2179,7 @@ static int internal_destroy_qp(struct ib_device *dev, struct ehca_qp *my_qp, h_ret = hipz_h_destroy_qp(shca->ipz_hca_handle, my_qp); if (h_ret != H_SUCCESS) { - ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%li " + ehca_err(dev, "hipz_h_destroy_qp() failed h_ret=%lli " "ehca_qp=%p qp_num=%x", h_ret, my_qp, qp_num); return ehca2ib_return_code(h_ret); } diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c index c711268..5a3d96f 100644 --- a/drivers/infiniband/hw/ehca/ehca_reqs.c +++ b/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -822,7 +822,7 @@ static int generate_flush_cqes(struct ehca_qp *my_qp, struct ib_cq *cq, offset = qmap->next_wqe_idx * ipz_queue->qe_size; wqe = (struct ehca_wqe *)ipz_qeit_calc(ipz_queue, offset); if (!wqe) { - ehca_err(cq->device, "Invalid wqe offset=%#lx on " + 
ehca_err(cq->device, "Invalid wqe offset=%#llx on " "qp_num=%#x", offset, my_qp->real_qp_num); return nr; } diff --git a/drivers/infiniband/hw/ehca/ehca_sqp.c b/drivers/infiniband/hw/ehca/ehca_sqp.c index 706d97a..44447aa 100644 --- a/drivers/infiniband/hw/ehca/ehca_sqp.c +++ b/drivers/infiniband/hw/ehca/ehca_sqp.c @@ -85,7 +85,7 @@ u64 ehca_define_sqp(struct ehca_shca *shca, if (ret != H_SUCCESS) { ehca_err(&shca->ib_device, - "Can't define AQP1 for port %x. h_ret=%li", + "Can't define AQP1 for port %x. h_ret=%lli", port, ret); return ret; } diff --git a/drivers/infiniband/hw/ehca/ehca_tools.h b/drivers/infiniband/hw/ehca/ehca_tools.h index 21f7d06..f09914c 100644 --- a/drivers/infiniband/hw/ehca/ehca_tools.h +++ b/drivers/infiniband/hw/ehca/ehca_tools.h @@ -116,7 +116,7 @@ extern int ehca_debug_level; unsigned char *deb = (unsigned char *)(adr); \ for (x = 0; x < l; x += 16) { \ printk(KERN_INFO "EHCA_DMP:%s " format \ - " adr=%p ofs=%04x %016lx %016lx\n", \ + " adr=%p ofs=%04x %016llx %016llx\n", \ __func__, ##args, deb, x, \ *((u64 *)&deb[0]), *((u64 *)&deb[8])); \ deb += 16; \ diff --git a/drivers/infiniband/hw/ehca/ehca_uverbs.c b/drivers/infiniband/hw/ehca/ehca_uverbs.c index e43ed8f..3cb688d 100644 --- a/drivers/infiniband/hw/ehca/ehca_uverbs.c +++ b/drivers/infiniband/hw/ehca/ehca_uverbs.c @@ -114,7 +114,7 @@ static int ehca_mmap_fw(struct vm_area_struct *vma, struct h_galpas *galpas, physical = galpas->user.fw_handle; vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); - ehca_gen_dbg("vsize=%lx physical=%lx", vsize, physical); + ehca_gen_dbg("vsize=%llx physical=%llx", vsize, physical); /* VM_IO | VM_RESERVED are set by remap_pfn_range() */ ret = remap_4k_pfn(vma, vma->vm_start, physical >> EHCA_PAGESHIFT, vma->vm_page_prot); diff --git a/drivers/infiniband/hw/ehca/hcp_if.c b/drivers/infiniband/hw/ehca/hcp_if.c index 415d3a4..7a13a24 100644 --- a/drivers/infiniband/hw/ehca/hcp_if.c +++ b/drivers/infiniband/hw/ehca/hcp_if.c @@ -249,7 +249,7 @@ u64 
hipz_h_alloc_resource_eq(const struct ipz_adapter_handle adapter_handle, *eq_ist = (u32)outs[5]; if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resource - ret=%li ", ret); + ehca_gen_err("Not enough resource - ret=%lli ", ret); return ret; } @@ -287,7 +287,7 @@ u64 hipz_h_alloc_resource_cq(const struct ipz_adapter_handle adapter_handle, hcp_galpas_ctor(&cq->galpas, outs[5], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%li", ret); + ehca_gen_err("Not enough resources. ret=%lli", ret); return ret; } @@ -362,7 +362,7 @@ u64 hipz_h_alloc_resource_qp(const struct ipz_adapter_handle adapter_handle, hcp_galpas_ctor(&parms->galpas, outs[6], outs[6]); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%li", ret); + ehca_gen_err("Not enough resources. ret=%lli", ret); return ret; } @@ -454,7 +454,7 @@ u64 hipz_h_register_rpage_eq(const struct ipz_adapter_handle adapter_handle, const u64 count) { if (count != 1) { - ehca_gen_err("Ppage counter=%lx", count); + ehca_gen_err("Ppage counter=%llx", count); return H_PARAMETER; } return hipz_h_register_rpage(adapter_handle, @@ -489,7 +489,7 @@ u64 hipz_h_register_rpage_cq(const struct ipz_adapter_handle adapter_handle, const struct h_galpa gal) { if (count != 1) { - ehca_gen_err("Page counter=%lx", count); + ehca_gen_err("Page counter=%llx", count); return H_PARAMETER; } @@ -508,7 +508,7 @@ u64 hipz_h_register_rpage_qp(const struct ipz_adapter_handle adapter_handle, const struct h_galpa galpa) { if (count > 1) { - ehca_gen_err("Page counter=%lx", count); + ehca_gen_err("Page counter=%llx", count); return H_PARAMETER; } @@ -557,7 +557,7 @@ u64 hipz_h_modify_qp(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Insufficient resources ret=%li", ret); + ehca_gen_err("Insufficient resources ret=%lli", ret); return ret; } @@ -593,7 +593,7 @@ u64 hipz_h_destroy_qp(const struct ipz_adapter_handle 
adapter_handle, qp->ipz_qp_handle.handle, /* r6 */ 0, 0, 0, 0, 0, 0); if (ret == H_HARDWARE) - ehca_gen_err("HCA not operational. ret=%li", ret); + ehca_gen_err("HCA not operational. ret=%lli", ret); ret = ehca_plpar_hcall_norets(H_FREE_RESOURCE, adapter_handle.handle, /* r4 */ @@ -601,7 +601,7 @@ u64 hipz_h_destroy_qp(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("Resource still in use. ret=%li", ret); + ehca_gen_err("Resource still in use. ret=%lli", ret); return ret; } @@ -636,7 +636,7 @@ u64 hipz_h_define_aqp1(const struct ipz_adapter_handle adapter_handle, *bma_qp_nr = (u32)outs[1]; if (ret == H_ALIAS_EXIST) - ehca_gen_err("AQP1 already exists. ret=%li", ret); + ehca_gen_err("AQP1 already exists. ret=%lli", ret); return ret; } @@ -658,7 +658,7 @@ u64 hipz_h_attach_mcqp(const struct ipz_adapter_handle adapter_handle, 0, 0); if (ret == H_NOT_ENOUGH_RESOURCES) - ehca_gen_err("Not enough resources. ret=%li", ret); + ehca_gen_err("Not enough resources. ret=%lli", ret); return ret; } @@ -697,7 +697,7 @@ u64 hipz_h_destroy_cq(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("H_FREE_RESOURCE failed ret=%li ", ret); + ehca_gen_err("H_FREE_RESOURCE failed ret=%lli ", ret); return ret; } @@ -719,7 +719,7 @@ u64 hipz_h_destroy_eq(const struct ipz_adapter_handle adapter_handle, 0, 0, 0, 0, 0); if (ret == H_RESOURCE) - ehca_gen_err("Resource in use. ret=%li ", ret); + ehca_gen_err("Resource in use. 
ret=%lli ", ret); return ret; } @@ -774,9 +774,9 @@ u64 hipz_h_register_rpage_mr(const struct ipz_adapter_handle adapter_handle, if ((count > 1) && (logical_address_of_page & (EHCA_PAGESIZE-1))) { ehca_gen_err("logical_address_of_page not on a 4k boundary " - "adapter_handle=%lx mr=%p mr_handle=%lx " + "adapter_handle=%llx mr=%p mr_handle=%llx " "pagesize=%x queue_type=%x " - "logical_address_of_page=%lx count=%lx", + "logical_address_of_page=%llx count=%llx", adapter_handle.handle, mr, mr->ipz_mr_handle.handle, pagesize, queue_type, logical_address_of_page, count); -- 1.6.0.4 From sfr at canb.auug.org.au Fri Jan 16 14:46:25 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Sat, 17 Jan 2009 09:46:25 +1100 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: use consistent type In-Reply-To: References: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> Message-ID: <20090117094625.1500c8e7.sfr@canb.auug.org.au> Hi Roland, On Fri, 16 Jan 2009 14:36:27 -0800 Roland Dreier wrote: > > I'm going to apply this for 2.6.29, since the change to the u64 type > went upstream, so ehca spews a bunch of warnings now. Thanks. > I'll also add the following patch to fix all the printk format warnings: > > IB/ehca: Fix printk format warnings from u64 type change Sorry to appear picky, but how is that different from the patch "powerpc: cleanup from powerpc l64 to ll64 change: drivers/infiniband" that I sent to you on Jan 7? -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/
From rdreier at cisco.com Fri Jan 16 14:51:30 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 14:51:30 -0800 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: use consistent type In-Reply-To: <20090117094625.1500c8e7.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Sat, 17 Jan 2009 09:46:25 +1100") References: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> <20090117094625.1500c8e7.sfr@canb.auug.org.au> Message-ID: > Sorry to appear picky, but how is that different from the patch "powerpc: > cleanup from powerpc l64 to ll64 change: drivers/infiniband" that I sent > to you on Jan 7? Well, I didn't lose my version and forget all about it ;) Seriously, I must have lost that patch, sorry about that. I'll dig it out and replace mine so you get credit. - R. From sfr at canb.auug.org.au Fri Jan 16 15:06:32 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Sat, 17 Jan 2009 10:06:32 +1100 Subject: [ofa-general] Re: [PATCH] infiniband/ehca: use consistent type In-Reply-To: References: <20081231141453.45d7f2c1.sfr@canb.auug.org.au> <20090117094625.1500c8e7.sfr@canb.auug.org.au> Message-ID: <20090117100632.61dc13ee.sfr@canb.auug.org.au> Hi Roland, On Fri, 16 Jan 2009 14:51:30 -0800 Roland Dreier wrote: > > > Sorry to appear picky, but how is that different from the patch "powerpc: > > cleanup from powerpc l64 to ll64 change: drivers/infiniband" that I sent > > to you on Jan 7? > > Well, I didn't lose my version and forget all about it ;) I figured it was something simple. > Seriously, I must have lost that patch, sorry about that. I'll dig it > out and replace mine so you get credit. Thanks. /me needs all the credit he can get :-) -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/
From rdreier at cisco.com Fri Jan 16 15:07:40 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 16 Jan 2009 15:07:40 -0800 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus Some fixes to go into 2.6.29-rc3. Nothing too urgent so if this takes a long time due to LCA, it's not a big deal. The bulk of things are for the ehca driver, to fix warning spew caused by the powerpc u64 transition from long to long long. Andrew Morton (1): mlx4_core: Fix min() warning Roland Dreier (4): IPoIB: Fix hang in napi_disable() if P_Key is never found IPoIB: Fix deadlock between ipoib_open() and child interface create IB/mlx4: Fix memory ordering problem when posting LSO sends Merge branches 'ehca', 'ipoib' and 'mlx4' into for-linus Stephen Rothwell (2): IB/ehca: Fix printk format warnings from u64 type change IB/ehca: Use consistent types for ehca_plpar_hcall9() Yossi Etigin (1): IPoIB: Do not print error messages for multicast join retries drivers/infiniband/hw/ehca/ehca_cq.c | 16 ++-- drivers/infiniband/hw/ehca/ehca_hca.c | 2 +- drivers/infiniband/hw/ehca/ehca_irq.c | 18 ++-- drivers/infiniband/hw/ehca/ehca_main.c | 6 +- drivers/infiniband/hw/ehca/ehca_mcast.c | 4 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 144 ++++++++++++------------ drivers/infiniband/hw/ehca/ehca_qp.c | 32 +++--- drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +- drivers/infiniband/hw/ehca/ehca_sqp.c | 2 +- drivers/infiniband/hw/ehca/ehca_tools.h | 2 +- drivers/infiniband/hw/ehca/ehca_uverbs.c | 2 +- drivers/infiniband/hw/ehca/hcp_if.c | 56 +++++----- drivers/infiniband/hw/mlx4/qp.c | 28 +++-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 27 +++-- 
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 11 ++- drivers/net/mlx4/profile.c | 6 +- 17 files changed, 189 insertions(+), 171 deletions(-) From vlad at lists.openfabrics.org Sat Jan 17 03:12:09 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 17 Jan 2009 03:12:09 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090117-0200 daily build status Message-ID: <20090117111209.6F7E0E60B1D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default 
Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dotanba at gmail.com Sat Jan 17 03:38:01 2009 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 17 Jan 2009 13:38:01 +0200 Subject: Re: [ofa-general] local DMA transfer? In-Reply-To: References: Message-ID: <4971C319.9050405@gmail.com> Yicheng Jia wrote: > > Hi Folks, > > Is it possible to do local DMA transfer using QPs in a single port > HCA? Thanks! > > Best, Hi. If the question is whether I can use DMA from local QPs (both QPs are in the same HCA), then the answer is yes. The QPs can be located anywhere, even in the same HCA, or in two different HCAs in the same machine (as long as they are in the same IB fabric). Dotan From YJia at tmriusa.com Sat Jan 17 11:38:59 2009 From: YJia at tmriusa.com (Yicheng Jia) Date: Sat, 17 Jan 2009 13:38:59 -0600 Subject: [ofa-general] local DMA transfer? In-Reply-To: <4971C319.9050405@gmail.com> Message-ID: Hi Dotan, Does the HCA provide any internal route for local DMA, so that a local data transfer doesn't have to go out of the HCA port as regular QP traffic does? In other words, is it true that using QPs for local DMA transfer is not efficient? Thanks! Yicheng Jia Software Engineer Toshiba Medical Research Institute USA, Inc. Phone: 847-573-6625 Fax: 847-367-5272 Dotan Barak 01/17/2009 05:38 AM To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] local DMA transfer? Yicheng Jia wrote: > > Hi Folks, > > Is it possible to do local DMA transfer using QPs in a single port > HCA? Thanks! > > Best, Hi. 
If the question is whether I can use DMA from local QPs (both QPs are in the same HCA), then the answer is yes. The QPs can be located anywhere, even in the same HCA, or in two different HCAs in the same machine (as long as they are in the same IB fabric). Dotan _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ From arlin.r.davis at intel.com Sat Jan 17 11:50:38 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Sat, 17 Jan 2009 11:50:38 -0800 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. Message-ID: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> Ok, here is revision 2 of the libibmad WinOF portability patches. I eliminated all #ifdef _WIN32 crap and tried to limit the changes by adding the OS-dependent file mad_osd.h. With these changes we could share the same code base for both OFED and WinOF. Please review and consider accepting this patch set. [PATCH 1/3] libibmad: add os dependent definitions. [PATCH 2/3] field.c remove c99 definitions, better portability with WinOF. [PATCH 3/3] Minor changes to allow portability to WinOF infiniband/mad_osd.h added to provide support for OS-specific definitions for portability. With these changes, WinOF can pull directly from the OFED git tree and share a common code base with minimal changes to mad.h and the source tree. 
mad.h modifications include MAD_EXPORT for export declarations where appropriate. Datatype llu changed to ULL for 64bit constants. makefile.am modified to include new linux version of mad_osd.h Signed-off-by: Arlin Davis --- libibmad/Makefile.am | 7 +- libibmad/include/infiniband/mad.h | 120 ++++++++++++++++----------------- libibmad/include/infiniband/mad_osd.h | 48 +++++++++++++ 3 files changed, 109 insertions(+), 66 deletions(-) create mode 100644 libibmad/include/infiniband/mad_osd.h diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am index beae1a4..8dea157 100644 --- a/libibmad/Makefile.am +++ b/libibmad/Makefile.am @@ -1,7 +1,7 @@ SUBDIRS = . -INCLUDES = -I$(srcdir)/include/infiniband -I$(includedir) +INCLUDES = -I$(srcdir)/include -I$(srcdir)/include/infiniband -I$(includedir) lib_LTLIBRARIES = libibmad.la @@ -23,9 +23,10 @@ libibmad_la_DEPENDENCIES = $(srcdir)/src/libibmad.map libibmadincludedir = $(includedir)/infiniband -libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h +libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h $(srcdir)/include/infiniband/mad_osd.h -EXTRA_DIST = $(srcdir)/include/infiniband/mad.h libibmad.spec.in libibmad.spec \ +EXTRA_DIST = $(srcdir)/include/infiniband/mad.h $(srcdir)/include/infiniband/mad_osd.h \ + libibmad.spec.in libibmad.spec \ $(srcdir)/src/libibmad.map libibmad.ver autogen.sh dist-hook: diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 0a962c0..fe607a7 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -33,13 +33,7 @@ #ifndef _MAD_H_ #define _MAD_H_ -#include -#include -#include -#include -#include -#include -#include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -52,7 +46,7 @@ BEGIN_C_DECLS #define IB_SUBNET_PATH_HOPS_MAX 64 -#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000llu +#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000ULL #define IB_DEFAULT_QP1_QKEY 0x80010000 #define IB_MAD_SIZE 256 
@@ -627,10 +621,10 @@ enum { /******************************************************************************/ /* portid.c */ -char * portid2str(ib_portid_t *portid); -int portid2portnum(ib_portid_t *portid); -int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid); -char * drpath2str(ib_dr_path_t *path, char *dstr, size_t dstr_size); +MAD_EXPORT char * portid2str(ib_portid_t *portid); +MAD_EXPORT int portid2portnum(ib_portid_t *portid); +MAD_EXPORT int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid); +MAD_EXPORT char * drpath2str(ib_dr_path_t *path, char *dstr, size_t dstr_size); static inline int ib_portid_set(ib_portid_t *portid, int lid, int qp, int qkey) @@ -644,44 +638,44 @@ ib_portid_set(ib_portid_t *portid, int lid, int qp, int qkey) } /* fields.c */ -uint32_t mad_get_field(void *buf, int base_offs, int field); -void mad_set_field(void *buf, int base_offs, int field, uint32_t val); +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); +MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, uint32_t val); /* field must be byte aligned */ -uint64_t mad_get_field64(void *buf, int base_offs, int field); -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val); -void mad_set_array(void *buf, int base_offs, int field, void *val); -void mad_get_array(void *buf, int base_offs, int field, void *val); -void mad_decode_field(uint8_t *buf, int field, void *val); -void mad_encode_field(uint8_t *buf, int field, void *val); -int mad_print_field(int field, const char *name, void *val); -char *mad_dump_field(int field, char *buf, int bufsz, void *val); -char *mad_dump_val(int field, char *buf, int bufsz, void *val); +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, uint64_t val); +MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); +MAD_EXPORT void 
mad_get_array(void *buf, int base_offs, int field, void *val); +MAD_EXPORT void mad_decode_field(uint8_t *buf, int field, void *val); +MAD_EXPORT void mad_encode_field(uint8_t *buf, int field, void *val); +MAD_EXPORT int mad_print_field(int field, const char *name, void *val); +MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); +MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); /* mad.c */ -void *mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data); -uint64_t mad_trid(void); -int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); +MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data); +MAD_EXPORT uint64_t mad_trid(void); +MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); /* register.c */ -int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); -int mad_register_client(int mgmt, uint8_t rmpp_version); -int mad_register_server(int mgmt, uint8_t rmpp_version, +MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); +MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version); +MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version, long method_mask[16/sizeof(long)], uint32_t class_oui); -int mad_class_agent(int mgmt); -int mad_agent_class(int agent); +MAD_EXPORT int mad_class_agent(int mgmt); +MAD_EXPORT int mad_agent_class(int agent); /* serv.c */ -int mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, +MAD_EXPORT int mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); -void * mad_receive(void *umad, int timeout); -int mad_respond(void *umad, ib_portid_t *portid, uint32_t rstatus); -void * mad_alloc(void); -void mad_free(void *umad); +MAD_EXPORT void * mad_receive(void *umad, int timeout); +MAD_EXPORT int mad_respond(void *umad, ib_portid_t *portid, uint32_t 
rstatus); +MAD_EXPORT void * mad_alloc(void); +MAD_EXPORT void mad_free(void *umad); /* vendor.c */ -uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, - ib_vendor_call_t *call); +MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, + ib_vendor_call_t *call); static inline int mad_is_vendor_range1(int mgmt) @@ -696,29 +690,29 @@ mad_is_vendor_range2(int mgmt) } /* rpc.c */ -int madrpc_portid(void); -int madrpc_set_retries(int retries); -int madrpc_set_timeout(int timeout); -void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata); -void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, +MAD_EXPORT int madrpc_portid(void); +MAD_EXPORT int madrpc_set_retries(int retries); +MAD_EXPORT int madrpc_set_timeout(int timeout); +void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata); +void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); -void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, +MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); void madrpc_save_mad(void *madbuf, int len); -void madrpc_show_errors(int set); +MAD_EXPORT void madrpc_show_errors(int set); -void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, int num_classes); void mad_rpc_close_port(void *ibmad_port); -void * mad_rpc(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, +void * mad_rpc(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata); -void * mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, +void * mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); /* smp.c */ -uint8_t * smp_query(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, +MAD_EXPORT uint8_t * smp_query(void *buf, ib_portid_t *id, unsigned 
attrid, unsigned mod, unsigned timeout); -uint8_t * smp_set(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, +MAD_EXPORT uint8_t * smp_set(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, unsigned timeout); uint8_t * smp_query_via(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, unsigned timeout, const void *srcport); @@ -730,18 +724,18 @@ uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, unsigned timeout); uint8_t * sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, unsigned timeout); -int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf); /* returns lid */ int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf); /* resolve.c */ -int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); -int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, +MAD_EXPORT int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); +MAD_EXPORT int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, ib_portid_t *sm_id, int timeout); -int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, +MAD_EXPORT int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, int dest_type, ib_portid_t *sm_id); -int ib_resolve_self(ib_portid_t *portid, int *portnum, ibmad_gid_t *gid); +MAD_EXPORT int ib_resolve_self(ib_portid_t *portid, int *portnum, ibmad_gid_t *gid); int ib_resolve_smlid_via(ib_portid_t *sm_id, int timeout, const void *srcport); @@ -755,19 +749,19 @@ int ib_resolve_self_via(ib_portid_t *portid, int *portnum, ibmad_gid_t *gid, const void *srcport); /* gs.c */ -uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t *dest, int port, +MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t *dest, int port, unsigned timeout); -uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, 
int port, +MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, int port, unsigned timeout); -uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, int port, +MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, unsigned timeout); -uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t *dest, int port, +MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t *dest, int port, unsigned timeout); -uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t *dest, int port, +MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t *dest, int port, unsigned mask, unsigned timeout); -uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *dest, int port, +MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *dest, int port, unsigned timeout); -uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, int port, +MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, int port, unsigned timeout); uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t *dest, int port, @@ -785,7 +779,7 @@ uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t *dest, int por uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t *dest, int port, unsigned timeout, const void *srcport); /* dump.c */ -ib_mad_dump_fn +MAD_EXPORT ib_mad_dump_fn mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, mad_dump_bitfield, mad_dump_array, mad_dump_string, mad_dump_linkwidth, mad_dump_linkwidthsup, mad_dump_linkwidthen, diff --git a/libibmad/include/infiniband/mad_osd.h b/libibmad/include/infiniband/mad_osd.h new file mode 100644 index 0000000..45741c5 --- /dev/null +++ b/libibmad/include/infiniband/mad_osd.h @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2009 Intel Corporation All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#ifndef _MAD_OSD_H_ +#define _MAD_OSD_H_ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define MAD_EXPORT + +#endif /* _MAD_OSD_H_ */ -- 1.5.2.5 From arlin.r.davis at intel.com Sat Jan 17 11:51:12 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sat, 17 Jan 2009 11:51:12 -0800 Subject: [ofa-general] [PATCH 2/3 v2] field.c, remove c99 definitions, better portability with WinOF. Message-ID: - Remove c99 definitions in the ib_mad_f structure. - Remove unnecessary include file - _mad_dump: remove c99 structure initialization. 
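For readers following the three points above, here is a minimal sketch of the initializer change, not the actual libibmad code. All `demo_` names are hypothetical stand-ins for the field enum and the field descriptor struct. It assumes the table must keep one entry per enum value, in enum order. It also assumes the members of a local descriptor that are not assigned explicitly should start out zeroed, as a C99 designated initializer would have left them.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the field enum; not the real libibmad names. */
enum demo_field {
	DEMO_NO_FIELD = 0,
	DEMO_GID_PREFIX_F,
	DEMO_RESERVED_F,	/* enum value with no real descriptor */
	DEMO_GID_GUID_F,
	DEMO_FIELD_LAST_
};

/* Hypothetical stand-in for the field descriptor struct. */
typedef struct demo_field_desc {
	int bitoffs;
	int bitlen;
	const char *name;
} demo_field_desc_t;

/*
 * Positional initializers: entry i describes enum value i.
 * Without C99 "[INDEX] = {...}" syntax nothing ties an entry to its
 * enum value, so gaps need explicit placeholders and the order must
 * follow the enum exactly.
 */
static const demo_field_desc_t demo_fields[] = {
	{0, 0, NULL},		/* DEMO_NO_FIELD - reserved as invalid */
	{0, 64, "GidPrefix"},	/* DEMO_GID_PREFIX_F */
	{0, 0, NULL},		/* DEMO_RESERVED_F - placeholder */
	{64, 64, "GidGuid"},	/* DEMO_GID_GUID_F */
	{0, 0, NULL}		/* DEMO_FIELD_LAST_ */
};

/* Returns 1 when the table has exactly one entry per enum value. */
static int demo_table_matches_enum(void)
{
	return (int)(sizeof(demo_fields) / sizeof(demo_fields[0])) ==
	    DEMO_FIELD_LAST_ + 1;
}

/* Looks up a field name by enum value; "(none)" for placeholders. */
static const char *demo_field_name(enum demo_field f)
{
	if ((int)f < 0 || f > DEMO_FIELD_LAST_ || !demo_fields[f].name)
		return "(none)";
	return demo_fields[f].name;
}

/*
 * Local struct without designated initializers: the positional
 * initializer zeroes every member first, then the interesting member
 * is assigned. A bare "demo_field_desc_t f;" would leave the members
 * that are not assigned with indeterminate values.
 */
static demo_field_desc_t demo_make_field(int valsz)
{
	demo_field_desc_t f = {0, 0, NULL};

	f.bitlen = valsz * 8;
	return f;
}
```

A check such as `demo_table_matches_enum()` is one cheap way to notice a missing or extra entry after the index labels are gone. It does not catch two entries that were swapped, so per-entry comments naming the enum value remain useful.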
Signed-off-by: Arlin Davis --- libibmad/src/fields.c | 430 +++++++++++++++++++++++++----------------------- 1 files changed, 224 insertions(+), 206 deletions(-) diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 5cebd01..17d0a03 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -37,7 +37,6 @@ #include #include -#include #include #include @@ -54,10 +53,10 @@ #define BE_TO_BITSOFFS(o, w) (((o) & ~31) | ((32 - ((o) & 31) - (w)))) static const ib_field_t ib_mad_f [] = { - [0] {0, 0}, /* IB_NO_FIELD - reserved as invalid */ + {0, 0}, /* IB_NO_FIELD - reserved as invalid */ - [IB_GID_PREFIX_F] {0, 64, "GidPrefix", mad_dump_rhex}, - [IB_GID_GUID_F] {64, 64, "GidGuid", mad_dump_rhex}, + {0, 64, "GidPrefix", mad_dump_rhex}, + {64, 64, "GidGuid", mad_dump_rhex}, /* * MAD: common MAD fields (IB spec 13.4.2) @@ -67,300 +66,316 @@ static const ib_field_t ib_mad_f [] = { */ /* first MAD word (0-3 bytes) */ - [IB_MAD_METHOD_F] {BE_OFFS(0, 7), "MadMethod", mad_dump_hex}, /* TODO: add dumper */ - [IB_MAD_RESPONSE_F] {BE_OFFS(7, 1), "MadIsResponse", mad_dump_uint}, /* TODO: add dumper */ - [IB_MAD_CLASSVER_F] {BE_OFFS(8, 8), "MadClassVersion", mad_dump_uint}, - [IB_MAD_MGMTCLASS_F] {BE_OFFS(16, 8), "MadMgmtClass", mad_dump_uint}, /* TODO: add dumper */ - [IB_MAD_BASEVER_F] {BE_OFFS(24, 8), "MadBaseVersion", mad_dump_uint}, + {BE_OFFS(0, 7), "MadMethod", mad_dump_hex}, /* TODO: add dumper */ + {BE_OFFS(7, 1), "MadIsResponse", mad_dump_uint}, /* TODO: add dumper */ + {BE_OFFS(8, 8), "MadClassVersion", mad_dump_uint}, + {BE_OFFS(16, 8), "MadMgmtClass", mad_dump_uint}, /* TODO: add dumper */ + {BE_OFFS(24, 8), "MadBaseVersion", mad_dump_uint}, /* second MAD word (4-7 bytes) */ - [IB_MAD_STATUS_F] {BE_OFFS(48, 16), "MadStatus", mad_dump_hex}, /* TODO: add dumper */ + {BE_OFFS(48, 16), "MadStatus", mad_dump_hex}, /* TODO: add dumper */ /* DR SMP only */ - [IB_DRSMP_HOPCNT_F] {BE_OFFS(32, 8), "DrSmpHopCnt", mad_dump_uint}, - [IB_DRSMP_HOPPTR_F] {BE_OFFS(40, 
8), "DrSmpHopPtr", mad_dump_uint}, - [IB_DRSMP_STATUS_F] {BE_OFFS(48, 15), "DrSmpStatus", mad_dump_hex}, /* TODO: add dumper */ - [IB_DRSMP_DIRECTION_F] {BE_OFFS(63, 1), "DrSmpDirection", mad_dump_uint}, /* TODO: add dumper */ + {BE_OFFS(32, 8), "DrSmpHopCnt", mad_dump_uint}, + {BE_OFFS(40, 8), "DrSmpHopPtr", mad_dump_uint}, + {BE_OFFS(48, 15), "DrSmpStatus", mad_dump_hex}, /* TODO: add dumper */ + {BE_OFFS(63, 1), "DrSmpDirection", mad_dump_uint}, /* TODO: add dumper */ /* words 3,4,5,6 (8-23 bytes) */ - [IB_MAD_TRID_F] {64, 64, "MadTRID", mad_dump_hex}, - [IB_MAD_ATTRID_F] {BE_OFFS(144, 16), "MadAttr", mad_dump_hex}, /* TODO: add dumper */ - [IB_MAD_ATTRMOD_F] {160, 32, "MadModifier", mad_dump_hex}, /* TODO: add dumper */ + {64, 64, "MadTRID", mad_dump_hex}, + {BE_OFFS(144, 16), "MadAttr", mad_dump_hex}, /* TODO: add dumper */ + {160, 32, "MadModifier", mad_dump_hex}, /* TODO: add dumper */ /* word 7,8 (24-31 bytes) */ - [IB_MAD_MKEY_F] {196, 64, "MadMkey", mad_dump_hex}, + {196, 64, "MadMkey", mad_dump_hex}, /* word 9 (32-37 bytes) */ - [IB_DRSMP_DRDLID_F] {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, - [IB_DRSMP_DRSLID_F] {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex}, + {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, + {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex}, + + /* word 10,11 (36-43 bytes) */ + {0, 0}, /* IB_SA_MKEY_F - reserved as invalid */ /* word 12 (44-47 bytes) */ - [IB_SA_ATTROFFS_F] {BE_OFFS(46*8, 16), "SaAttrOffs", mad_dump_uint}, + {BE_OFFS(46*8, 16), "SaAttrOffs", mad_dump_uint}, /* word 13,14 (48-55 bytes) */ - [IB_SA_COMPMASK_F] {48*8, 64, "SaCompMask", mad_dump_hex}, + {48*8, 64, "SaCompMask", mad_dump_hex}, /* word 13,14 (56-255 bytes) */ - [IB_SA_DATA_F] {56*8, (256-56)*8, "SaData", mad_dump_hex}, - - [IB_DRSMP_PATH_F] {1024, 512, "DrSmpPath", mad_dump_hex}, - [IB_DRSMP_RPATH_F] {1536, 512, "DrSmpRetPath", mad_dump_hex}, - - [IB_GS_DATA_F] {64*8, (256-64) * 8, "GsData", mad_dump_hex}, - + {56*8, (256-56)*8, "SaData", mad_dump_hex}, + + /* 
bytes 64 - 127 */ + {0, 0}, /* IB_SM_DATA_F - reserved as invalid */ + + /* bytes 64 - 256 */ + {64*8, (256-64) * 8, "GsData", mad_dump_hex}, + + /* bytes 128 - 191 */ + {1024, 512, "DrSmpPath", mad_dump_hex}, + + /* bytes 192 - 255 */ + {1536, 512, "DrSmpRetPath", mad_dump_hex}, + /* * PortInfo fields: */ - [IB_PORT_MKEY_F] {0, 64, "Mkey", mad_dump_hex}, - [IB_PORT_GID_PREFIX_F] {64, 64, "GidPrefix", mad_dump_hex}, - [IB_PORT_LID_F] {BITSOFFS(128, 16), "Lid", mad_dump_hex}, - [IB_PORT_SMLID_F] {BITSOFFS(144, 16), "SMLid", mad_dump_hex}, - [IB_PORT_CAPMASK_F] {160, 32, "CapMask", mad_dump_portcapmask}, - [IB_PORT_DIAG_F] {BITSOFFS(192, 16), "DiagCode", mad_dump_hex}, - [IB_PORT_MKEY_LEASE_F] {BITSOFFS(208, 16), "MkeyLeasePeriod", mad_dump_uint}, - [IB_PORT_LOCAL_PORT_F] {BITSOFFS(224, 8), "LocalPort", mad_dump_uint}, - [IB_PORT_LINK_WIDTH_ENABLED_F] {BITSOFFS(232, 8), "LinkWidthEnabled", mad_dump_linkwidthen}, - [IB_PORT_LINK_WIDTH_SUPPORTED_F] {BITSOFFS(240, 8), "LinkWidthSupported", mad_dump_linkwidthsup}, - [IB_PORT_LINK_WIDTH_ACTIVE_F] {BITSOFFS(248, 8), "LinkWidthActive", mad_dump_linkwidth}, - [IB_PORT_LINK_SPEED_SUPPORTED_F] {BITSOFFS(256, 4), "LinkSpeedSupported", mad_dump_linkspeedsup}, - [IB_PORT_STATE_F] {BITSOFFS(260, 4), "LinkState", mad_dump_portstate}, - [IB_PORT_PHYS_STATE_F] {BITSOFFS(264, 4), "PhysLinkState", mad_dump_physportstate}, - [IB_PORT_LINK_DOWN_DEF_F] {BITSOFFS(268, 4), "LinkDownDefState", mad_dump_linkdowndefstate}, - [IB_PORT_MKEY_PROT_BITS_F] {BITSOFFS(272, 2), "ProtectBits", mad_dump_uint}, - [IB_PORT_LMC_F] {BITSOFFS(277, 3), "LMC", mad_dump_uint}, - [IB_PORT_LINK_SPEED_ACTIVE_F] {BITSOFFS(280, 4), "LinkSpeedActive", mad_dump_linkspeed}, - [IB_PORT_LINK_SPEED_ENABLED_F] {BITSOFFS(284, 4), "LinkSpeedEnabled", mad_dump_linkspeeden}, - [IB_PORT_NEIGHBOR_MTU_F] {BITSOFFS(288, 4), "NeighborMTU", mad_dump_mtu}, - [IB_PORT_SMSL_F] {BITSOFFS(292, 4), "SMSL", mad_dump_uint}, - [IB_PORT_VL_CAP_F] {BITSOFFS(296, 4), "VLCap", mad_dump_vlcap}, - 
[IB_PORT_INIT_TYPE_F] {BITSOFFS(300, 4), "InitType", mad_dump_hex}, - [IB_PORT_VL_HIGH_LIMIT_F] {BITSOFFS(304, 8), "VLHighLimit", mad_dump_uint}, - [IB_PORT_VL_ARBITRATION_HIGH_CAP_F] {BITSOFFS(312, 8), "VLArbHighCap", mad_dump_uint}, - [IB_PORT_VL_ARBITRATION_LOW_CAP_F] {BITSOFFS(320, 8), "VLArbLowCap", mad_dump_uint}, - - [IB_PORT_INIT_TYPE_REPLY_F] {BITSOFFS(328, 4), "InitReply", mad_dump_hex}, - [IB_PORT_MTU_CAP_F] {BITSOFFS(332, 4), "MtuCap", mad_dump_mtu}, - [IB_PORT_VL_STALL_COUNT_F] {BITSOFFS(336, 3), "VLStallCount", mad_dump_uint}, - [IB_PORT_HOQ_LIFE_F] {BITSOFFS(339, 5), "HoqLife", mad_dump_uint}, - [IB_PORT_OPER_VLS_F] {BITSOFFS(344, 4), "OperVLs", mad_dump_opervls}, - [IB_PORT_PART_EN_INB_F] {BITSOFFS(348, 1), "PartEnforceInb", mad_dump_uint}, - [IB_PORT_PART_EN_OUTB_F] {BITSOFFS(349, 1), "PartEnforceOutb", mad_dump_uint}, - [IB_PORT_FILTER_RAW_INB_F] {BITSOFFS(350, 1), "FilterRawInb", mad_dump_uint}, - [IB_PORT_FILTER_RAW_OUTB_F] {BITSOFFS(351, 1), "FilterRawOutb", mad_dump_uint}, - [IB_PORT_MKEY_VIOL_F] {BITSOFFS(352, 16), "MkeyViolations", mad_dump_uint}, - [IB_PORT_PKEY_VIOL_F] {BITSOFFS(368, 16), "PkeyViolations", mad_dump_uint}, - [IB_PORT_QKEY_VIOL_F] {BITSOFFS(384, 16), "QkeyViolations", mad_dump_uint}, - [IB_PORT_GUID_CAP_F] {BITSOFFS(400, 8), "GuidCap", mad_dump_uint}, - [IB_PORT_CLIENT_REREG_F] {BITSOFFS(408, 1), "ClientReregister", mad_dump_uint}, - [IB_PORT_SUBN_TIMEOUT_F] {BITSOFFS(411, 5), "SubnetTimeout", mad_dump_uint}, - [IB_PORT_RESP_TIME_VAL_F] {BITSOFFS(419, 5), "RespTimeVal", mad_dump_uint}, - [IB_PORT_LOCAL_PHYS_ERR_F] {BITSOFFS(424, 4), "LocalPhysErr", mad_dump_uint}, - [IB_PORT_OVERRUN_ERR_F] {BITSOFFS(428, 4), "OverrunErr", mad_dump_uint}, - [IB_PORT_MAX_CREDIT_HINT_F] {BITSOFFS(432, 16), "MaxCreditHint", mad_dump_uint}, - [IB_PORT_LINK_ROUND_TRIP_F] {BITSOFFS(456, 24), "RoundTrip", mad_dump_uint}, + {0, 64, "Mkey", mad_dump_hex}, + {64, 64, "GidPrefix", mad_dump_hex}, + {BITSOFFS(128, 16), "Lid", mad_dump_hex}, + 
{BITSOFFS(144, 16), "SMLid", mad_dump_hex}, + {160, 32, "CapMask", mad_dump_portcapmask}, + {BITSOFFS(192, 16), "DiagCode", mad_dump_hex}, + {BITSOFFS(208, 16), "MkeyLeasePeriod", mad_dump_uint}, + {BITSOFFS(224, 8), "LocalPort", mad_dump_uint}, + {BITSOFFS(232, 8), "LinkWidthEnabled", mad_dump_linkwidthen}, + {BITSOFFS(240, 8), "LinkWidthSupported", mad_dump_linkwidthsup}, + {BITSOFFS(248, 8), "LinkWidthActive", mad_dump_linkwidth}, + {BITSOFFS(256, 4), "LinkSpeedSupported", mad_dump_linkspeedsup}, + {BITSOFFS(260, 4), "LinkState", mad_dump_portstate}, + {BITSOFFS(264, 4), "PhysLinkState", mad_dump_physportstate}, + {BITSOFFS(268, 4), "LinkDownDefState", mad_dump_linkdowndefstate}, + {BITSOFFS(272, 2), "ProtectBits", mad_dump_uint}, + {BITSOFFS(277, 3), "LMC", mad_dump_uint}, + {BITSOFFS(280, 4), "LinkSpeedActive", mad_dump_linkspeed}, + {BITSOFFS(284, 4), "LinkSpeedEnabled", mad_dump_linkspeeden}, + {BITSOFFS(288, 4), "NeighborMTU", mad_dump_mtu}, + {BITSOFFS(292, 4), "SMSL", mad_dump_uint}, + {BITSOFFS(296, 4), "VLCap", mad_dump_vlcap}, + {BITSOFFS(300, 4), "InitType", mad_dump_hex}, + {BITSOFFS(304, 8), "VLHighLimit", mad_dump_uint}, + {BITSOFFS(312, 8), "VLArbHighCap", mad_dump_uint}, + {BITSOFFS(320, 8), "VLArbLowCap", mad_dump_uint}, + {BITSOFFS(328, 4), "InitReply", mad_dump_hex}, + {BITSOFFS(332, 4), "MtuCap", mad_dump_mtu}, + {BITSOFFS(336, 3), "VLStallCount", mad_dump_uint}, + {BITSOFFS(339, 5), "HoqLife", mad_dump_uint}, + {BITSOFFS(344, 4), "OperVLs", mad_dump_opervls}, + {BITSOFFS(348, 1), "PartEnforceInb", mad_dump_uint}, + {BITSOFFS(349, 1), "PartEnforceOutb", mad_dump_uint}, + {BITSOFFS(350, 1), "FilterRawInb", mad_dump_uint}, + {BITSOFFS(351, 1), "FilterRawOutb", mad_dump_uint}, + {BITSOFFS(352, 16), "MkeyViolations", mad_dump_uint}, + {BITSOFFS(368, 16), "PkeyViolations", mad_dump_uint}, + {BITSOFFS(384, 16), "QkeyViolations", mad_dump_uint}, + {BITSOFFS(400, 8), "GuidCap", mad_dump_uint}, + {BITSOFFS(408, 1), "ClientReregister", mad_dump_uint}, 
+ {BITSOFFS(411, 5), "SubnetTimeout", mad_dump_uint}, + {BITSOFFS(419, 5), "RespTimeVal", mad_dump_uint}, + {BITSOFFS(424, 4), "LocalPhysErr", mad_dump_uint}, + {BITSOFFS(428, 4), "OverrunErr", mad_dump_uint}, + {BITSOFFS(432, 16), "MaxCreditHint", mad_dump_uint}, + {BITSOFFS(456, 24), "RoundTrip", mad_dump_uint}, + {0, 0}, /* IB_PORT_LAST_F */ /* * NodeInfo fields: */ - [IB_NODE_BASE_VERS_F] {BITSOFFS(0,8), "BaseVers", mad_dump_uint}, - [IB_NODE_CLASS_VERS_F] {BITSOFFS(8,8), "ClassVers", mad_dump_uint}, - [IB_NODE_TYPE_F] {BITSOFFS(16,8), "NodeType", mad_dump_node_type}, - [IB_NODE_NPORTS_F] {BITSOFFS(24,8), "NumPorts", mad_dump_uint}, - [IB_NODE_SYSTEM_GUID_F] {32, 64, "SystemGuid", mad_dump_hex}, - [IB_NODE_GUID_F] {96, 64, "Guid", mad_dump_hex}, - [IB_NODE_PORT_GUID_F] {160, 64, "PortGuid", mad_dump_hex}, - [IB_NODE_PARTITION_CAP_F] {BITSOFFS(224,16), "PartCap", mad_dump_uint}, - [IB_NODE_DEVID_F] {BITSOFFS(240,16), "DevId", mad_dump_hex}, - [IB_NODE_REVISION_F] {256, 32, "Revision", mad_dump_hex}, - [IB_NODE_LOCAL_PORT_F] {BITSOFFS(288,8), "LocalPort", mad_dump_uint}, - [IB_NODE_VENDORID_F] {BITSOFFS(296,24), "VendorId", mad_dump_hex}, + {BITSOFFS(0,8), "BaseVers", mad_dump_uint}, + {BITSOFFS(8,8), "ClassVers", mad_dump_uint}, + {BITSOFFS(16,8), "NodeType", mad_dump_node_type}, + {BITSOFFS(24,8), "NumPorts", mad_dump_uint}, + {32, 64, "SystemGuid", mad_dump_hex}, + {96, 64, "Guid", mad_dump_hex}, + {160, 64, "PortGuid", mad_dump_hex}, + {BITSOFFS(224,16), "PartCap", mad_dump_uint}, + {BITSOFFS(240,16), "DevId", mad_dump_hex}, + {256, 32, "Revision", mad_dump_hex}, + {BITSOFFS(288,8), "LocalPort", mad_dump_uint}, + {BITSOFFS(296,24), "VendorId", mad_dump_hex}, + {0, 0}, /* IB_NODE_LAST_F */ + /* * SwitchInfo fields: */ - [IB_SW_LINEAR_FDB_CAP_F] {BITSOFFS(0, 16), "LinearFdbCap", mad_dump_uint}, - [IB_SW_RANDOM_FDB_CAP_F] {BITSOFFS(16, 16), "RandomFdbCap", mad_dump_uint}, - [IB_SW_MCAST_FDB_CAP_F] {BITSOFFS(32, 16), "McastFdbCap", mad_dump_uint}, - 
[IB_SW_LINEAR_FDB_TOP_F] {BITSOFFS(48, 16), "LinearFdbTop", mad_dump_uint}, - [IB_SW_DEF_PORT_F] {BITSOFFS(64, 8), "DefPort", mad_dump_uint}, - [IB_SW_DEF_MCAST_PRIM_F] {BITSOFFS(72, 8), "DefMcastPrimPort", mad_dump_uint}, - [IB_SW_DEF_MCAST_NOT_PRIM_F] {BITSOFFS(80, 8), "DefMcastNotPrimPort", mad_dump_uint}, - [IB_SW_LIFE_TIME_F] {BITSOFFS(88, 5), "LifeTime", mad_dump_uint}, - [IB_SW_STATE_CHANGE_F] {BITSOFFS(93, 1), "StateChange", mad_dump_uint}, - [IB_SW_LIDS_PER_PORT_F] {BITSOFFS(96,16), "LidsPerPort", mad_dump_uint}, - [IB_SW_PARTITION_ENFORCE_CAP_F] {BITSOFFS(112, 16), "PartEnforceCap", mad_dump_uint}, - [IB_SW_PARTITION_ENF_INB_F] {BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint}, - [IB_SW_PARTITION_ENF_OUTB_F] {BITSOFFS(129, 1), "OutboundPartEnf", mad_dump_uint}, - [IB_SW_FILTER_RAW_INB_F] {BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint}, - [IB_SW_FILTER_RAW_OUTB_F] {BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint}, - [IB_SW_ENHANCED_PORT0_F] {BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint}, + {BITSOFFS(0, 16), "LinearFdbCap", mad_dump_uint}, + {BITSOFFS(16, 16), "RandomFdbCap", mad_dump_uint}, + {BITSOFFS(32, 16), "McastFdbCap", mad_dump_uint}, + {BITSOFFS(48, 16), "LinearFdbTop", mad_dump_uint}, + {BITSOFFS(64, 8), "DefPort", mad_dump_uint}, + {BITSOFFS(72, 8), "DefMcastPrimPort", mad_dump_uint}, + {BITSOFFS(80, 8), "DefMcastNotPrimPort", mad_dump_uint}, + {BITSOFFS(88, 5), "LifeTime", mad_dump_uint}, + {BITSOFFS(93, 1), "StateChange", mad_dump_uint}, + {BITSOFFS(96,16), "LidsPerPort", mad_dump_uint}, + {BITSOFFS(112, 16), "PartEnforceCap", mad_dump_uint}, + {BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint}, + {BITSOFFS(129, 1), "OutboundPartEnf", mad_dump_uint}, + {BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint}, + {BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint}, + {BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint}, + {0, 0}, /* IB_SW_LAST_F */ /* * SwitchLinearForwardingTable fields: */ - [IB_LINEAR_FORW_TBL_F] {0, 512, 
"LinearForwTbl", mad_dump_array}, + {0, 512, "LinearForwTbl", mad_dump_array}, /* * SwitchMulticastForwardingTable fields: */ - [IB_MULTICAST_FORW_TBL_F] {0, 512, "MulticastForwTbl", mad_dump_array}, + {0, 512, "MulticastForwTbl", mad_dump_array}, /* - * Notice/Trap fields + * NodeDescription fields: */ - [IB_NOTICE_IS_GENERIC_F] {BITSOFFS(0, 1), "NoticeIsGeneric", mad_dump_uint}, - [IB_NOTICE_TYPE_F] {BITSOFFS(1, 7), "NoticeType", mad_dump_uint}, - [IB_NOTICE_PRODUCER_F] {BITSOFFS(8, 24), "NoticeProducerType", mad_dump_node_type}, - [IB_NOTICE_TRAP_NUMBER_F] {BITSOFFS(32, 16), "NoticeTrapNumber", mad_dump_uint}, - [IB_NOTICE_ISSUER_LID_F] {BITSOFFS(48, 16), "NoticeIssuerLID", mad_dump_uint}, - [IB_NOTICE_TOGGLE_F] {BITSOFFS(64, 1), "NoticeToggle", mad_dump_uint}, - [IB_NOTICE_COUNT_F] {BITSOFFS(65, 15), "NoticeCount", mad_dump_uint}, - [IB_NOTICE_DATA_DETAILS_F] {80, 432, "NoticeDataDetails", mad_dump_array}, - [IB_NOTICE_DATA_LID_F] {BITSOFFS(80, 16), "NoticeDataLID", mad_dump_uint}, - [IB_NOTICE_DATA_144_LID_F] {BITSOFFS(96, 16), "NoticeDataTrap144LID", mad_dump_uint}, - [IB_NOTICE_DATA_144_CAPMASK_F] {BITSOFFS(128, 32), "NoticeDataTrap144CapMask", mad_dump_uint}, + {0, 64*8, "NodeDesc", mad_dump_string}, /* - * NodeDescription fields: + * Notice/Trap fields */ - [IB_NODE_DESC_F] {0, 64*8, "NodeDesc", mad_dump_string}, + {BITSOFFS(0, 1), "NoticeIsGeneric", mad_dump_uint}, + {BITSOFFS(1, 7), "NoticeType", mad_dump_uint}, + {BITSOFFS(8, 24), "NoticeProducerType", mad_dump_node_type}, + {BITSOFFS(32, 16), "NoticeTrapNumber", mad_dump_uint}, + {BITSOFFS(48, 16), "NoticeIssuerLID", mad_dump_uint}, + {BITSOFFS(64, 1), "NoticeToggle", mad_dump_uint}, + {BITSOFFS(65, 15), "NoticeCount", mad_dump_uint}, + {80, 432, "NoticeDataDetails", mad_dump_array}, + {BITSOFFS(80, 16), "NoticeDataLID", mad_dump_uint}, + {BITSOFFS(96, 16), "NoticeDataTrap144LID", mad_dump_uint}, + {BITSOFFS(128, 32), "NoticeDataTrap144CapMask", mad_dump_uint}, /* * Port counters */ - 
[IB_PC_PORT_SELECT_F] {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, - [IB_PC_COUNTER_SELECT_F] {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, - [IB_PC_ERR_SYM_F] {BITSOFFS(32, 16), "SymbolErrors", mad_dump_uint}, - [IB_PC_LINK_RECOVERS_F] {BITSOFFS(48, 8), "LinkRecovers", mad_dump_uint}, - [IB_PC_LINK_DOWNED_F] {BITSOFFS(56, 8), "LinkDowned", mad_dump_uint}, - [IB_PC_ERR_RCV_F] {BITSOFFS(64, 16), "RcvErrors", mad_dump_uint}, - [IB_PC_ERR_PHYSRCV_F] {BITSOFFS(80, 16), "RcvRemotePhysErrors", mad_dump_uint}, - [IB_PC_ERR_SWITCH_REL_F] {BITSOFFS(96, 16), "RcvSwRelayErrors", mad_dump_uint}, - [IB_PC_XMT_DISCARDS_F] {BITSOFFS(112, 16), "XmtDiscards", mad_dump_uint}, - [IB_PC_ERR_XMTCONSTR_F] {BITSOFFS(128, 8), "XmtConstraintErrors", mad_dump_uint}, - [IB_PC_ERR_RCVCONSTR_F] {BITSOFFS(136, 8), "RcvConstraintErrors", mad_dump_uint}, - [IB_PC_COUNTER_SELECT2_F] {BITSOFFS(144, 8), "CounterSelect2", mad_dump_uint}, - [IB_PC_ERR_LOCALINTEG_F] {BITSOFFS(152, 4), "LinkIntegrityErrors", mad_dump_uint}, - [IB_PC_ERR_EXCESS_OVR_F] {BITSOFFS(156, 4), "ExcBufOverrunErrors", mad_dump_uint}, - [IB_PC_VL15_DROPPED_F] {BITSOFFS(176, 16), "VL15Dropped", mad_dump_uint}, - [IB_PC_XMT_BYTES_F] {192, 32, "XmtData", mad_dump_uint}, - [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, - [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, - [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, - [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, + {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, + {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, + {BITSOFFS(32, 16), "SymbolErrors", mad_dump_uint}, + {BITSOFFS(48, 8), "LinkRecovers", mad_dump_uint}, + {BITSOFFS(56, 8), "LinkDowned", mad_dump_uint}, + {BITSOFFS(64, 16), "RcvErrors", mad_dump_uint}, + {BITSOFFS(80, 16), "RcvRemotePhysErrors", mad_dump_uint}, + {BITSOFFS(96, 16), "RcvSwRelayErrors", mad_dump_uint}, + {BITSOFFS(112, 16), "XmtDiscards", mad_dump_uint}, + {BITSOFFS(128, 8), "XmtConstraintErrors", mad_dump_uint}, + 
{BITSOFFS(136, 8), "RcvConstraintErrors", mad_dump_uint}, + {BITSOFFS(144, 8), "CounterSelect2", mad_dump_uint}, + {BITSOFFS(152, 4), "LinkIntegrityErrors", mad_dump_uint}, + {BITSOFFS(156, 4), "ExcBufOverrunErrors", mad_dump_uint}, + {BITSOFFS(176, 16), "VL15Dropped", mad_dump_uint}, + {192, 32, "XmtData", mad_dump_uint}, + {224, 32, "RcvData", mad_dump_uint}, + {256, 32, "XmtPkts", mad_dump_uint}, + {288, 32, "RcvPkts", mad_dump_uint}, + {320, 32, "XmtWait", mad_dump_uint}, + {0, 0}, /* IB_PC_LAST_F */ /* * SMInfo */ - [IB_SMINFO_GUID_F] {0, 64, "SmInfoGuid", mad_dump_hex}, - [IB_SMINFO_KEY_F] {64, 64, "SmInfoKey", mad_dump_hex}, - [IB_SMINFO_ACT_F] {128, 32, "SmActivity", mad_dump_uint}, - [IB_SMINFO_PRIO_F] {BITSOFFS(160, 4), "SmPriority", mad_dump_uint}, - [IB_SMINFO_STATE_F] {BITSOFFS(164, 4), "SmState", mad_dump_uint}, + {0, 64, "SmInfoGuid", mad_dump_hex}, + {64, 64, "SmInfoKey", mad_dump_hex}, + {128, 32, "SmActivity", mad_dump_uint}, + {BITSOFFS(160, 4), "SmPriority", mad_dump_uint}, + {BITSOFFS(164, 4), "SmState", mad_dump_uint}, /* * SA RMPP */ - [IB_SA_RMPP_VERS_F] {BE_OFFS(24*8+24, 8), "RmppVers", mad_dump_uint}, - [IB_SA_RMPP_TYPE_F] {BE_OFFS(24*8+16, 8), "RmppType", mad_dump_uint}, - [IB_SA_RMPP_RESP_F] {BE_OFFS(24*8+11, 5), "RmppResp", mad_dump_uint}, - [IB_SA_RMPP_FLAGS_F] {BE_OFFS(24*8+8, 3), "RmppFlags", mad_dump_hex}, - [IB_SA_RMPP_STATUS_F] {BE_OFFS(24*8+0, 8), "RmppStatus", mad_dump_hex}, + {BE_OFFS(24*8+24, 8), "RmppVers", mad_dump_uint}, + {BE_OFFS(24*8+16, 8), "RmppType", mad_dump_uint}, + {BE_OFFS(24*8+11, 5), "RmppResp", mad_dump_uint}, + {BE_OFFS(24*8+8, 3), "RmppFlags", mad_dump_hex}, + {BE_OFFS(24*8+0, 8), "RmppStatus", mad_dump_hex}, /* data1 */ - [IB_SA_RMPP_D1_F] {28*8, 32, "RmppData1", mad_dump_hex}, - [IB_SA_RMPP_SEGNUM_F] {28*8, 32, "RmppSegNum", mad_dump_uint}, + {28*8, 32, "RmppData1", mad_dump_hex}, + {28*8, 32, "RmppSegNum", mad_dump_uint}, /* data2 */ - [IB_SA_RMPP_D2_F] {32*8, 32, "RmppData2", mad_dump_hex}, - 
[IB_SA_RMPP_LEN_F] {32*8, 32, "RmppPayload", mad_dump_uint}, - [IB_SA_RMPP_NEWWIN_F] {32*8, 32, "RmppNewWin", mad_dump_uint}, - + {32*8, 32, "RmppData2", mad_dump_hex}, + {32*8, 32, "RmppPayload", mad_dump_uint}, + {32*8, 32, "RmppNewWin", mad_dump_uint}, + /* - * SA Path rec + * SA Get Multi Path */ - [IB_SA_PR_DGID_F] {64, 128, "PathRecDGid", mad_dump_array}, - [IB_SA_PR_SGID_F] {192, 128, "PathRecSGid", mad_dump_array}, - [IB_SA_PR_DLID_F] {BITSOFFS(320,16), "PathRecDLid", mad_dump_hex}, - [IB_SA_PR_SLID_F] {BITSOFFS(336,16), "PathRecSLid", mad_dump_hex}, - [IB_SA_PR_NPATH_F] {BITSOFFS(393,7), "PathRecNumPath", mad_dump_uint}, + {BITSOFFS(41,7), "MultiPathNumPath", mad_dump_uint}, + {BITSOFFS(120,8), "MultiPathNumSrc", mad_dump_uint}, + {BITSOFFS(128,8), "MultiPathNumDest", mad_dump_uint}, + {192, 128, "MultiPathGid", mad_dump_array}, /* - * SA Get Multi Path + * SA Path rec */ - [IB_SA_MP_NPATH_F] {BITSOFFS(41,7), "MultiPathNumPath", mad_dump_uint}, - [IB_SA_MP_NSRC_F] {BITSOFFS(120,8), "MultiPathNumSrc", mad_dump_uint}, - [IB_SA_MP_NDEST_F] {BITSOFFS(128,8), "MultiPathNumDest", mad_dump_uint}, - [IB_SA_MP_GID0_F] {192, 128, "MultiPathGid", mad_dump_array}, + {64, 128, "PathRecDGid", mad_dump_array}, + {192, 128, "PathRecSGid", mad_dump_array}, + {BITSOFFS(320,16), "PathRecDLid", mad_dump_hex}, + {BITSOFFS(336,16), "PathRecSLid", mad_dump_hex}, + {BITSOFFS(393,7), "PathRecNumPath", mad_dump_uint}, /* * MC Member rec */ - [IB_SA_MCM_MGID_F] {0, 128, "McastMemMGid", mad_dump_array}, - [IB_SA_MCM_PORTGID_F] {128, 128, "McastMemPortGid", mad_dump_array}, - [IB_SA_MCM_QKEY_F] {256, 32, "McastMemQkey", mad_dump_hex}, - [IB_SA_MCM_MLID_F] {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex}, - [IB_SA_MCM_MTU_F] {BITSOFFS(306, 6), "McastMemMTU", mad_dump_uint}, - [IB_SA_MCM_TCLASS_F] {BITSOFFS(312, 8), "McastMemTClass", mad_dump_uint}, - [IB_SA_MCM_PKEY_F] {BITSOFFS(320, 16), "McastMemPkey", mad_dump_uint}, - [IB_SA_MCM_RATE_F] {BITSOFFS(338, 6), "McastMemRate", 
mad_dump_uint}, - [IB_SA_MCM_SL_F] {BITSOFFS(352, 4), "McastMemSL", mad_dump_uint}, - [IB_SA_MCM_FLOW_LABEL_F] {BITSOFFS(356, 20), "McastMemFlowLbl", mad_dump_uint}, - [IB_SA_MCM_JOIN_STATE_F] {BITSOFFS(388, 4), "McastMemJoinState", mad_dump_uint}, - [IB_SA_MCM_PROXY_JOIN_F] {BITSOFFS(392, 1), "McastMemProxyJoin", mad_dump_uint}, + {0, 128, "McastMemMGid", mad_dump_array}, + {128, 128, "McastMemPortGid", mad_dump_array}, + {256, 32, "McastMemQkey", mad_dump_hex}, + {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex}, + {BITSOFFS(352, 4), "McastMemSL", mad_dump_uint}, + {BITSOFFS(306, 6), "McastMemMTU", mad_dump_uint}, + {BITSOFFS(338, 6), "McastMemRate", mad_dump_uint}, + {BITSOFFS(312, 8), "McastMemTClass", mad_dump_uint}, + {BITSOFFS(320, 16), "McastMemPkey", mad_dump_uint}, + {BITSOFFS(356, 20), "McastMemFlowLbl", mad_dump_uint}, + {BITSOFFS(388, 4), "McastMemJoinState", mad_dump_uint}, + {BITSOFFS(392, 1), "McastMemProxyJoin", mad_dump_uint}, /* * Service record */ - [IB_SA_SR_ID_F] {0, 64, "ServRecID", mad_dump_hex}, - [IB_SA_SR_GID_F] {64, 128, "ServRecGid", mad_dump_array}, - [IB_SA_SR_PKEY_F] {BITSOFFS(192, 16), "ServRecPkey", mad_dump_hex}, - [IB_SA_SR_LEASE_F] {224, 32, "ServRecLease", mad_dump_hex}, - [IB_SA_SR_KEY_F] {256, 128, "ServRecKey", mad_dump_hex}, - [IB_SA_SR_NAME_F] {384, 512, "ServRecName", mad_dump_string}, - [IB_SA_SR_DATA_F] {896, 512, "ServRecData", mad_dump_array}, /* ATS for example */ + {0, 64, "ServRecID", mad_dump_hex}, + {64, 128, "ServRecGid", mad_dump_array}, + {BITSOFFS(192, 16), "ServRecPkey", mad_dump_hex}, + {224, 32, "ServRecLease", mad_dump_hex}, + {256, 128, "ServRecKey", mad_dump_hex}, + {384, 512, "ServRecName", mad_dump_string}, + {896, 512, "ServRecData", mad_dump_array}, /* ATS for example */ /* * ATS SM record - within SA_SR_DATA */ - [IB_ATS_SM_NODE_ADDR_F] {12*8, 32, "ATSNodeAddr", mad_dump_hex}, - [IB_ATS_SM_MAGIC_KEY_F] {BITSOFFS(16*8, 16), "ATSMagicKey", mad_dump_hex}, - [IB_ATS_SM_NODE_TYPE_F] {BITSOFFS(18*8, 16), 
"ATSNodeType", mad_dump_hex}, - [IB_ATS_SM_NODE_NAME_F] {32*8, 32*8, "ATSNodeName", mad_dump_string}, + {12*8, 32, "ATSNodeAddr", mad_dump_hex}, + {BITSOFFS(16*8, 16), "ATSMagicKey", mad_dump_hex}, + {BITSOFFS(18*8, 16), "ATSNodeType", mad_dump_hex}, + {32*8, 32*8, "ATSNodeName", mad_dump_string}, /* * SLTOVL MAPPING TABLE */ - [IB_SLTOVL_MAPPING_TABLE_F] {0, 64, "SLToVLMap", mad_dump_hex}, + {0, 64, "SLToVLMap", mad_dump_hex}, /* * VL ARBITRATION TABLE */ - [IB_VL_ARBITRATION_TABLE_F] {0, 512, "VLArbTbl", mad_dump_array}, + {0, 512, "VLArbTbl", mad_dump_array}, /* * IB vendor classes range 2 */ - [IB_VEND2_OUI_F] {BE_OFFS(36*8, 24), "OUI", mad_dump_array}, - [IB_VEND2_DATA_F] {40*8, (256-40)*8, "Vendor2Data", mad_dump_array}, + {BE_OFFS(36*8, 24), "OUI", mad_dump_array}, + {40*8, (256-40)*8, "Vendor2Data", mad_dump_array}, /* * Extended port counters */ - [IB_PC_EXT_PORT_SELECT_F] {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, - [IB_PC_EXT_COUNTER_SELECT_F] {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, - [IB_PC_EXT_XMT_BYTES_F] {64, 64, "PortXmitData", mad_dump_uint}, - [IB_PC_EXT_RCV_BYTES_F] {128, 64, "PortRcvData", mad_dump_uint}, - [IB_PC_EXT_XMT_PKTS_F] {192, 64, "PortXmitPkts", mad_dump_uint}, - [IB_PC_EXT_RCV_PKTS_F] {256, 64, "PortRcvPkts", mad_dump_uint}, - [IB_PC_EXT_XMT_UPKTS_F] {320, 64, "PortUnicastXmitPkts", mad_dump_uint}, - [IB_PC_EXT_RCV_UPKTS_F] {384, 64, "PortUnicastRcvPkts", mad_dump_uint}, - [IB_PC_EXT_XMT_MPKTS_F] {448, 64, "PortMulticastXmitPkts", mad_dump_uint}, - [IB_PC_EXT_RCV_MPKTS_F] {512, 64, "PortMulticastRcvPkts", mad_dump_uint}, + {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, + {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, + {64, 64, "PortXmitData", mad_dump_uint}, + {128, 64, "PortRcvData", mad_dump_uint}, + {192, 64, "PortXmitPkts", mad_dump_uint}, + {256, 64, "PortRcvPkts", mad_dump_uint}, + {320, 64, "PortUnicastXmitPkts", mad_dump_uint}, + {384, 64, "PortUnicastRcvPkts", mad_dump_uint}, + {448, 64, 
"PortMulticastXmitPkts", mad_dump_uint}, + {512, 64, "PortMulticastRcvPkts", mad_dump_uint}, + {0, 0}, /* IB_PC_EXT_LAST_F */ /* * GUIDInfo fields */ - [IB_GUID_GUID0_F] {0, 64, "GUID0", mad_dump_hex}, + {0, 64, "GUID0", mad_dump_hex}, + {0, 0} /* IB_FIELD_LAST_ */ }; @@ -561,9 +576,12 @@ static char *_mad_dump_field(const ib_field_t *f, const char *name, char *buf, i static int _mad_dump(ib_mad_dump_fn *fn, const char *name, void *val, int valsz) { - ib_field_t f = { .def_dump_fn = fn, .bitlen = valsz * 8}; + ib_field_t f; char buf[512]; + f.def_dump_fn = fn; + f.bitlen = valsz * 8; + return printf("%s\n", _mad_dump_field(&f, name, buf, sizeof buf, val)); } -- 1.5.2.5 From arlin.r.davis at intel.com Sat Jan 17 11:51:19 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sat, 17 Jan 2009 11:51:19 -0800 Subject: [ofa-general] [PATCH 3/3 v2] Minor changes to allow portability to WinOF Message-ID: - cleanup unnecessary include files - ULL declaration on 64 bit constants - cast or change data types to fix build warnings on windows Signed-off-by: Arlin Davis --- libibmad/src/dump.c | 25 +++++++++---------------- libibmad/src/gs.c | 7 +------ libibmad/src/mad.c | 14 +++++--------- libibmad/src/portid.c | 9 +++------ libibmad/src/register.c | 3 +-- libibmad/src/resolve.c | 3 +-- libibmad/src/rpc.c | 9 ++++----- libibmad/src/sa.c | 1 - libibmad/src/serv.c | 4 +--- libibmad/src/smp.c | 1 - libibmad/src/vendor.c | 1 - 11 files changed, 25 insertions(+), 52 deletions(-) diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c index 38a2254..a7575b9 100644 --- a/libibmad/src/dump.c +++ b/libibmad/src/dump.c @@ -36,13 +36,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include -#include -#include -#include -#include -#include - #include void @@ -114,13 +107,13 @@ mad_dump_hex(char *buf, int bufsz, void *val, int valsz) snprintf(buf, bufsz, "0x%08x", *(uint32_t *)val); break; case 5: - snprintf(buf, bufsz, "0x%010" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffllu); + 
snprintf(buf, bufsz, "0x%010" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffULL); break; case 6: - snprintf(buf, bufsz, "0x%012" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffllu); + snprintf(buf, bufsz, "0x%012" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffULL); break; case 7: - snprintf(buf, bufsz, "0x%014" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffffllu); + snprintf(buf, bufsz, "0x%014" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffffULL); break; case 8: snprintf(buf, bufsz, "0x%016" PRIx64, *(uint64_t *)val); @@ -148,13 +141,13 @@ mad_dump_rhex(char *buf, int bufsz, void *val, int valsz) snprintf(buf, bufsz, "%08x", *(uint32_t *)val); break; case 5: - snprintf(buf, bufsz, "%010" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffllu); + snprintf(buf, bufsz, "%010" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffULL); break; case 6: - snprintf(buf, bufsz, "%012" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffllu); + snprintf(buf, bufsz, "%012" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffULL); break; case 7: - snprintf(buf, bufsz, "%014" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffffllu); + snprintf(buf, bufsz, "%014" PRIx64, *(uint64_t *)val & (uint64_t) 0xffffffffffffffULL); break; case 8: snprintf(buf, bufsz, "%016" PRIx64, *(uint64_t *)val); @@ -606,7 +599,7 @@ typedef struct _ib_vl_arb_table { uint8_t res_vl; uint8_t weight; } vl_entry[IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK]; -} __attribute__((packed)) ib_vl_arb_table_t; +} ib_vl_arb_table_t; static inline void ib_vl_arb_get_vl(uint8_t res_vl, uint8_t *const vl ) @@ -634,7 +627,7 @@ void mad_dump_vlarbitration(char *buf, int bufsz, void *val, int num) { ib_vl_arb_table_t* p_vla_tbl = val; - unsigned i, n; + int i, n; uint8_t vl; num /= sizeof(p_vla_tbl->vl_entry[0]); @@ -681,7 +674,7 @@ _dump_fields(char *buf, int bufsz, void *data, int start, int end) bufsz -= n; } - return s - buf; + return (int)(s - buf); } void diff --git a/libibmad/src/gs.c 
b/libibmad/src/gs.c index d350c0d..05ee132 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -35,13 +35,8 @@ # include #endif /* HAVE_CONFIG_H */ -#include -#include -#include -#include - -#include #include "mad.h" +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c index be27c09..4a12e6c 100644 --- a/libibmad/src/mad.c +++ b/libibmad/src/mad.c @@ -35,14 +35,10 @@ # include #endif /* HAVE_CONFIG_H */ -#include -#include -#include -#include #include -#include #include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN @@ -55,7 +51,7 @@ mad_trid(void) uint64_t next; if (!base) { - srandom(time(0)*getpid()); + srandom((int)time(0)*getpid()); base = random(); trid = random(); } @@ -97,8 +93,8 @@ mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data) mad_set_field(buf, 0, IB_MAD_ATTRMOD_F, rpc->attr.mod); /* words 7,8 */ - mad_set_field(buf, 0, IB_MAD_MKEY_F, rpc->mkey >> 32); - mad_set_field(buf, 4, IB_MAD_MKEY_F, rpc->mkey & 0xffffffff); + mad_set_field(buf, 0, IB_MAD_MKEY_F, (uint32_t)(rpc->mkey >> 32)); + mad_set_field(buf, 4, IB_MAD_MKEY_F, (uint32_t)(rpc->mkey & 0xffffffff)); if (rpc->mgtclass == IB_SMI_DIRECT_CLASS) { /* word 9 */ @@ -168,5 +164,5 @@ mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, mad_set_field(mad, 0, IB_SA_RMPP_D2_F, rmpp->d2.u); } - return p - mad; + return ((int)(p - mad)); } diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c index 61e6be0..9a69728 100644 --- a/libibmad/src/portid.c +++ b/libibmad/src/portid.c @@ -37,10 +37,7 @@ #include #include -#include #include -#include -#include #include @@ -94,7 +91,7 @@ str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid) while (str && *str) { if ((s = strchr(str, ','))) *s = 0; - path->p[++path->cnt] = atoi(str); + path->p[++path->cnt] = (uint8_t)atoi(str); if (!s) break; str = s+1; @@ -112,11 +109,11 @@ drpath2str(ib_dr_path_t *path, char *dstr, size_t dstr_size) int i = 0; 
int rc = snprintf(dstr, dstr_size, "slid %d; dlid %d; %d", path->drslid, path->drdlid, path->p[0]); - if (rc >= dstr_size) + if (rc >= (int)dstr_size) return dstr; for (i = 1; i <= path->cnt; i++) { rc += snprintf(dstr+rc, dstr_size-rc, ",%d", path->p[i]); - if (rc >= dstr_size) + if (rc >= (int)dstr_size) break; } return (dstr); diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 045f840..ea916da 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -37,12 +37,11 @@ #include #include -#include #include #include -#include #include "mad.h" +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 906b28d..2e373e6 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -37,11 +37,10 @@ #include #include -#include #include -#include #include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index 670a936..3bf3640 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -37,12 +37,11 @@ #include #include -#include #include #include -#include #include "mad.h" +#include #define MAX_CLASS 256 @@ -129,7 +128,7 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, save_mad = 0; } - trid = mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); + trid = (uint32_t)mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); for (retries = 0; retries < madrpc_retries; retries++) { if (retries) { @@ -298,7 +297,7 @@ madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, int num_classes) IBPANIC("too many classes %d requested", num_classes); while (num_classes--) { - int rmpp_version = 0; + uint8_t rmpp_version = 0; int mgmt = *mgmt_classes++; if (mgmt == IB_SA_CLASS) @@ -343,7 +342,7 @@ mad_rpc_open_port(char *dev_name, int dev_port, } while (num_classes--) { - int rmpp_version = 0; + uint8_t rmpp_version = 0; int mgmt = *mgmt_classes++; int agent; diff --git a/libibmad/src/sa.c 
b/libibmad/src/sa.c index c601254..c08a392 100644 --- a/libibmad/src/sa.c +++ b/libibmad/src/sa.c @@ -37,7 +37,6 @@ #include #include -#include #include #include diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index b329352..611a93f 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -37,12 +37,10 @@ #include #include -#include #include -#include -#include #include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c index ad6b066..e872602 100644 --- a/libibmad/src/smp.c +++ b/libibmad/src/smp.c @@ -37,7 +37,6 @@ #include #include -#include #include #include diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c index eb703f6..7928a58 100644 --- a/libibmad/src/vendor.c +++ b/libibmad/src/vendor.c @@ -37,7 +37,6 @@ #include #include -#include #include #include -- 1.5.2.5 From dotanba at gmail.com Sat Jan 17 22:51:00 2009 From: dotanba at gmail.com (Dotan Barak) Date: Sun, 18 Jan 2009 08:51:00 +0200 Subject: Re: [ofa-general] local DMA transfer? In-Reply-To: References: <4971C319.9050405@gmail.com> Message-ID: <2f3bf9a60901172251w5888664aof17b152de644f136@mail.gmail.com> On Sat, Jan 17, 2009 at 9:38 PM, Yicheng Jia wrote: > > Hi Dotan, > > Does HCA provide any internal route for local DMA so that local data > transfer doesn't has to go out of the HCA port as regular QPs do? In another > word, it's not efficient to use QPs for local DMA transfer, is it true? Yes, the HCA provides an internal route within the IB port (it is called internal loopback). The HCA recognizes that the packets should be routed to its own QP, so the packets don't go out through the IB port. I don't know if this is as efficient as memcpy, but I think that you can use it safely without worrying about performance or correctness issues.
Dotan From jackm at dev.mellanox.co.il Sun Jan 18 02:10:20 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 18 Jan 2009 12:10:20 +0200 Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing) In-Reply-To: References: <200812291223.11753.jackm@dev.mellanox.co.il> Message-ID: <200901181210.20806.jackm@dev.mellanox.co.il> On Friday 16 January 2009 22:02, Roland Dreier wrote: > OK, I think I'm going to merge my version of the patch. If there really > is a performance penalty I'd rather move the mlx transport stuff > out-of-line first rather than make the code too unreadble with gotos and > duplication etc. > Roland, I'll be doing the comparison test today (Sunday Jan 18). Please wait, and I'll have the results in a few hours. Thanks, and sorry for the delay. - Jack From vlad at lists.openfabrics.org Sun Jan 18 03:13:27 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 18 Jan 2009 03:13:27 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090118-0200 daily build status Message-ID: <20090118111327.59CE7E60F0E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with 
linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From jackm at dev.mellanox.co.il Sun Jan 18 09:08:36 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 18 Jan 2009 19:08:36 +0200 Subject: [ofa-general] Re: [PATCH] mlx4_ib: fix for bugzilla 1383 (LSO packet processing) In-Reply-To: References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il> Message-ID: <200901181908.36417.jackm@dev.mellanox.co.il> On Friday 16 January 2009 22:10, Roland Dreier wrote: > So I'll merge the patch with the wmb() there, and you can convince me to > get rid of it later if my reasoning is wrong. > We did performance testing on your version of the patch and on my version, and there was no statistically significant difference between the two versions (even on the main non-LSO flow of post send, which has an extra assignment). Let's use your version, then -- the code is much cleaner. Thanks for investing the time!
- Jack From vlad at lists.openfabrics.org Mon Jan 19 03:14:21 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 19 Jan 2009 03:14:21 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090119-0200 daily build status Message-ID: <20090119111422.23F01E609E2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From todd.rimmer at qlogic.com Mon Jan 19 06:39:41 2009 From: todd.rimmer at qlogic.com (Todd Rimmer) Date: Mon, 19 Jan 2009 08:39:41 -0600 Subject: [ofa-general] How can user app be notified of local SMA changes? In-Reply-To: <200901181908.36417.jackm@dev.mellanox.co.il> References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il> <200901181908.36417.jackm@dev.mellanox.co.il> Message-ID: <5AEC2602AE03EB46BFC16C6B9B200DA813496AF4CA@MNEXMB2.qlogic.org> Is there an OFED API so a user space application can register for callbacks from the SMA for events such as: - LID changed - SMLID changed - Subnet Prefix Changed - PKey table changed - Potentially other aspects of SMA changed Todd Rimmer Chief Architect QLogic Network Systems Group Voice: 610-233-4852 Fax: 610-233-4777 Todd.Rimmer at QLogic.com www.QLogic.com From hal.rosenstock at gmail.com Mon Jan 19 09:13:18 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 19 Jan 2009 12:13:18 -0500 Subject: [ofa-general] How can user app be notified of local SMA changes? In-Reply-To: <5AEC2602AE03EB46BFC16C6B9B200DA813496AF4CA@MNEXMB2.qlogic.org> References: <200812291223.11753.jackm@dev.mellanox.co.il> <200812301920.50336.jackm@dev.mellanox.co.il> <200901181908.36417.jackm@dev.mellanox.co.il> <5AEC2602AE03EB46BFC16C6B9B200DA813496AF4CA@MNEXMB2.qlogic.org> Message-ID: On Mon, Jan 19, 2009 at 9:39 AM, Todd Rimmer wrote: > Is there an OFED API so a user space application can register for callbacks from the SMA for events such as: > - LID changed > - SMLID changed > - Subnet Prefix Changed > - PKey table changed > - Potentially other aspects of SMA changed libibverbs supports a number of these events currently. 
Subnet prefix changed (and any others needed) could be added. -- Hal > Todd Rimmer > Chief Architect > QLogic Network Systems Group > Voice: 610-233-4852 Fax: 610-233-4777 > Todd.Rimmer at QLogic.com www.QLogic.com > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sean.hefty at intel.com Mon Jan 19 09:40:22 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 Jan 2009 09:40:22 -0800 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. In-Reply-To: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> Message-ID: <000101c97a5d$0186abc0$dd5b180a@amr.corp.intel.com> adding ofw mail list >Ok, here is revision 2 of the libibmad WinOF portability patches. I eliminated >all #ifdef _WIN32 >crap and tried to limit the changes by adding os dependent file mad_osd.h. With >these changes we >could share the same code base for both OFED and WinOF. Please review and >consider accepting this >patch set. > >[PATCH 1/3] libibmad: add os dependent definitions. >[PATCH 2/3] field.c remove c99 definitions, better portability with WinOF. >[PATCH 3/3] Minor changes to allow portability to WinOF > >infiniband/mad_osd.h added to provide support for os specific defintions >for portability. With these changes, WinOF can pull directly from OFED >git tree and share a common code base with minimal changes to mad.h and >source tree. > >mad.h modifications include MAD_EXPORT for export declarations where >appropriate. Datatype llu changed to ULL for 64bit constants. 
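[Editorial illustration, not part of the quoted patch.] The two mad.h changes summarized above (an export macro supplied by an OS-dependent header, and ULL-suffixed 64-bit constants) can be sketched in a few lines of C. This is only a minimal sketch under stated assumptions: DEMO_EXPORT, DEMO_SUBN_PREFIX, DEMO_MASK_5BYTES and demo_dump_hex5 are hypothetical names invented here, and the Windows branch of the macro is an assumption, since the patch shows only the Linux header, where MAD_EXPORT is defined as empty. Both llu and ULL spell unsigned long long in C99; the patch description gives WinOF portability as the reason for the switch.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for an OS-dependent header such as mad_osd.h.
 * The Linux header in the patch defines its export macro as empty; the
 * Windows definition below is an assumption made for illustration only.
 */
#if defined(_WIN32)
#define DEMO_EXPORT __declspec(dllexport)
#else
#define DEMO_EXPORT
#endif

/* 64-bit constants written with the ULL suffix, as in the patched mad.h. */
#define DEMO_SUBN_PREFIX 0xfe80000000000000ULL
#define DEMO_MASK_5BYTES 0xffffffffffULL

/*
 * Format the low 5 bytes of val as "0x" followed by 10 hex digits,
 * mirroring the masking style used for the 5-byte case in dump.c.
 * Returns whatever snprintf returns.
 */
DEMO_EXPORT int demo_dump_hex5(char *buf, size_t bufsz, uint64_t val)
{
	return snprintf(buf, bufsz, "0x%010" PRIx64, val & DEMO_MASK_5BYTES);
}
```

Calling demo_dump_hex5(buf, sizeof buf, 0x1122334455667788ULL) with a 32-byte buffer leaves the string 0x4455667788 in buf: the mask keeps the low five bytes, and the PRIx64 macro keeps the format specifier correct on platforms where uint64_t is not unsigned long.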
> >makefile.am modified to include new linux version of mad_osd.h > >Signed-off-by: Arlin Davis >--- > libibmad/Makefile.am | 7 +- > libibmad/include/infiniband/mad.h | 120 ++++++++++++++++----------------- > libibmad/include/infiniband/mad_osd.h | 48 +++++++++++++ > 3 files changed, 109 insertions(+), 66 deletions(-) > create mode 100644 libibmad/include/infiniband/mad_osd.h > >diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am >index beae1a4..8dea157 100644 >--- a/libibmad/Makefile.am >+++ b/libibmad/Makefile.am >@@ -1,7 +1,7 @@ > > SUBDIRS = . > >-INCLUDES = -I$(srcdir)/include/infiniband -I$(includedir) >+INCLUDES = -I$(srcdir)/include -I$(srcdir)/include/infiniband -I$(includedir) > > lib_LTLIBRARIES = libibmad.la > >@@ -23,9 +23,10 @@ libibmad_la_DEPENDENCIES = $(srcdir)/src/libibmad.map > > libibmadincludedir = $(includedir)/infiniband > >-libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h >+libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h >$(srcdir)/include/infiniband/mad_osd.h > >-EXTRA_DIST = $(srcdir)/include/infiniband/mad.h libibmad.spec.in libibmad.spec >\ >+EXTRA_DIST = $(srcdir)/include/infiniband/mad.h >$(srcdir)/include/infiniband/mad_osd.h \ >+ libibmad.spec.in libibmad.spec \ > $(srcdir)/src/libibmad.map libibmad.ver autogen.sh > > dist-hook: >diff --git a/libibmad/include/infiniband/mad.h >b/libibmad/include/infiniband/mad.h >index 0a962c0..fe607a7 100644 >--- a/libibmad/include/infiniband/mad.h >+++ b/libibmad/include/infiniband/mad.h >@@ -33,13 +33,7 @@ > #ifndef _MAD_H_ > #define _MAD_H_ > >-#include >-#include >-#include >-#include >-#include >-#include >-#include >+#include > > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { >@@ -52,7 +46,7 @@ > BEGIN_C_DECLS > > #define IB_SUBNET_PATH_HOPS_MAX 64 >-#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000llu >+#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000ULL > #define IB_DEFAULT_QP1_QKEY 0x80010000 > > #define IB_MAD_SIZE 256 >@@ -627,10 +621,10 @@ enum { 
> >/****************************************************************************** >/ > > /* portid.c */ >-char * portid2str(ib_portid_t *portid); >-int portid2portnum(ib_portid_t *portid); >-int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid); >-char * drpath2str(ib_dr_path_t *path, char *dstr, size_t dstr_size); >+MAD_EXPORT char * portid2str(ib_portid_t *portid); >+MAD_EXPORT int portid2portnum(ib_portid_t *portid); >+MAD_EXPORT int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, >int drdlid); >+MAD_EXPORT char * drpath2str(ib_dr_path_t *path, char *dstr, size_t >dstr_size); > > static inline int > ib_portid_set(ib_portid_t *portid, int lid, int qp, int qkey) >@@ -644,44 +638,44 @@ ib_portid_set(ib_portid_t *portid, int lid, int qp, int >qkey) > } > > /* fields.c */ >-uint32_t mad_get_field(void *buf, int base_offs, int field); >-void mad_set_field(void *buf, int base_offs, int field, uint32_t val); >+MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); >+MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, uint32_t >val); > /* field must be byte aligned */ >-uint64_t mad_get_field64(void *buf, int base_offs, int field); >-void mad_set_field64(void *buf, int base_offs, int field, uint64_t val); >-void mad_set_array(void *buf, int base_offs, int field, void *val); >-void mad_get_array(void *buf, int base_offs, int field, void *val); >-void mad_decode_field(uint8_t *buf, int field, void *val); >-void mad_encode_field(uint8_t *buf, int field, void *val); >-int mad_print_field(int field, const char *name, void *val); >-char *mad_dump_field(int field, char *buf, int bufsz, void *val); >-char *mad_dump_val(int field, char *buf, int bufsz, void *val); >+MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); >+MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, uint64_t >val); >+MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); 
>+MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); >+MAD_EXPORT void mad_decode_field(uint8_t *buf, int field, void *val); >+MAD_EXPORT void mad_encode_field(uint8_t *buf, int field, void *val); >+MAD_EXPORT int mad_print_field(int field, const char *name, void *val); >+MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); >+MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); > > /* mad.c */ >-void *mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data); >-uint64_t mad_trid(void); >-int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t >*rmpp, void *data); >+MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, >void *data); >+MAD_EXPORT uint64_t mad_trid(void); >+MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, >ib_rmpp_hdr_t *rmpp, >void *data); > > /* register.c */ >-int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); >-int mad_register_client(int mgmt, uint8_t rmpp_version); >-int mad_register_server(int mgmt, uint8_t rmpp_version, >+MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, uint8_t >rmpp_version); >+MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version); >+MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version, > long method_mask[16/sizeof(long)], > uint32_t class_oui); >-int mad_class_agent(int mgmt); >-int mad_agent_class(int agent); >+MAD_EXPORT int mad_class_agent(int mgmt); >+MAD_EXPORT int mad_agent_class(int agent); > > /* serv.c */ >-int mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, >+MAD_EXPORT int mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t >*rmpp, > void *data); >-void * mad_receive(void *umad, int timeout); >-int mad_respond(void *umad, ib_portid_t *portid, uint32_t rstatus); >-void * mad_alloc(void); >-void mad_free(void *umad); >+MAD_EXPORT void * mad_receive(void *umad, int timeout); 
>+MAD_EXPORT int mad_respond(void *umad, ib_portid_t *portid, uint32_t >rstatus); >+MAD_EXPORT void * mad_alloc(void); >+MAD_EXPORT void mad_free(void *umad); > > /* vendor.c */ >-uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, >- ib_vendor_call_t *call); >+MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, >+ ib_vendor_call_t *call); > > static inline int > mad_is_vendor_range1(int mgmt) >@@ -696,29 +690,29 @@ mad_is_vendor_range2(int mgmt) > } > > /* rpc.c */ >-int madrpc_portid(void); >-int madrpc_set_retries(int retries); >-int madrpc_set_timeout(int timeout); >-void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void >*rcvdata); >-void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, >+MAD_EXPORT int madrpc_portid(void); >+MAD_EXPORT int madrpc_set_retries(int retries); >+MAD_EXPORT int madrpc_set_timeout(int timeout); >+void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void >*rcvdata); >+void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > void *data); >-void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, >+MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > void madrpc_save_mad(void *madbuf, int len); >-void madrpc_show_errors(int set); >+MAD_EXPORT void madrpc_show_errors(int set); > >-void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, >+void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > void mad_rpc_close_port(void *ibmad_port); >-void * mad_rpc(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, >+void * mad_rpc(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > void *payload, void *rcvdata); >-void * mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t >*dport, >+void * mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > ib_rmpp_hdr_t *rmpp, void *data); > > /* smp.c */ >-uint8_t * 
smp_query(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, >+MAD_EXPORT uint8_t * smp_query(void *buf, ib_portid_t *id, unsigned attrid, >unsigned mod, > unsigned timeout); >-uint8_t * smp_set(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, >+MAD_EXPORT uint8_t * smp_set(void *buf, ib_portid_t *id, unsigned attrid, >unsigned mod, > unsigned timeout); > uint8_t * smp_query_via(void *buf, ib_portid_t *id, unsigned attrid, > unsigned mod, unsigned timeout, const void *srcport); >@@ -730,18 +724,18 @@ uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, >ib_sa_call_t *sa, > unsigned timeout); > uint8_t * sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t >*portid, > ib_sa_call_t *sa, unsigned timeout); >-int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t >*sm_id, >+MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, >ib_portid_t *sm_id, > void *buf); /* returns lid */ > int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf); > > /* resolve.c */ >-int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); >-int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, >+MAD_EXPORT int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); >+MAD_EXPORT int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, > ib_portid_t *sm_id, int timeout); >-int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, >+MAD_EXPORT int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, > int dest_type, ib_portid_t *sm_id); >-int ib_resolve_self(ib_portid_t *portid, int *portnum, ibmad_gid_t *gid); >+MAD_EXPORT int ib_resolve_self(ib_portid_t *portid, int *portnum, >ibmad_gid_t *gid); > > int ib_resolve_smlid_via(ib_portid_t *sm_id, int timeout, > const void *srcport); >@@ -755,19 +749,19 @@ int ib_resolve_self_via(ib_portid_t *portid, int >*portnum, ibmad_gid_t >*gid, > const void *srcport); > > /* gs.c */ >-uint8_t *perf_classportinfo_query(void *rcvbuf, 
ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t *dest, >int port, > unsigned timeout); >-uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, >int port, > unsigned timeout); >-uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, >int port, > unsigned mask, unsigned timeout); >-uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t >*dest, int port, > unsigned timeout); >-uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t >*dest, int port, > unsigned mask, unsigned timeout); >-uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t >*dest, int port, > unsigned timeout); >-uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, int port, >+MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, >int port, > unsigned timeout); > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t *dest, int >port, >@@ -785,7 +779,7 @@ uint8_t *port_samples_control_query_via(void *rcvbuf, >ib_portid_t *dest, int por > uint8_t *port_samples_result_query_via(void *rcvbuf, ib_portid_t *dest, int >port, > unsigned timeout, const void *srcport); > /* dump.c */ >-ib_mad_dump_fn >+MAD_EXPORT ib_mad_dump_fn > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > mad_dump_bitfield, mad_dump_array, mad_dump_string, > mad_dump_linkwidth, mad_dump_linkwidthsup, mad_dump_linkwidthen, >diff --git a/libibmad/include/infiniband/mad_osd.h >b/libibmad/include/infiniband/mad_osd.h >new file mode 100644 
>index 0000000..45741c5 >--- /dev/null >+++ b/libibmad/include/infiniband/mad_osd.h >@@ -0,0 +1,49 @@ >+/* >+ * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. >+ * Copyright (c) 2009 Intel Corporation All rights reserved. >+ * >+ * This software is available to you under a choice of one of two >+ * licenses. You may choose to be licensed under the terms of the GNU >+ * General Public License (GPL) Version 2, available from the file >+ * COPYING in the main directory of this source tree, or the >+ * OpenIB.org BSD license below: >+ * >+ * Redistribution and use in source and binary forms, with or >+ * without modification, are permitted provided that the following >+ * conditions are met: >+ * >+ * - Redistributions of source code must retain the above >+ * copyright notice, this list of conditions and the following >+ * disclaimer. >+ * >+ * - Redistributions in binary form must reproduce the above >+ * copyright notice, this list of conditions and the following >+ * disclaimer in the documentation and/or other materials >+ * provided with the distribution. >+ * >+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, >+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF >+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND >+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS >+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN >+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN >+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE >+ * SOFTWARE. 
>+ * >+ */ >+#ifndef _MAD_OSD_H_ >+#define _MAD_OSD_H_ >+ >+#include >+#include >+#include >+#include >+#include >+#include >+#include >+#include >+#include >+ >+#define MAD_EXPORT >+ >+#endif /* _MAD_OSD_H_ */ >-- >1.5.2.5 > > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Mon Jan 19 09:40:41 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 Jan 2009 09:40:41 -0800 Subject: [ofa-general] [PATCH 2/3 v2] field.c, remove c99 definitions, better portability with WinOF. In-Reply-To: References: Message-ID: <000201c97a5d$0cba6860$dd5b180a@amr.corp.intel.com> adding ofw mail list >- Remove c99 definitions in the ib_mad_f structure. >- Remove unnecessary include file >- _mad_dump: remove c99 structure initialization. > >Signed-off-by: Arlin Davis >--- > libibmad/src/fields.c | 430 +++++++++++++++++++++++++----------------------- > 1 files changed, 224 insertions(+), 206 deletions(-) > >diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c >index 5cebd01..17d0a03 100644 >--- a/libibmad/src/fields.c >+++ b/libibmad/src/fields.c >@@ -37,7 +37,6 @@ > > #include > #include >-#include > #include > > #include >@@ -54,10 +53,10 @@ > #define BE_TO_BITSOFFS(o, w) (((o) & ~31) | ((32 - ((o) & 31) - (w)))) > > static const ib_field_t ib_mad_f [] = { >- [0] {0, 0}, /* IB_NO_FIELD - reserved as invalid */ >+ {0, 0}, /* IB_NO_FIELD - reserved as invalid */ > >- [IB_GID_PREFIX_F] {0, 64, "GidPrefix", mad_dump_rhex}, >- [IB_GID_GUID_F] {64, 64, "GidGuid", mad_dump_rhex}, >+ {0, 64, "GidPrefix", mad_dump_rhex}, >+ {64, 64, "GidGuid", mad_dump_rhex}, > > /* > * MAD: common MAD fields (IB spec 13.4.2) >@@ -67,300 +66,316 @@ static const ib_field_t ib_mad_f [] = { > */ > > /* first MAD word (0-3 bytes) */ >- [IB_MAD_METHOD_F] {BE_OFFS(0, 7), 
"MadMethod", >mad_dump_hex}, /* TODO: add dumper */ >- [IB_MAD_RESPONSE_F] {BE_OFFS(7, 1), "MadIsResponse", >mad_dump_uint}, /* TODO: add dumper */ >- [IB_MAD_CLASSVER_F] {BE_OFFS(8, 8), "MadClassVersion", >mad_dump_uint}, >- [IB_MAD_MGMTCLASS_F] {BE_OFFS(16, 8), "MadMgmtClass", >mad_dump_uint}, /* TODO: add dumper */ >- [IB_MAD_BASEVER_F] {BE_OFFS(24, 8), "MadBaseVersion", >mad_dump_uint}, >+ {BE_OFFS(0, 7), "MadMethod", mad_dump_hex}, /* TODO: add dumper */ >+ {BE_OFFS(7, 1), "MadIsResponse", mad_dump_uint}, /* TODO: add dumper */ >+ {BE_OFFS(8, 8), "MadClassVersion", mad_dump_uint}, >+ {BE_OFFS(16, 8), "MadMgmtClass", mad_dump_uint}, /* TODO: add dumper >*/ >+ {BE_OFFS(24, 8), "MadBaseVersion", mad_dump_uint}, > > /* second MAD word (4-7 bytes) */ >- [IB_MAD_STATUS_F] {BE_OFFS(48, 16), "MadStatus", >mad_dump_hex}, /* TODO: add dumper */ >+ {BE_OFFS(48, 16), "MadStatus", mad_dump_hex}, /* TODO: add dumper */ > > /* DR SMP only */ >- [IB_DRSMP_HOPCNT_F] {BE_OFFS(32, 8), "DrSmpHopCnt", >mad_dump_uint}, >- [IB_DRSMP_HOPPTR_F] {BE_OFFS(40, 8), "DrSmpHopPtr", >mad_dump_uint}, >- [IB_DRSMP_STATUS_F] {BE_OFFS(48, 15), "DrSmpStatus", >mad_dump_hex}, /* TODO: add dumper */ >- [IB_DRSMP_DIRECTION_F] {BE_OFFS(63, 1), "DrSmpDirection", >mad_dump_uint}, /* TODO: add dumper */ >+ {BE_OFFS(32, 8), "DrSmpHopCnt", mad_dump_uint}, >+ {BE_OFFS(40, 8), "DrSmpHopPtr", mad_dump_uint}, >+ {BE_OFFS(48, 15), "DrSmpStatus", mad_dump_hex}, /* TODO: add dumper */ >+ {BE_OFFS(63, 1), "DrSmpDirection", mad_dump_uint}, /* TODO: add dumper >*/ > > /* words 3,4,5,6 (8-23 bytes) */ >- [IB_MAD_TRID_F] {64, 64, "MadTRID", mad_dump_hex}, >- [IB_MAD_ATTRID_F] {BE_OFFS(144, 16), "MadAttr", >mad_dump_hex}, /* TODO: add dumper */ >- [IB_MAD_ATTRMOD_F] {160, 32, "MadModifier", mad_dump_hex}, >/* TODO: add dumper */ >+ {64, 64, "MadTRID", mad_dump_hex}, >+ {BE_OFFS(144, 16), "MadAttr", mad_dump_hex}, /* TODO: add dumper */ >+ {160, 32, "MadModifier", mad_dump_hex}, /* TODO: add dumper */ > > /* word 7,8 
(24-31 bytes) */ >- [IB_MAD_MKEY_F] {196, 64, "MadMkey", mad_dump_hex}, >+ {196, 64, "MadMkey", mad_dump_hex}, > > /* word 9 (32-37 bytes) */ >- [IB_DRSMP_DRDLID_F] {BE_OFFS(256, 16), "DrSmpDLID", >mad_dump_hex}, >- [IB_DRSMP_DRSLID_F] {BE_OFFS(272, 16), "DrSmpSLID", >mad_dump_hex}, >+ {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, >+ {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex}, >+ >+ /* word 10,11 (36-43 bytes) */ >+ {0, 0}, /* IB_SA_MKEY_F - reserved as invalid */ > > /* word 12 (44-47 bytes) */ >- [IB_SA_ATTROFFS_F] {BE_OFFS(46*8, 16), "SaAttrOffs", >mad_dump_uint}, >+ {BE_OFFS(46*8, 16), "SaAttrOffs", mad_dump_uint}, > > /* word 13,14 (48-55 bytes) */ >- [IB_SA_COMPMASK_F] {48*8, 64, "SaCompMask", mad_dump_hex}, >+ {48*8, 64, "SaCompMask", mad_dump_hex}, > > /* word 13,14 (56-255 bytes) */ >- [IB_SA_DATA_F] {56*8, (256-56)*8, "SaData", >mad_dump_hex}, >- >- [IB_DRSMP_PATH_F] {1024, 512, "DrSmpPath", mad_dump_hex}, >- [IB_DRSMP_RPATH_F] {1536, 512, "DrSmpRetPath", >mad_dump_hex}, >- >- [IB_GS_DATA_F] {64*8, (256-64) * 8, "GsData", >mad_dump_hex}, >- >+ {56*8, (256-56)*8, "SaData", mad_dump_hex}, >+ >+ /* bytes 64 - 127 */ >+ {0, 0}, /* IB_SM_DATA_F - reserved as invalid */ >+ >+ /* bytes 64 - 256 */ >+ {64*8, (256-64) * 8, "GsData", mad_dump_hex}, >+ >+ /* bytes 128 - 191 */ >+ {1024, 512, "DrSmpPath", mad_dump_hex}, >+ >+ /* bytes 192 - 255 */ >+ {1536, 512, "DrSmpRetPath", mad_dump_hex}, >+ > /* > * PortInfo fields: > */ >- [IB_PORT_MKEY_F] {0, 64, "Mkey", mad_dump_hex}, >- [IB_PORT_GID_PREFIX_F] {64, 64, "GidPrefix", mad_dump_hex}, >- [IB_PORT_LID_F] {BITSOFFS(128, 16), "Lid", >mad_dump_hex}, >- [IB_PORT_SMLID_F] {BITSOFFS(144, 16), "SMLid", >mad_dump_hex}, >- [IB_PORT_CAPMASK_F] {160, 32, "CapMask", >mad_dump_portcapmask}, >- [IB_PORT_DIAG_F] {BITSOFFS(192, 16), "DiagCode", >mad_dump_hex}, >- [IB_PORT_MKEY_LEASE_F] {BITSOFFS(208, 16), "MkeyLeasePeriod", >mad_dump_uint}, >- [IB_PORT_LOCAL_PORT_F] {BITSOFFS(224, 8), "LocalPort", >mad_dump_uint}, >- 
[IB_PORT_LINK_WIDTH_ENABLED_F] {BITSOFFS(232, 8), "LinkWidthEnabled", >mad_dump_linkwidthen}, >- [IB_PORT_LINK_WIDTH_SUPPORTED_F] {BITSOFFS(240, 8), >"LinkWidthSupported", mad_dump_linkwidthsup}, >- [IB_PORT_LINK_WIDTH_ACTIVE_F] {BITSOFFS(248, 8), "LinkWidthActive", >mad_dump_linkwidth}, >- [IB_PORT_LINK_SPEED_SUPPORTED_F] {BITSOFFS(256, 4), >"LinkSpeedSupported", mad_dump_linkspeedsup}, >- [IB_PORT_STATE_F] {BITSOFFS(260, 4), "LinkState", >mad_dump_portstate}, >- [IB_PORT_PHYS_STATE_F] {BITSOFFS(264, 4), "PhysLinkState", >mad_dump_physportstate}, >- [IB_PORT_LINK_DOWN_DEF_F] {BITSOFFS(268, 4), "LinkDownDefState", >mad_dump_linkdowndefstate}, >- [IB_PORT_MKEY_PROT_BITS_F] {BITSOFFS(272, 2), "ProtectBits", >mad_dump_uint}, >- [IB_PORT_LMC_F] {BITSOFFS(277, 3), "LMC", >mad_dump_uint}, >- [IB_PORT_LINK_SPEED_ACTIVE_F] {BITSOFFS(280, 4), "LinkSpeedActive", >mad_dump_linkspeed}, >- [IB_PORT_LINK_SPEED_ENABLED_F] {BITSOFFS(284, 4), "LinkSpeedEnabled", >mad_dump_linkspeeden}, >- [IB_PORT_NEIGHBOR_MTU_F] {BITSOFFS(288, 4), "NeighborMTU", >mad_dump_mtu}, >- [IB_PORT_SMSL_F] {BITSOFFS(292, 4), "SMSL", >mad_dump_uint}, >- [IB_PORT_VL_CAP_F] {BITSOFFS(296, 4), "VLCap", >mad_dump_vlcap}, >- [IB_PORT_INIT_TYPE_F] {BITSOFFS(300, 4), "InitType", >mad_dump_hex}, >- [IB_PORT_VL_HIGH_LIMIT_F] {BITSOFFS(304, 8), "VLHighLimit", >mad_dump_uint}, >- [IB_PORT_VL_ARBITRATION_HIGH_CAP_F] {BITSOFFS(312, 8), >"VLArbHighCap", mad_dump_uint}, >- [IB_PORT_VL_ARBITRATION_LOW_CAP_F] {BITSOFFS(320, 8), >"VLArbLowCap", mad_dump_uint}, >- >- [IB_PORT_INIT_TYPE_REPLY_F] {BITSOFFS(328, 4), "InitReply", >mad_dump_hex}, >- [IB_PORT_MTU_CAP_F] {BITSOFFS(332, 4), "MtuCap", >mad_dump_mtu}, >- [IB_PORT_VL_STALL_COUNT_F] {BITSOFFS(336, 3), "VLStallCount", >mad_dump_uint}, >- [IB_PORT_HOQ_LIFE_F] {BITSOFFS(339, 5), "HoqLife", >mad_dump_uint}, >- [IB_PORT_OPER_VLS_F] {BITSOFFS(344, 4), "OperVLs", >mad_dump_opervls}, >- [IB_PORT_PART_EN_INB_F] {BITSOFFS(348, 1), "PartEnforceInb", >mad_dump_uint}, >- 
[IB_PORT_PART_EN_OUTB_F] {BITSOFFS(349, 1), "PartEnforceOutb", >mad_dump_uint}, >- [IB_PORT_FILTER_RAW_INB_F] {BITSOFFS(350, 1), "FilterRawInb", >mad_dump_uint}, >- [IB_PORT_FILTER_RAW_OUTB_F] {BITSOFFS(351, 1), "FilterRawOutb", >mad_dump_uint}, >- [IB_PORT_MKEY_VIOL_F] {BITSOFFS(352, 16), "MkeyViolations", >mad_dump_uint}, >- [IB_PORT_PKEY_VIOL_F] {BITSOFFS(368, 16), "PkeyViolations", >mad_dump_uint}, >- [IB_PORT_QKEY_VIOL_F] {BITSOFFS(384, 16), "QkeyViolations", >mad_dump_uint}, >- [IB_PORT_GUID_CAP_F] {BITSOFFS(400, 8), "GuidCap", >mad_dump_uint}, >- [IB_PORT_CLIENT_REREG_F] {BITSOFFS(408, 1), "ClientReregister", >mad_dump_uint}, >- [IB_PORT_SUBN_TIMEOUT_F] {BITSOFFS(411, 5), "SubnetTimeout", >mad_dump_uint}, >- [IB_PORT_RESP_TIME_VAL_F] {BITSOFFS(419, 5), "RespTimeVal", >mad_dump_uint}, >- [IB_PORT_LOCAL_PHYS_ERR_F] {BITSOFFS(424, 4), "LocalPhysErr", >mad_dump_uint}, >- [IB_PORT_OVERRUN_ERR_F] {BITSOFFS(428, 4), "OverrunErr", >mad_dump_uint}, >- [IB_PORT_MAX_CREDIT_HINT_F] {BITSOFFS(432, 16), "MaxCreditHint", >mad_dump_uint}, >- [IB_PORT_LINK_ROUND_TRIP_F] {BITSOFFS(456, 24), "RoundTrip", >mad_dump_uint}, >+ {0, 64, "Mkey", mad_dump_hex}, >+ {64, 64, "GidPrefix", mad_dump_hex}, >+ {BITSOFFS(128, 16), "Lid", mad_dump_hex}, >+ {BITSOFFS(144, 16), "SMLid", mad_dump_hex}, >+ {160, 32, "CapMask", mad_dump_portcapmask}, >+ {BITSOFFS(192, 16), "DiagCode", mad_dump_hex}, >+ {BITSOFFS(208, 16), "MkeyLeasePeriod", mad_dump_uint}, >+ {BITSOFFS(224, 8), "LocalPort", mad_dump_uint}, >+ {BITSOFFS(232, 8), "LinkWidthEnabled", mad_dump_linkwidthen}, >+ {BITSOFFS(240, 8), "LinkWidthSupported", mad_dump_linkwidthsup}, >+ {BITSOFFS(248, 8), "LinkWidthActive", mad_dump_linkwidth}, >+ {BITSOFFS(256, 4), "LinkSpeedSupported", mad_dump_linkspeedsup}, >+ {BITSOFFS(260, 4), "LinkState", mad_dump_portstate}, >+ {BITSOFFS(264, 4), "PhysLinkState", mad_dump_physportstate}, >+ {BITSOFFS(268, 4), "LinkDownDefState", mad_dump_linkdowndefstate}, >+ {BITSOFFS(272, 2), "ProtectBits", 
mad_dump_uint}, >+ {BITSOFFS(277, 3), "LMC", mad_dump_uint}, >+ {BITSOFFS(280, 4), "LinkSpeedActive", mad_dump_linkspeed}, >+ {BITSOFFS(284, 4), "LinkSpeedEnabled", mad_dump_linkspeeden}, >+ {BITSOFFS(288, 4), "NeighborMTU", mad_dump_mtu}, >+ {BITSOFFS(292, 4), "SMSL", mad_dump_uint}, >+ {BITSOFFS(296, 4), "VLCap", mad_dump_vlcap}, >+ {BITSOFFS(300, 4), "InitType", mad_dump_hex}, >+ {BITSOFFS(304, 8), "VLHighLimit", mad_dump_uint}, >+ {BITSOFFS(312, 8), "VLArbHighCap", mad_dump_uint}, >+ {BITSOFFS(320, 8), "VLArbLowCap", mad_dump_uint}, >+ {BITSOFFS(328, 4), "InitReply", mad_dump_hex}, >+ {BITSOFFS(332, 4), "MtuCap", mad_dump_mtu}, >+ {BITSOFFS(336, 3), "VLStallCount", mad_dump_uint}, >+ {BITSOFFS(339, 5), "HoqLife", mad_dump_uint}, >+ {BITSOFFS(344, 4), "OperVLs", mad_dump_opervls}, >+ {BITSOFFS(348, 1), "PartEnforceInb", mad_dump_uint}, >+ {BITSOFFS(349, 1), "PartEnforceOutb", mad_dump_uint}, >+ {BITSOFFS(350, 1), "FilterRawInb", mad_dump_uint}, >+ {BITSOFFS(351, 1), "FilterRawOutb", mad_dump_uint}, >+ {BITSOFFS(352, 16), "MkeyViolations", mad_dump_uint}, >+ {BITSOFFS(368, 16), "PkeyViolations", mad_dump_uint}, >+ {BITSOFFS(384, 16), "QkeyViolations", mad_dump_uint}, >+ {BITSOFFS(400, 8), "GuidCap", mad_dump_uint}, >+ {BITSOFFS(408, 1), "ClientReregister", mad_dump_uint}, >+ {BITSOFFS(411, 5), "SubnetTimeout", mad_dump_uint}, >+ {BITSOFFS(419, 5), "RespTimeVal", mad_dump_uint}, >+ {BITSOFFS(424, 4), "LocalPhysErr", mad_dump_uint}, >+ {BITSOFFS(428, 4), "OverrunErr", mad_dump_uint}, >+ {BITSOFFS(432, 16), "MaxCreditHint", mad_dump_uint}, >+ {BITSOFFS(456, 24), "RoundTrip", mad_dump_uint}, >+ {0, 0}, /* IB_PORT_LAST_F */ > > /* > * NodeInfo fields: > */ >- [IB_NODE_BASE_VERS_F] {BITSOFFS(0,8), "BaseVers", >mad_dump_uint}, >- [IB_NODE_CLASS_VERS_F] {BITSOFFS(8,8), "ClassVers", >mad_dump_uint}, >- [IB_NODE_TYPE_F] {BITSOFFS(16,8), "NodeType", >mad_dump_node_type}, >- [IB_NODE_NPORTS_F] {BITSOFFS(24,8), "NumPorts", >mad_dump_uint}, >- [IB_NODE_SYSTEM_GUID_F] {32, 64, 
"SystemGuid", mad_dump_hex}, >- [IB_NODE_GUID_F] {96, 64, "Guid", mad_dump_hex}, >- [IB_NODE_PORT_GUID_F] {160, 64, "PortGuid", mad_dump_hex}, >- [IB_NODE_PARTITION_CAP_F] {BITSOFFS(224,16), "PartCap", >mad_dump_uint}, >- [IB_NODE_DEVID_F] {BITSOFFS(240,16), "DevId", >mad_dump_hex}, >- [IB_NODE_REVISION_F] {256, 32, "Revision", mad_dump_hex}, >- [IB_NODE_LOCAL_PORT_F] {BITSOFFS(288,8), "LocalPort", >mad_dump_uint}, >- [IB_NODE_VENDORID_F] {BITSOFFS(296,24), "VendorId", >mad_dump_hex}, >+ {BITSOFFS(0,8), "BaseVers", mad_dump_uint}, >+ {BITSOFFS(8,8), "ClassVers", mad_dump_uint}, >+ {BITSOFFS(16,8), "NodeType", mad_dump_node_type}, >+ {BITSOFFS(24,8), "NumPorts", mad_dump_uint}, >+ {32, 64, "SystemGuid", mad_dump_hex}, >+ {96, 64, "Guid", mad_dump_hex}, >+ {160, 64, "PortGuid", mad_dump_hex}, >+ {BITSOFFS(224,16), "PartCap", mad_dump_uint}, >+ {BITSOFFS(240,16), "DevId", mad_dump_hex}, >+ {256, 32, "Revision", mad_dump_hex}, >+ {BITSOFFS(288,8), "LocalPort", mad_dump_uint}, >+ {BITSOFFS(296,24), "VendorId", mad_dump_hex}, >+ {0, 0}, /* IB_NODE_LAST_F */ >+ > > /* > * SwitchInfo fields: > */ >- [IB_SW_LINEAR_FDB_CAP_F] {BITSOFFS(0, 16), "LinearFdbCap", >mad_dump_uint}, >- [IB_SW_RANDOM_FDB_CAP_F] {BITSOFFS(16, 16), "RandomFdbCap", >mad_dump_uint}, >- [IB_SW_MCAST_FDB_CAP_F] {BITSOFFS(32, 16), "McastFdbCap", >mad_dump_uint}, >- [IB_SW_LINEAR_FDB_TOP_F] {BITSOFFS(48, 16), "LinearFdbTop", >mad_dump_uint}, >- [IB_SW_DEF_PORT_F] {BITSOFFS(64, 8), "DefPort", >mad_dump_uint}, >- [IB_SW_DEF_MCAST_PRIM_F] {BITSOFFS(72, 8), "DefMcastPrimPort", >mad_dump_uint}, >- [IB_SW_DEF_MCAST_NOT_PRIM_F] {BITSOFFS(80, 8), >"DefMcastNotPrimPort", mad_dump_uint}, >- [IB_SW_LIFE_TIME_F] {BITSOFFS(88, 5), "LifeTime", >mad_dump_uint}, >- [IB_SW_STATE_CHANGE_F] {BITSOFFS(93, 1), "StateChange", >mad_dump_uint}, >- [IB_SW_LIDS_PER_PORT_F] {BITSOFFS(96,16), "LidsPerPort", >mad_dump_uint}, >- [IB_SW_PARTITION_ENFORCE_CAP_F] {BITSOFFS(112, 16), "PartEnforceCap", >mad_dump_uint}, >- 
[IB_SW_PARTITION_ENF_INB_F] {BITSOFFS(128, 1), "InboundPartEnf", >mad_dump_uint}, >- [IB_SW_PARTITION_ENF_OUTB_F] {BITSOFFS(129, 1), "OutboundPartEnf", >mad_dump_uint}, >- [IB_SW_FILTER_RAW_INB_F] {BITSOFFS(130, 1), "FilterRawInbound", >mad_dump_uint}, >- [IB_SW_FILTER_RAW_OUTB_F] {BITSOFFS(131, 1), "FilterRawOutbound", >mad_dump_uint}, >- [IB_SW_ENHANCED_PORT0_F] {BITSOFFS(132, 1), "EnhancedPort0", >mad_dump_uint}, >+ {BITSOFFS(0, 16), "LinearFdbCap", mad_dump_uint}, >+ {BITSOFFS(16, 16), "RandomFdbCap", mad_dump_uint}, >+ {BITSOFFS(32, 16), "McastFdbCap", mad_dump_uint}, >+ {BITSOFFS(48, 16), "LinearFdbTop", mad_dump_uint}, >+ {BITSOFFS(64, 8), "DefPort", mad_dump_uint}, >+ {BITSOFFS(72, 8), "DefMcastPrimPort", mad_dump_uint}, >+ {BITSOFFS(80, 8), "DefMcastNotPrimPort", mad_dump_uint}, >+ {BITSOFFS(88, 5), "LifeTime", mad_dump_uint}, >+ {BITSOFFS(93, 1), "StateChange", mad_dump_uint}, >+ {BITSOFFS(96,16), "LidsPerPort", mad_dump_uint}, >+ {BITSOFFS(112, 16), "PartEnforceCap", mad_dump_uint}, >+ {BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint}, >+ {BITSOFFS(129, 1), "OutboundPartEnf", mad_dump_uint}, >+ {BITSOFFS(130, 1), "FilterRawInbound", mad_dump_uint}, >+ {BITSOFFS(131, 1), "FilterRawOutbound", mad_dump_uint}, >+ {BITSOFFS(132, 1), "EnhancedPort0", mad_dump_uint}, >+ {0, 0}, /* IB_SW_LAST_F */ > > /* > * SwitchLinearForwardingTable fields: > */ >- [IB_LINEAR_FORW_TBL_F] {0, 512, "LinearForwTbl", >mad_dump_array}, >+ {0, 512, "LinearForwTbl", mad_dump_array}, > > /* > * SwitchMulticastForwardingTable fields: > */ >- [IB_MULTICAST_FORW_TBL_F] {0, 512, "MulticastForwTbl", >mad_dump_array}, >+ {0, 512, "MulticastForwTbl", mad_dump_array}, > > /* >- * Notice/Trap fields >+ * NodeDescription fields: > */ >- [IB_NOTICE_IS_GENERIC_F] {BITSOFFS(0, 1), "NoticeIsGeneric", >mad_dump_uint}, >- [IB_NOTICE_TYPE_F] {BITSOFFS(1, 7), "NoticeType", >mad_dump_uint}, >- [IB_NOTICE_PRODUCER_F] {BITSOFFS(8, 24), "NoticeProducerType", >mad_dump_node_type}, >- 
[IB_NOTICE_TRAP_NUMBER_F] {BITSOFFS(32, 16), "NoticeTrapNumber", >mad_dump_uint}, >- [IB_NOTICE_ISSUER_LID_F] {BITSOFFS(48, 16), "NoticeIssuerLID", >mad_dump_uint}, >- [IB_NOTICE_TOGGLE_F] {BITSOFFS(64, 1), "NoticeToggle", >mad_dump_uint}, >- [IB_NOTICE_COUNT_F] {BITSOFFS(65, 15), "NoticeCount", >mad_dump_uint}, >- [IB_NOTICE_DATA_DETAILS_F] {80, 432, "NoticeDataDetails", >mad_dump_array}, >- [IB_NOTICE_DATA_LID_F] {BITSOFFS(80, 16), "NoticeDataLID", >mad_dump_uint}, >- [IB_NOTICE_DATA_144_LID_F] {BITSOFFS(96, 16), >"NoticeDataTrap144LID", mad_dump_uint}, >- [IB_NOTICE_DATA_144_CAPMASK_F] {BITSOFFS(128, 32), >"NoticeDataTrap144CapMask", mad_dump_uint}, >+ {0, 64*8, "NodeDesc", mad_dump_string}, > > /* >- * NodeDescription fields: >+ * Notice/Trap fields > */ >- [IB_NODE_DESC_F] {0, 64*8, "NodeDesc", mad_dump_string}, >+ {BITSOFFS(0, 1), "NoticeIsGeneric", mad_dump_uint}, >+ {BITSOFFS(1, 7), "NoticeType", mad_dump_uint}, >+ {BITSOFFS(8, 24), "NoticeProducerType", mad_dump_node_type}, >+ {BITSOFFS(32, 16), "NoticeTrapNumber", mad_dump_uint}, >+ {BITSOFFS(48, 16), "NoticeIssuerLID", mad_dump_uint}, >+ {BITSOFFS(64, 1), "NoticeToggle", mad_dump_uint}, >+ {BITSOFFS(65, 15), "NoticeCount", mad_dump_uint}, >+ {80, 432, "NoticeDataDetails", mad_dump_array}, >+ {BITSOFFS(80, 16), "NoticeDataLID", mad_dump_uint}, >+ {BITSOFFS(96, 16), "NoticeDataTrap144LID", mad_dump_uint}, >+ {BITSOFFS(128, 32), "NoticeDataTrap144CapMask", mad_dump_uint}, > > /* > * Port counters > */ >- [IB_PC_PORT_SELECT_F] {BITSOFFS(8, 8), "PortSelect", >mad_dump_uint}, >- [IB_PC_COUNTER_SELECT_F] {BITSOFFS(16, 16), "CounterSelect", >mad_dump_hex}, >- [IB_PC_ERR_SYM_F] {BITSOFFS(32, 16), "SymbolErrors", >mad_dump_uint}, >- [IB_PC_LINK_RECOVERS_F] {BITSOFFS(48, 8), "LinkRecovers", >mad_dump_uint}, >- [IB_PC_LINK_DOWNED_F] {BITSOFFS(56, 8), "LinkDowned", >mad_dump_uint}, >- [IB_PC_ERR_RCV_F] {BITSOFFS(64, 16), "RcvErrors", >mad_dump_uint}, >- [IB_PC_ERR_PHYSRCV_F] {BITSOFFS(80, 16), >"RcvRemotePhysErrors", 
mad_dump_uint}, >- [IB_PC_ERR_SWITCH_REL_F] {BITSOFFS(96, 16), "RcvSwRelayErrors", >mad_dump_uint}, >- [IB_PC_XMT_DISCARDS_F] {BITSOFFS(112, 16), "XmtDiscards", >mad_dump_uint}, >- [IB_PC_ERR_XMTCONSTR_F] {BITSOFFS(128, 8), >"XmtConstraintErrors", mad_dump_uint}, >- [IB_PC_ERR_RCVCONSTR_F] {BITSOFFS(136, 8), >"RcvConstraintErrors", mad_dump_uint}, >- [IB_PC_COUNTER_SELECT2_F] {BITSOFFS(144, 8), "CounterSelect2", >mad_dump_uint}, >- [IB_PC_ERR_LOCALINTEG_F] {BITSOFFS(152, 4), >"LinkIntegrityErrors", mad_dump_uint}, >- [IB_PC_ERR_EXCESS_OVR_F] {BITSOFFS(156, 4), >"ExcBufOverrunErrors", mad_dump_uint}, >- [IB_PC_VL15_DROPPED_F] {BITSOFFS(176, 16), "VL15Dropped", >mad_dump_uint}, >- [IB_PC_XMT_BYTES_F] {192, 32, "XmtData", mad_dump_uint}, >- [IB_PC_RCV_BYTES_F] {224, 32, "RcvData", mad_dump_uint}, >- [IB_PC_XMT_PKTS_F] {256, 32, "XmtPkts", mad_dump_uint}, >- [IB_PC_RCV_PKTS_F] {288, 32, "RcvPkts", mad_dump_uint}, >- [IB_PC_XMT_WAIT_F] {320, 32, "XmtWait", mad_dump_uint}, >+ {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, >+ {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, >+ {BITSOFFS(32, 16), "SymbolErrors", mad_dump_uint}, >+ {BITSOFFS(48, 8), "LinkRecovers", mad_dump_uint}, >+ {BITSOFFS(56, 8), "LinkDowned", mad_dump_uint}, >+ {BITSOFFS(64, 16), "RcvErrors", mad_dump_uint}, >+ {BITSOFFS(80, 16), "RcvRemotePhysErrors", mad_dump_uint}, >+ {BITSOFFS(96, 16), "RcvSwRelayErrors", mad_dump_uint}, >+ {BITSOFFS(112, 16), "XmtDiscards", mad_dump_uint}, >+ {BITSOFFS(128, 8), "XmtConstraintErrors", mad_dump_uint}, >+ {BITSOFFS(136, 8), "RcvConstraintErrors", mad_dump_uint}, >+ {BITSOFFS(144, 8), "CounterSelect2", mad_dump_uint}, >+ {BITSOFFS(152, 4), "LinkIntegrityErrors", mad_dump_uint}, >+ {BITSOFFS(156, 4), "ExcBufOverrunErrors", mad_dump_uint}, >+ {BITSOFFS(176, 16), "VL15Dropped", mad_dump_uint}, >+ {192, 32, "XmtData", mad_dump_uint}, >+ {224, 32, "RcvData", mad_dump_uint}, >+ {256, 32, "XmtPkts", mad_dump_uint}, >+ {288, 32, "RcvPkts", mad_dump_uint}, >+ {320, 32, 
"XmtWait", mad_dump_uint}, >+ {0, 0}, /* IB_PC_LAST_F */ > > /* > * SMInfo > */ >- [IB_SMINFO_GUID_F] {0, 64, "SmInfoGuid", mad_dump_hex}, >- [IB_SMINFO_KEY_F] {64, 64, "SmInfoKey", mad_dump_hex}, >- [IB_SMINFO_ACT_F] {128, 32, "SmActivity", mad_dump_uint}, >- [IB_SMINFO_PRIO_F] {BITSOFFS(160, 4), "SmPriority", >mad_dump_uint}, >- [IB_SMINFO_STATE_F] {BITSOFFS(164, 4), "SmState", >mad_dump_uint}, >+ {0, 64, "SmInfoGuid", mad_dump_hex}, >+ {64, 64, "SmInfoKey", mad_dump_hex}, >+ {128, 32, "SmActivity", mad_dump_uint}, >+ {BITSOFFS(160, 4), "SmPriority", mad_dump_uint}, >+ {BITSOFFS(164, 4), "SmState", mad_dump_uint}, > > /* > * SA RMPP > */ >- [IB_SA_RMPP_VERS_F] {BE_OFFS(24*8+24, 8), "RmppVers", >mad_dump_uint}, >- [IB_SA_RMPP_TYPE_F] {BE_OFFS(24*8+16, 8), "RmppType", >mad_dump_uint}, >- [IB_SA_RMPP_RESP_F] {BE_OFFS(24*8+11, 5), "RmppResp", >mad_dump_uint}, >- [IB_SA_RMPP_FLAGS_F] {BE_OFFS(24*8+8, 3), "RmppFlags", >mad_dump_hex}, >- [IB_SA_RMPP_STATUS_F] {BE_OFFS(24*8+0, 8), "RmppStatus", >mad_dump_hex}, >+ {BE_OFFS(24*8+24, 8), "RmppVers", mad_dump_uint}, >+ {BE_OFFS(24*8+16, 8), "RmppType", mad_dump_uint}, >+ {BE_OFFS(24*8+11, 5), "RmppResp", mad_dump_uint}, >+ {BE_OFFS(24*8+8, 3), "RmppFlags", mad_dump_hex}, >+ {BE_OFFS(24*8+0, 8), "RmppStatus", mad_dump_hex}, > > /* data1 */ >- [IB_SA_RMPP_D1_F] {28*8, 32, "RmppData1", mad_dump_hex}, >- [IB_SA_RMPP_SEGNUM_F] {28*8, 32, "RmppSegNum", >mad_dump_uint}, >+ {28*8, 32, "RmppData1", mad_dump_hex}, >+ {28*8, 32, "RmppSegNum", mad_dump_uint}, > /* data2 */ >- [IB_SA_RMPP_D2_F] {32*8, 32, "RmppData2", mad_dump_hex}, >- [IB_SA_RMPP_LEN_F] {32*8, 32, "RmppPayload", >mad_dump_uint}, >- [IB_SA_RMPP_NEWWIN_F] {32*8, 32, "RmppNewWin", >mad_dump_uint}, >- >+ {32*8, 32, "RmppData2", mad_dump_hex}, >+ {32*8, 32, "RmppPayload", mad_dump_uint}, >+ {32*8, 32, "RmppNewWin", mad_dump_uint}, >+ > /* >- * SA Path rec >+ * SA Get Multi Path > */ >- [IB_SA_PR_DGID_F] {64, 128, "PathRecDGid", >mad_dump_array}, >- [IB_SA_PR_SGID_F] {192, 
128, "PathRecSGid", >mad_dump_array}, >- [IB_SA_PR_DLID_F] {BITSOFFS(320,16), "PathRecDLid", >mad_dump_hex}, >- [IB_SA_PR_SLID_F] {BITSOFFS(336,16), "PathRecSLid", >mad_dump_hex}, >- [IB_SA_PR_NPATH_F] {BITSOFFS(393,7), "PathRecNumPath", >mad_dump_uint}, >+ {BITSOFFS(41,7), "MultiPathNumPath", mad_dump_uint}, >+ {BITSOFFS(120,8), "MultiPathNumSrc", mad_dump_uint}, >+ {BITSOFFS(128,8), "MultiPathNumDest", mad_dump_uint}, >+ {192, 128, "MultiPathGid", mad_dump_array}, > > /* >- * SA Get Multi Path >+ * SA Path rec > */ >- [IB_SA_MP_NPATH_F] {BITSOFFS(41,7), "MultiPathNumPath", >mad_dump_uint}, >- [IB_SA_MP_NSRC_F] {BITSOFFS(120,8), "MultiPathNumSrc", >mad_dump_uint}, >- [IB_SA_MP_NDEST_F] {BITSOFFS(128,8), "MultiPathNumDest", >mad_dump_uint}, >- [IB_SA_MP_GID0_F] {192, 128, "MultiPathGid", >mad_dump_array}, >+ {64, 128, "PathRecDGid", mad_dump_array}, >+ {192, 128, "PathRecSGid", mad_dump_array}, >+ {BITSOFFS(320,16), "PathRecDLid", mad_dump_hex}, >+ {BITSOFFS(336,16), "PathRecSLid", mad_dump_hex}, >+ {BITSOFFS(393,7), "PathRecNumPath", mad_dump_uint}, > > /* > * MC Member rec > */ >- [IB_SA_MCM_MGID_F] {0, 128, "McastMemMGid", >mad_dump_array}, >- [IB_SA_MCM_PORTGID_F] {128, 128, "McastMemPortGid", >mad_dump_array}, >- [IB_SA_MCM_QKEY_F] {256, 32, "McastMemQkey", >mad_dump_hex}, >- [IB_SA_MCM_MLID_F] {BITSOFFS(288, 16), "McastMemMLid", >mad_dump_hex}, >- [IB_SA_MCM_MTU_F] {BITSOFFS(306, 6), "McastMemMTU", >mad_dump_uint}, >- [IB_SA_MCM_TCLASS_F] {BITSOFFS(312, 8), "McastMemTClass", >mad_dump_uint}, >- [IB_SA_MCM_PKEY_F] {BITSOFFS(320, 16), "McastMemPkey", >mad_dump_uint}, >- [IB_SA_MCM_RATE_F] {BITSOFFS(338, 6), "McastMemRate", >mad_dump_uint}, >- [IB_SA_MCM_SL_F] {BITSOFFS(352, 4), "McastMemSL", >mad_dump_uint}, >- [IB_SA_MCM_FLOW_LABEL_F] {BITSOFFS(356, 20), "McastMemFlowLbl", >mad_dump_uint}, >- [IB_SA_MCM_JOIN_STATE_F] {BITSOFFS(388, 4), "McastMemJoinState", >mad_dump_uint}, >- [IB_SA_MCM_PROXY_JOIN_F] {BITSOFFS(392, 1), "McastMemProxyJoin", >mad_dump_uint}, >+ 
{0, 128, "McastMemMGid", mad_dump_array}, >+ {128, 128, "McastMemPortGid", mad_dump_array}, >+ {256, 32, "McastMemQkey", mad_dump_hex}, >+ {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex}, >+ {BITSOFFS(352, 4), "McastMemSL", mad_dump_uint}, >+ {BITSOFFS(306, 6), "McastMemMTU", mad_dump_uint}, >+ {BITSOFFS(338, 6), "McastMemRate", mad_dump_uint}, >+ {BITSOFFS(312, 8), "McastMemTClass", mad_dump_uint}, >+ {BITSOFFS(320, 16), "McastMemPkey", mad_dump_uint}, >+ {BITSOFFS(356, 20), "McastMemFlowLbl", mad_dump_uint}, >+ {BITSOFFS(388, 4), "McastMemJoinState", mad_dump_uint}, >+ {BITSOFFS(392, 1), "McastMemProxyJoin", mad_dump_uint}, > > /* > * Service record > */ >- [IB_SA_SR_ID_F] {0, 64, "ServRecID", mad_dump_hex}, >- [IB_SA_SR_GID_F] {64, 128, "ServRecGid", >mad_dump_array}, >- [IB_SA_SR_PKEY_F] {BITSOFFS(192, 16), "ServRecPkey", >mad_dump_hex}, >- [IB_SA_SR_LEASE_F] {224, 32, "ServRecLease", >mad_dump_hex}, >- [IB_SA_SR_KEY_F] {256, 128, "ServRecKey", mad_dump_hex}, >- [IB_SA_SR_NAME_F] {384, 512, "ServRecName", >mad_dump_string}, >- [IB_SA_SR_DATA_F] {896, 512, "ServRecData", >mad_dump_array}, /* ATS for example */ >+ {0, 64, "ServRecID", mad_dump_hex}, >+ {64, 128, "ServRecGid", mad_dump_array}, >+ {BITSOFFS(192, 16), "ServRecPkey", mad_dump_hex}, >+ {224, 32, "ServRecLease", mad_dump_hex}, >+ {256, 128, "ServRecKey", mad_dump_hex}, >+ {384, 512, "ServRecName", mad_dump_string}, >+ {896, 512, "ServRecData", mad_dump_array}, /* ATS for example */ > > /* > * ATS SM record - within SA_SR_DATA > */ >- [IB_ATS_SM_NODE_ADDR_F] {12*8, 32, "ATSNodeAddr", >mad_dump_hex}, >- [IB_ATS_SM_MAGIC_KEY_F] {BITSOFFS(16*8, 16), "ATSMagicKey", >mad_dump_hex}, >- [IB_ATS_SM_NODE_TYPE_F] {BITSOFFS(18*8, 16), "ATSNodeType", >mad_dump_hex}, >- [IB_ATS_SM_NODE_NAME_F] {32*8, 32*8, "ATSNodeName", >mad_dump_string}, >+ {12*8, 32, "ATSNodeAddr", mad_dump_hex}, >+ {BITSOFFS(16*8, 16), "ATSMagicKey", mad_dump_hex}, >+ {BITSOFFS(18*8, 16), "ATSNodeType", mad_dump_hex}, >+ {32*8, 32*8, 
"ATSNodeName", mad_dump_string}, > > /* > * SLTOVL MAPPING TABLE > */ >- [IB_SLTOVL_MAPPING_TABLE_F] {0, 64, "SLToVLMap", mad_dump_hex}, >+ {0, 64, "SLToVLMap", mad_dump_hex}, > > /* > * VL ARBITRATION TABLE > */ >- [IB_VL_ARBITRATION_TABLE_F] {0, 512, "VLArbTbl", mad_dump_array}, >+ {0, 512, "VLArbTbl", mad_dump_array}, > > /* > * IB vendor classes range 2 > */ >- [IB_VEND2_OUI_F] {BE_OFFS(36*8, 24), "OUI", >mad_dump_array}, >- [IB_VEND2_DATA_F] {40*8, (256-40)*8, "Vendor2Data", >mad_dump_array}, >+ {BE_OFFS(36*8, 24), "OUI", mad_dump_array}, >+ {40*8, (256-40)*8, "Vendor2Data", mad_dump_array}, > > /* > * Extended port counters > */ >- [IB_PC_EXT_PORT_SELECT_F] {BITSOFFS(8, 8), "PortSelect", >mad_dump_uint}, >- [IB_PC_EXT_COUNTER_SELECT_F] {BITSOFFS(16, 16), "CounterSelect", >mad_dump_hex}, >- [IB_PC_EXT_XMT_BYTES_F] {64, 64, "PortXmitData", >mad_dump_uint}, >- [IB_PC_EXT_RCV_BYTES_F] {128, 64, "PortRcvData", >mad_dump_uint}, >- [IB_PC_EXT_XMT_PKTS_F] {192, 64, "PortXmitPkts", >mad_dump_uint}, >- [IB_PC_EXT_RCV_PKTS_F] {256, 64, "PortRcvPkts", >mad_dump_uint}, >- [IB_PC_EXT_XMT_UPKTS_F] {320, 64, "PortUnicastXmitPkts", >mad_dump_uint}, >- [IB_PC_EXT_RCV_UPKTS_F] {384, 64, "PortUnicastRcvPkts", >mad_dump_uint}, >- [IB_PC_EXT_XMT_MPKTS_F] {448, 64, "PortMulticastXmitPkts", >mad_dump_uint}, >- [IB_PC_EXT_RCV_MPKTS_F] {512, 64, "PortMulticastRcvPkts", >mad_dump_uint}, >+ {BITSOFFS(8, 8), "PortSelect", mad_dump_uint}, >+ {BITSOFFS(16, 16), "CounterSelect", mad_dump_hex}, >+ {64, 64, "PortXmitData", mad_dump_uint}, >+ {128, 64, "PortRcvData", mad_dump_uint}, >+ {192, 64, "PortXmitPkts", mad_dump_uint}, >+ {256, 64, "PortRcvPkts", mad_dump_uint}, >+ {320, 64, "PortUnicastXmitPkts", mad_dump_uint}, >+ {384, 64, "PortUnicastRcvPkts", mad_dump_uint}, >+ {448, 64, "PortMulticastXmitPkts", mad_dump_uint}, >+ {512, 64, "PortMulticastRcvPkts", mad_dump_uint}, >+ {0, 0}, /* IB_PC_EXT_LAST_F */ > > /* > * GUIDInfo fields > */ >- [IB_GUID_GUID0_F] {0, 64, "GUID0", mad_dump_hex}, 
>+ {0, 64, "GUID0", mad_dump_hex}, >+ {0, 0} /* IB_FIELD_LAST_ */ > > }; > >@@ -561,9 +576,12 @@ static char *_mad_dump_field(const ib_field_t *f, const >char *name, char *buf, i > > static int _mad_dump(ib_mad_dump_fn *fn, const char *name, void *val, int >valsz) > { >- ib_field_t f = { .def_dump_fn = fn, .bitlen = valsz * 8}; >+ ib_field_t f; > char buf[512]; > >+ f.def_dump_fn = fn; >+ f.bitlen = valsz * 8; >+ > return printf("%s\n", _mad_dump_field(&f, name, buf, sizeof buf, val)); > } > >-- >1.5.2.5 From sean.hefty at intel.com Mon Jan 19 09:40:58 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 19 Jan 2009 09:40:58 -0800 Subject: [ofa-general] [PATCH 3/3 v2] Minor changes to allow portability to WinOF In-Reply-To: References: Message-ID: <000301c97a5d$16e36bc0$dd5b180a@amr.corp.intel.com> adding ofw mail list >- cleanup unnecessary include files >- ULL declaration on 64 bit constants >- cast or change data types to fix build warnings on windows > >Signed-off-by: Arlin Davis >--- > libibmad/src/dump.c | 25 +++++++++---------------- > libibmad/src/gs.c | 7 +------ > libibmad/src/mad.c | 14 +++++--------- > libibmad/src/portid.c | 9 +++------ > libibmad/src/register.c | 3 +-- > libibmad/src/resolve.c | 3 +-- > libibmad/src/rpc.c | 9 ++++----- > libibmad/src/sa.c | 1 - > libibmad/src/serv.c | 4 +--- > libibmad/src/smp.c | 1 - > libibmad/src/vendor.c | 1 - > 11 files changed, 25 insertions(+), 52 deletions(-) > >diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c >index 38a2254..a7575b9 100644 >--- a/libibmad/src/dump.c >+++ b/libibmad/src/dump.c >@@ -36,13 +36,6 @@ > # include > #endif /* HAVE_CONFIG_H */ > >-#include >-#include >-#include >-#include >-#include >-#include >- > #include > > void >@@
-114,13 +107,13 @@ mad_dump_hex(char *buf, int bufsz, void *val, int valsz) > snprintf(buf, bufsz, "0x%08x", *(uint32_t *)val); > break; > case 5: >- snprintf(buf, bufsz, "0x%010" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffllu); >+ snprintf(buf, bufsz, "0x%010" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffULL); > break; > case 6: >- snprintf(buf, bufsz, "0x%012" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffllu); >+ snprintf(buf, bufsz, "0x%012" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffULL); > break; > case 7: >- snprintf(buf, bufsz, "0x%014" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffffllu); >+ snprintf(buf, bufsz, "0x%014" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffffULL); > break; > case 8: > snprintf(buf, bufsz, "0x%016" PRIx64, *(uint64_t *)val); >@@ -148,13 +141,13 @@ mad_dump_rhex(char *buf, int bufsz, void *val, int valsz) > snprintf(buf, bufsz, "%08x", *(uint32_t *)val); > break; > case 5: >- snprintf(buf, bufsz, "%010" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffllu); >+ snprintf(buf, bufsz, "%010" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffULL); > break; > case 6: >- snprintf(buf, bufsz, "%012" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffllu); >+ snprintf(buf, bufsz, "%012" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffULL); > break; > case 7: >- snprintf(buf, bufsz, "%014" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffffllu); >+ snprintf(buf, bufsz, "%014" PRIx64, *(uint64_t *)val & (uint64_t) >0xffffffffffffffULL); > break; > case 8: > snprintf(buf, bufsz, "%016" PRIx64, *(uint64_t *)val); >@@ -606,7 +599,7 @@ typedef struct _ib_vl_arb_table { > uint8_t res_vl; > uint8_t weight; > } vl_entry[IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK]; >-} __attribute__((packed)) ib_vl_arb_table_t; >+} ib_vl_arb_table_t; > > static inline void > ib_vl_arb_get_vl(uint8_t res_vl, uint8_t *const vl ) >@@ -634,7 +627,7 @@ void > mad_dump_vlarbitration(char *buf, int bufsz, void *val, int num) > 
{ > ib_vl_arb_table_t* p_vla_tbl = val; >- unsigned i, n; >+ int i, n; > uint8_t vl; > > num /= sizeof(p_vla_tbl->vl_entry[0]); >@@ -681,7 +674,7 @@ _dump_fields(char *buf, int bufsz, void *data, int start, >int end) > bufsz -= n; > } > >- return s - buf; >+ return (int)(s - buf); > } > > void >diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c >index d350c0d..05ee132 100644 >--- a/libibmad/src/gs.c >+++ b/libibmad/src/gs.c >@@ -35,13 +35,8 @@ > # include > #endif /* HAVE_CONFIG_H */ > >-#include >-#include >-#include >-#include >- >-#include > #include "mad.h" >+#include > > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN >diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c >index be27c09..4a12e6c 100644 >--- a/libibmad/src/mad.c >+++ b/libibmad/src/mad.c >@@ -35,14 +35,10 @@ > # include > #endif /* HAVE_CONFIG_H */ > >-#include >-#include >-#include >-#include > #include > >-#include > #include >+#include > > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN >@@ -55,7 +51,7 @@ mad_trid(void) > uint64_t next; > > if (!base) { >- srandom(time(0)*getpid()); >+ srandom((int)time(0)*getpid()); > base = random(); > trid = random(); > } >@@ -97,8 +93,8 @@ mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, >void *data) > mad_set_field(buf, 0, IB_MAD_ATTRMOD_F, rpc->attr.mod); > > /* words 7,8 */ >- mad_set_field(buf, 0, IB_MAD_MKEY_F, rpc->mkey >> 32); >- mad_set_field(buf, 4, IB_MAD_MKEY_F, rpc->mkey & 0xffffffff); >+ mad_set_field(buf, 0, IB_MAD_MKEY_F, (uint32_t)(rpc->mkey >> 32)); >+ mad_set_field(buf, 4, IB_MAD_MKEY_F, (uint32_t)(rpc->mkey & 0xffffffff)); > > if (rpc->mgtclass == IB_SMI_DIRECT_CLASS) { > /* word 9 */ >@@ -168,5 +164,5 @@ mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t >*dport, > mad_set_field(mad, 0, IB_SA_RMPP_D2_F, rmpp->d2.u); > } > >- return p - mad; >+ return ((int)(p - mad)); > } >diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c >index 61e6be0..9a69728 100644 >--- a/libibmad/src/portid.c >+++ b/libibmad/src/portid.c >@@ 
-37,10 +37,7 @@ > > #include > #include >-#include > #include >-#include >-#include > > #include > >@@ -94,7 +91,7 @@ str2drpath(ib_dr_path_t *path, char *routepath, int drslid, >int drdlid) > while (str && *str) { > if ((s = strchr(str, ','))) > *s = 0; >- path->p[++path->cnt] = atoi(str); >+ path->p[++path->cnt] = (uint8_t)atoi(str); > if (!s) > break; > str = s+1; >@@ -112,11 +109,11 @@ drpath2str(ib_dr_path_t *path, char *dstr, size_t >dstr_size) > int i = 0; > int rc = snprintf(dstr, dstr_size, "slid %d; dlid %d; %d", > path->drslid, path->drdlid, path->p[0]); >- if (rc >= dstr_size) >+ if (rc >= (int)dstr_size) > return dstr; > for (i = 1; i <= path->cnt; i++) { > rc += snprintf(dstr+rc, dstr_size-rc, ",%d", path->p[i]); >- if (rc >= dstr_size) >+ if (rc >= (int)dstr_size) > break; > } > return (dstr); >diff --git a/libibmad/src/register.c b/libibmad/src/register.c >index 045f840..ea916da 100644 >--- a/libibmad/src/register.c >+++ b/libibmad/src/register.c >@@ -37,12 +37,11 @@ > > #include > #include >-#include > #include > #include > >-#include > #include "mad.h" >+#include > > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN >diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c >index 906b28d..2e373e6 100644 >--- a/libibmad/src/resolve.c >+++ b/libibmad/src/resolve.c >@@ -37,11 +37,10 @@ > > #include > #include >-#include > #include > >-#include > #include >+#include > > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN >diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c >index 670a936..3bf3640 100644 >--- a/libibmad/src/rpc.c >+++ b/libibmad/src/rpc.c >@@ -37,12 +37,11 @@ > > #include > #include >-#include > #include > #include > >-#include > #include "mad.h" >+#include > > #define MAX_CLASS 256 > >@@ -129,7 +128,7 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int >agentid, int len, > save_mad = 0; > } > >- trid = mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); >+ trid = (uint32_t)mad_get_field64(umad_get_mad(sndbuf), 0, 
IB_MAD_TRID_F); > > for (retries = 0; retries < madrpc_retries; retries++) { > if (retries) { >@@ -298,7 +297,7 @@ madrpc_init(char *dev_name, int dev_port, int >*mgmt_classes, int num_classes) > IBPANIC("too many classes %d requested", num_classes); > > while (num_classes--) { >- int rmpp_version = 0; >+ uint8_t rmpp_version = 0; > int mgmt = *mgmt_classes++; > > if (mgmt == IB_SA_CLASS) >@@ -343,7 +342,7 @@ mad_rpc_open_port(char *dev_name, int dev_port, > } > > while (num_classes--) { >- int rmpp_version = 0; >+ uint8_t rmpp_version = 0; > int mgmt = *mgmt_classes++; > int agent; > >diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c >index c601254..c08a392 100644 >--- a/libibmad/src/sa.c >+++ b/libibmad/src/sa.c >@@ -37,7 +37,6 @@ > > #include > #include >-#include > #include > > #include >diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c >index b329352..611a93f 100644 >--- a/libibmad/src/serv.c >+++ b/libibmad/src/serv.c >@@ -37,12 +37,10 @@ > > #include > #include >-#include > #include >-#include > >-#include > #include >+#include > > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN >diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c >index ad6b066..e872602 100644 >--- a/libibmad/src/smp.c >+++ b/libibmad/src/smp.c >@@ -37,7 +37,6 @@ > > #include > #include >-#include > #include > > #include >diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c >index eb703f6..7928a58 100644 >--- a/libibmad/src/vendor.c >+++ b/libibmad/src/vendor.c >@@ -37,7 +37,6 @@ > > #include > #include >-#include > #include > > #include >-- >1.5.2.5 > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From swise at opengridcomputing.com Mon Jan 19 10:42:14 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 19 Jan 2009 12:42:14 -0600 Subject: [ofa-general] QP needs a 
lot of memory In-Reply-To: <000001c97790$acad6310$82cd180a@amr.corp.intel.com> References: <000001c97790$acad6310$82cd180a@amr.corp.intel.com> Message-ID: <4974C986.1080502@opengridcomputing.com> Sean Hefty wrote: >> I am running OFED 1.4 on a Chelsio T3 RNIC. >> When I was trying to connect a large number of clients (several hundred), >> I noticed that the server was running out of memory. One instance of a >> whenever I do an rdma_create_qp(), I lose about 7MB of main memory. >> This severely limits the scalability of my application. >> >> Is there a reason for that? >> > > The amount of memory needed per QP is based on the send and receive queue sizes, > plus the number of SGEs. I don't know specific details about the Chelsio > adapter itself to know if 7MB is high or not. > > That seems very high. A T3 max depth QP should only use around 256KB of DMA-coherent memory. The max depth CQ should be around 256KB too. So something is whacked if it's consuming 7MB per QP... How are you measuring this memory usage? Steve.
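To make the sizing argument above concrete, here is a rough sketch of how queue depths and SGE counts could translate into per-QP queue memory. The entry sizes (EXAMPLE_WQE_HDR_BYTES, EXAMPLE_SGE_BYTES), the power-of-two rounding of the queue depth, and the example_* helper names are all assumptions chosen for illustration; none of them comes from the Chelsio T3 driver or its documentation. The parameter names mirror the ibv_qp_cap fields supplied at QP creation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-entry sizes, for illustration only. Real values are
 * device specific and are NOT taken from any Chelsio documentation. */
#define EXAMPLE_WQE_HDR_BYTES 32u
#define EXAMPLE_SGE_BYTES     16u

/* Round n up to the next power of two. Assumption: the provider sizes
 * its rings this way; many do, but it is not guaranteed. */
static size_t example_round_up_pow2(size_t n)
{
	size_t p = 1;

	/* Guard against overflow of the shift below. */
	assert(n <= ((size_t)1 << (sizeof(size_t) * 8 - 1)));
	while (p < n)
		p <<= 1;
	return p;
}

/* Bytes for one work queue: each entry is a header plus max_sge SGEs. */
static size_t example_wq_bytes(uint32_t depth, uint32_t max_sge)
{
	size_t entry = EXAMPLE_WQE_HDR_BYTES +
		       (size_t)max_sge * EXAMPLE_SGE_BYTES;

	return example_round_up_pow2(depth) * entry;
}

/* Estimated queue memory for one QP: send queue plus receive queue. */
static size_t example_qp_bytes(uint32_t max_send_wr, uint32_t max_send_sge,
			       uint32_t max_recv_wr, uint32_t max_recv_sge)
{
	return example_wq_bytes(max_send_wr, max_send_sge) +
	       example_wq_bytes(max_recv_wr, max_recv_sge);
}
```

With these made-up sizes, example_qp_bytes(1000, 4, 1000, 4) comes to 196608 bytes (192KB), so under this model even generous queue settings land far below 7MB per QP.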
From vlad at lists.openfabrics.org Tue Jan 20 03:14:21 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 20 Jan 2009 03:14:21 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090120-0200 daily build status Message-ID: <20090120111422.0F17AE61083@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with 
linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dorfman.eli at gmail.com Tue Jan 20 05:56:52 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 20 Jan 2009 15:56:52 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 0/5] subnet configuration update Message-ID: <4975D824.6020607@gmail.com> The following patches handle subnet configuration updates. Subnet configuration parameters are rescanned on every heavy sweep and are updated where possible. From dorfman.eli at gmail.com Tue Jan 20 06:02:12 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 20 Jan 2009 16:02:12 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 1/5] opensm/osm_opensm.[ch] make setup and destroy routing engines functions global. In-Reply-To: <4975D824.6020607@gmail.com> References: <4975D824.6020607@gmail.com> Message-ID: <4975D964.4020901@gmail.com> make setup and destroy routing engines functions global. change the setup_routing_engines() and destroy_routing_engines() declarations from static to global Signed-off-by: Eli Dorfman --- opensm/include/opensm/osm_opensm.h | 53 ++++++++++++++++++++++++++++++++++++ opensm/opensm/osm_opensm.c | 5 ++- 2 files changed, 56 insertions(+), 2 deletions(-) diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h index c121be4..5b0a1dd 100644 --- a/opensm/include/opensm/osm_opensm.h +++ b/opensm/include/opensm/osm_opensm.h @@ -458,6 +458,59 @@ osm_opensm_wait_for_subnet_up(IN osm_opensm_t * const p_osm, * SEE ALSO *********/ +/****f* OpenSM: OpenSM/setup_routing_engines +* NAME +* setup_routing_engines +* +* DESCRIPTION +* This function sets up the routing engines.
+* +* SYNOPSIS +*/ +void setup_routing_engines(osm_opensm_t *osm, const char *name); +/* +* PARAMETERS +* osm +* [in] Pointer to the OpenSM object whose routing engines are set up. +* +* name +* [in] Routing engine names. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Setup of routing engines +* +* SEE ALSO +* destroy_routing_engines +*********/ + +/****f* OpenSM: OpenSM/destroy_routing_engines +* NAME +* destroy_routing_engines +* +* DESCRIPTION +* This function destroys the routing engines. +* +* SYNOPSIS +*/ +void destroy_routing_engines(osm_opensm_t *osm); +/* +* PARAMETERS +* osm +* [in] Pointer to the OpenSM object whose routing engines are destroyed. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Teardown of routing engines +* +* SEE ALSO +* setup_routing_engines +*********/ + /****f* OpenSM: OpenSM/osm_routing_engine_type_str * NAME * osm_routing_engine_type_str diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index 7de2e5b..8ecb942 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -186,7 +186,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) "cannot find or setup routing engine \'%s\'", name); } -static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) +void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) { char *name, *str, *p; @@ -224,7 +224,7 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) /********************************************************************** **********************************************************************/ -static void destroy_routing_engines(osm_opensm_t *osm) +void destroy_routing_engines(osm_opensm_t *osm) { struct osm_routing_engine *r, *next; @@ -236,6 +236,7 @@ static void destroy_routing_engines(osm_opensm_t *osm) r->delete(r->context); free(r); } + osm->routing_engine_list = NULL; } /********************************************************************** -- 1.5.5 From dorfman.eli at
gmail.com Tue Jan 20 06:03:08 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 20 Jan 2009 16:03:08 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 2/5] opensm/main.c rescan subnet configuration after SIGHUP In-Reply-To: <4975D824.6020607@gmail.com> References: <4975D824.6020607@gmail.com> Message-ID: <4975D99C.2070601@gmail.com> rescan subnet configuration after SIGHUP call osm_subn_rescan_conf_files() after SIGHUP. this is important when priority is changed and SM is in standby. in that case it will not send capability mask trap and will not become master. Signed-off-by: Eli Dorfman --- opensm/opensm/main.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index f786192..0f7b822 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) osm_hup_flag = 0; /* a HUP signal should only start a new heavy sweep */ p_osm->subn.force_heavy_sweep = TRUE; + osm_subn_rescan_conf_files(&p_osm->subn); osm_opensm_sweep(p_osm); } } -- 1.5.5 From dorfman.eli at gmail.com Tue Jan 20 06:04:23 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 20 Jan 2009 16:04:23 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 3/5] opensm/osm_subnet.h put qos options flat below subnet opt In-Reply-To: <4975D824.6020607@gmail.com> References: <4975D824.6020607@gmail.com> Message-ID: <4975D9E7.4070604@gmail.com> put qos options flat below subnet opt put all qos option parameters (default, ca, sw, router) flat below subnet opt Signed-off-by: Eli Dorfman --- opensm/include/opensm/osm_subnet.h | 40 +++++++++++++++++++++++++++--------- 1 files changed, 30 insertions(+), 10 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 8863e47..692e449 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -99,11 +99,11 @@ struct 
osm_qos_policy; * SYNOPSIS */ typedef struct osm_qos_options { - unsigned max_vls; - int high_limit; - char *vlarb_high; - char *vlarb_low; - char *sl2vl; + unsigned qos_max_vls; + int qos_high_limit; + char *qos_vlarb_high; + char *qos_vlarb_low; + char *qos_sl2vl; } osm_qos_options_t; /* * FIELDS @@ -199,11 +199,31 @@ typedef struct osm_subn_opt { boolean_t daemon; boolean_t sm_inactive; boolean_t babbling_port_policy; - osm_qos_options_t qos_options; - osm_qos_options_t qos_ca_options; - osm_qos_options_t qos_sw0_options; - osm_qos_options_t qos_swe_options; - osm_qos_options_t qos_rtr_options; + unsigned qos_max_vls; + int qos_high_limit; + char *qos_vlarb_high; + char *qos_vlarb_low; + char *qos_sl2vl; + unsigned qos_ca_max_vls; + int qos_ca_high_limit; + char *qos_ca_vlarb_high; + char *qos_ca_vlarb_low; + char *qos_ca_sl2vl; + unsigned qos_sw0_max_vls; + int qos_sw0_high_limit; + char *qos_sw0_vlarb_high; + char *qos_sw0_vlarb_low; + char *qos_sw0_sl2vl; + unsigned qos_swe_max_vls; + int qos_swe_high_limit; + char *qos_swe_vlarb_high; + char *qos_swe_vlarb_low; + char *qos_swe_sl2vl; + unsigned qos_rtr_max_vls; + int qos_rtr_high_limit; + char *qos_rtr_vlarb_high; + char *qos_rtr_vlarb_low; + char *qos_rtr_sl2vl; boolean_t enable_quirks; boolean_t no_clients_rereg; #ifdef ENABLE_OSM_PERF_MGR -- 1.5.5 From dorfman.eli at gmail.com Tue Jan 20 06:05:24 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 20 Jan 2009 16:05:24 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 4/5] opensm/osm_qos.c support new arrangement of qos parameters In-Reply-To: <4975D824.6020607@gmail.com> References: <4975D824.6020607@gmail.com> Message-ID: <4975DA24.9050905@gmail.com> support new arrangement of qos parameters change functions to access qos parameters flat below subnet opt Signed-off-by: Eli Dorfman --- opensm/opensm/osm_qos.c | 46 +++++++++++++++++++++++++++++++--------------- 1 files changed, 31 insertions(+), 15 deletions(-) diff --git 
a/opensm/opensm/osm_qos.c b/opensm/opensm/osm_qos.c index b451c25..d5b62cf 100644 --- a/opensm/opensm/osm_qos.c +++ b/opensm/opensm/osm_qos.c @@ -261,6 +261,15 @@ static ib_api_status_t qos_physp_setup(osm_log_t * p_log, osm_sm_t * sm, return IB_SUCCESS; } +#define qos_init_local_opts(opt, subn_opt) \ +{ \ + opt ## _max_vls = subn_opt ## _max_vls; \ + opt ## _high_limit = subn_opt ## _high_limit; \ + opt ## _vlarb_high = subn_opt ## _vlarb_high; \ + opt ## _vlarb_low = subn_opt ## _vlarb_low; \ + opt ## _sl2vl = subn_opt ## _sl2vl; \ +} + osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) { struct qos_config ca_config, sw0_config, swe_config, rtr_config; @@ -274,20 +283,27 @@ osm_signal_t osm_qos_setup(osm_opensm_t * p_osm) ib_api_status_t status; unsigned force_update; uint8_t i; + osm_qos_options_t qos_options; + osm_qos_options_t qos_ca_options; + osm_qos_options_t qos_sw0_options; + osm_qos_options_t qos_swe_options; + osm_qos_options_t qos_rtr_options; if (!p_osm->subn.opt.qos) return OSM_SIGNAL_DONE; OSM_LOG_ENTER(&p_osm->log); - qos_build_config(&ca_config, &p_osm->subn.opt.qos_ca_options, - &p_osm->subn.opt.qos_options); - qos_build_config(&sw0_config, &p_osm->subn.opt.qos_sw0_options, - &p_osm->subn.opt.qos_options); - qos_build_config(&swe_config, &p_osm->subn.opt.qos_swe_options, - &p_osm->subn.opt.qos_options); - qos_build_config(&rtr_config, &p_osm->subn.opt.qos_rtr_options, - &p_osm->subn.opt.qos_options); + qos_init_local_opts(qos_options.qos, p_osm->subn.opt.qos); + qos_init_local_opts(qos_ca_options.qos, p_osm->subn.opt.qos_ca); + qos_init_local_opts(qos_sw0_options.qos, p_osm->subn.opt.qos_sw0); + qos_init_local_opts(qos_swe_options.qos, p_osm->subn.opt.qos_swe); + qos_init_local_opts(qos_rtr_options.qos, p_osm->subn.opt.qos_rtr); + + qos_build_config(&ca_config, &qos_ca_options, &qos_options); + qos_build_config(&sw0_config, &qos_sw0_options, &qos_options); + qos_build_config(&swe_config, &qos_swe_options, &qos_options); + 
qos_build_config(&rtr_config, &qos_rtr_options, &qos_options); cl_plock_excl_acquire(&p_osm->lock); @@ -381,14 +397,14 @@ static void qos_build_config(struct qos_config *cfg, memset(cfg, 0, sizeof(*cfg)); - cfg->max_vls = opt->max_vls > 0 ? opt->max_vls : dflt->max_vls; + cfg->max_vls = opt->qos_max_vls > 0 ? opt->qos_max_vls : dflt->qos_max_vls; - if (opt->high_limit >= 0) - cfg->vl_high_limit = (uint8_t) opt->high_limit; + if (opt->qos_high_limit >= 0) + cfg->vl_high_limit = (uint8_t) opt->qos_high_limit; else - cfg->vl_high_limit = (uint8_t) dflt->high_limit; + cfg->vl_high_limit = (uint8_t) dflt->qos_high_limit; - p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high; + p = opt->qos_vlarb_high ? opt->qos_vlarb_high : dflt->qos_vlarb_high; for (i = 0; i < 2 * IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; i++) { p += parse_vlarb_entry(p, &cfg->vlarb_high[i / @@ -397,7 +413,7 @@ static void qos_build_config(struct qos_config *cfg, IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK]); } - p = opt->vlarb_low ? opt->vlarb_low : dflt->vlarb_low; + p = opt->qos_vlarb_low ? opt->qos_vlarb_low : dflt->qos_vlarb_low; for (i = 0; i < 2 * IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK; i++) { p += parse_vlarb_entry(p, &cfg->vlarb_low[i / @@ -406,7 +422,7 @@ static void qos_build_config(struct qos_config *cfg, IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK]); } - p = opt->sl2vl ? opt->sl2vl : dflt->sl2vl; + p = opt->qos_sl2vl ? 
opt->qos_sl2vl : dflt->qos_sl2vl; for (i = 0; i < IB_MAX_NUM_VLS / 2; i++) p += parse_sl2vl_entry(p, &cfg->sl2vl.raw_vl_by_sl[i]); -- 1.5.5 From dorfman.eli at gmail.com Tue Jan 20 06:06:17 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Tue, 20 Jan 2009 16:06:17 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 5/5] opensm/osm_subnet.c support subnet configuration rescan and update In-Reply-To: <4975D824.6020607@gmail.com> References: <4975D824.6020607@gmail.com> Message-ID: <4975DA59.6070309@gmail.com> support subnet configuration rescan and update subnet configuration parameters are rescanned every heavy sweep. every parameter is defined with an unpack function to parse its value from opensm configuration file. some params require special post update operation to apply them. every parameter has also a flag that specifies whether it can be updated. Signed-off-by: Eli Dorfman --- opensm/opensm/osm_subnet.c | 801 +++++++++++++++++++++----------------------- 1 files changed, 381 insertions(+), 420 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index a6db304..fe1bbda 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -71,6 +71,183 @@ static const char null_str[] = "(null)"; +#define OPT_OFFSET(member) offsetof(osm_subn_opt_t, member) + +//typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc); +typedef void (update_fn_t)(osm_subn_t *p_subn, void *p_val); +typedef void (unpack_fn_t)(osm_subn_t *p_subn, char *p_key, char *p_val_str, void *p_val, update_fn_t *f); + +typedef struct opt_rec { + char *name; + int field_offset; + unpack_fn_t *unpack_fn; + update_fn_t *update_fn; + int can_update; +} opt_rec_t; + +static unpack_fn_t opts_unpack_uint8, opts_unpack_uint16, opts_unpack_net16, opts_unpack_uint32, + opts_unpack_int32, opts_unpack_net64, opts_unpack_charp, opts_unpack_boolean; + +static update_fn_t opts_update_log_max_size, opts_update_sminfo_polling_timeout, + 
opts_update_routing_engine, opts_update_sm_priority; + +static const opt_rec_t opt_tbl[] = { + { "guid", OPT_OFFSET(guid), opts_unpack_net64, NULL, 0 }, + { "m_key", OPT_OFFSET(m_key), opts_unpack_net64, NULL, 1 }, + { "sm_key", OPT_OFFSET(sm_key), opts_unpack_net64, NULL, 1 }, + { "sa_key", OPT_OFFSET(sa_key), opts_unpack_net64, NULL, 1 }, + { "subnet_prefix", OPT_OFFSET(subnet_prefix), opts_unpack_net64, NULL, 1 }, + { "m_key_lease_period", OPT_OFFSET(m_key_lease_period), opts_unpack_net16, NULL, 1 }, + { "sweep_interval", OPT_OFFSET(sweep_interval), opts_unpack_uint32, NULL, 1 }, + { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_unpack_uint32, NULL, 1 }, + { "console", OPT_OFFSET(console), opts_unpack_charp, NULL, 0 }, + { "console_port", OPT_OFFSET(console_port), opts_unpack_uint16, NULL, 0 }, + { "transaction_timeout", OPT_OFFSET(transaction_timeout), opts_unpack_uint32, NULL, 1 }, + { "max_msg_fifo_timeout", OPT_OFFSET(max_msg_fifo_timeout), opts_unpack_uint32, NULL, 1 }, + { "sm_priority", OPT_OFFSET(sm_priority), opts_unpack_uint8, opts_update_sm_priority, 1 }, + { "lmc", OPT_OFFSET(lmc), opts_unpack_uint8, NULL, 1 }, + { "lmc_esp0", OPT_OFFSET(lmc_esp0), opts_unpack_boolean, NULL, 1 }, + { "max_op_vls", OPT_OFFSET(max_op_vls), opts_unpack_uint8, NULL, 1 }, + { "force_link_speed", OPT_OFFSET(force_link_speed), opts_unpack_uint8, NULL, 1 }, + { "reassign_lids", OPT_OFFSET(reassign_lids), opts_unpack_boolean, NULL, 1 }, + { "ignore_other_sm", OPT_OFFSET(ignore_other_sm), opts_unpack_boolean, NULL, 1 }, + { "single_thread", OPT_OFFSET(single_thread), opts_unpack_boolean, NULL, 0 }, + { "disable_multicast", OPT_OFFSET(disable_multicast), opts_unpack_boolean, NULL, 1 }, + { "force_log_flush", OPT_OFFSET(force_log_flush), opts_unpack_boolean, NULL, 1 }, + { "subnet_timeout", OPT_OFFSET(subnet_timeout), opts_unpack_uint8, NULL, 1 }, + { "packet_life_time", OPT_OFFSET(packet_life_time), opts_unpack_uint8, NULL, 1 }, + { "vl_stall_count", 
OPT_OFFSET(vl_stall_count), opts_unpack_uint8, NULL, 1 }, + { "leaf_vl_stall_count", OPT_OFFSET(leaf_vl_stall_count), opts_unpack_uint8, NULL, 1 }, + { "head_of_queue_lifetime", OPT_OFFSET(head_of_queue_lifetime), opts_unpack_uint8, NULL, 1 }, + { "leaf_head_of_queue_lifetime", OPT_OFFSET(leaf_head_of_queue_lifetime), opts_unpack_uint8, NULL, 1 }, + { "local_phy_errors_threshold", OPT_OFFSET(local_phy_errors_threshold), opts_unpack_uint8, NULL, 1 }, + { "overrun_errors_threshold", OPT_OFFSET(overrun_errors_threshold), opts_unpack_uint8, NULL, 1 }, + { "sminfo_polling_timeout", OPT_OFFSET(sminfo_polling_timeout), opts_unpack_uint32, opts_update_sminfo_polling_timeout, 1 }, + { "polling_retry_number", OPT_OFFSET(polling_retry_number), opts_unpack_uint32, NULL, 1 }, + { "force_heavy_sweep", OPT_OFFSET(force_heavy_sweep), opts_unpack_boolean, NULL, 1 }, + { "log_flags", OPT_OFFSET(log_flags), opts_unpack_uint8, NULL, 1 }, + { "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_unpack_charp, NULL, 1 }, + { "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), opts_unpack_boolean, NULL, 1 }, + { "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_unpack_boolean, NULL, 1 }, + { "routing_engine", OPT_OFFSET(routing_engine_names), opts_unpack_charp, opts_update_routing_engine, 1 }, + { "connect_roots", OPT_OFFSET(connect_roots), opts_unpack_boolean, NULL, 1 }, + { "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_unpack_boolean, NULL, 1 }, + { "log_file", OPT_OFFSET(log_file), opts_unpack_charp, NULL, 0 }, + { "log_max_size", OPT_OFFSET(log_max_size), opts_unpack_uint32, opts_update_log_max_size }, + { "partition_config_file", OPT_OFFSET(partition_config_file), opts_unpack_charp, NULL, 1 }, + { "no_partition_enforcement", OPT_OFFSET(no_partition_enforcement), opts_unpack_boolean, NULL, 1 }, + { "qos", OPT_OFFSET(qos), opts_unpack_boolean, NULL, 1 }, + { "qos_policy_file", OPT_OFFSET(qos_policy_file), opts_unpack_charp, NULL, 1 }, + { 
"accum_log_file", OPT_OFFSET(accum_log_file), opts_unpack_boolean, NULL, 1 }, + { "dump_files_dir", OPT_OFFSET(dump_files_dir), opts_unpack_charp, NULL, 1 }, + { "lid_matrix_dump_file", OPT_OFFSET(lid_matrix_dump_file), opts_unpack_charp, NULL, 1 }, + { "lfts_file", OPT_OFFSET(lfts_file), opts_unpack_charp, NULL, 1 }, + { "root_guid_file", OPT_OFFSET(root_guid_file), opts_unpack_charp, NULL, 1 }, + { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_unpack_charp, NULL, 1 }, + { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_unpack_charp, NULL, 1 }, + { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_unpack_charp, NULL, 1 }, + { "sa_db_file", OPT_OFFSET(sa_db_file), opts_unpack_charp, NULL, 1 }, + { "do_mesh_analysis", OPT_OFFSET(do_mesh_analysis), opts_unpack_boolean, NULL, 1 }, + { "exit_on_fatal", OPT_OFFSET(exit_on_fatal), opts_unpack_boolean, NULL, 1 }, + { "honor_guid2lid_file", OPT_OFFSET(honor_guid2lid_file), opts_unpack_boolean, NULL, 1 }, + { "daemon", OPT_OFFSET(daemon), opts_unpack_boolean, NULL, 0 }, + { "sm_inactive", OPT_OFFSET(sm_inactive), opts_unpack_boolean, NULL, 1 }, + { "babbling_port_policy", OPT_OFFSET(babbling_port_policy), opts_unpack_boolean, NULL, 1 }, + +#ifdef ENABLE_OSM_PERF_MGR + { "perfmgr", OPT_OFFSET(perfmgr), opts_unpack_boolean, NULL, 0 }, + { "perfmgr_redir", OPT_OFFSET(perfmgr_redir), opts_unpack_boolean, NULL, 0 }, + { "perfmgr_sweep_time_s", OPT_OFFSET(perfmgr_sweep_time_s), opts_unpack_uint16, NULL, 0 }, + { "perfmgr_max_outstanding_queries", OPT_OFFSET(perfmgr_max_outstanding_queries), opts_unpack_uint32, NULL, 0 }, + { "event_db_dump_file", OPT_OFFSET(event_db_dump_file), opts_unpack_charp, NULL, 0 }, +#endif /* ENABLE_OSM_PERF_MGR */ + + { "event_plugin_name", OPT_OFFSET(event_plugin_name), opts_unpack_charp, NULL, 0 }, + { "node_name_map_name", OPT_OFFSET(node_name_map_name), opts_unpack_charp, NULL, 0 }, + + { "qos_max_vls", OPT_OFFSET(qos_max_vls), opts_unpack_uint32, NULL, 1 }, + { "qos_high_limit", 
OPT_OFFSET(qos_high_limit), opts_unpack_int32, NULL, 1 }, + { "qos_vlarb_high", OPT_OFFSET(qos_vlarb_high), opts_unpack_charp, NULL, 1 }, + { "qos_vlarb_low", OPT_OFFSET(qos_vlarb_low), opts_unpack_charp, NULL, 1 }, + { "qos_sl2vl", OPT_OFFSET(qos_sl2vl), opts_unpack_charp, NULL, 1 }, + + { "qos_ca_max_vls", OPT_OFFSET(qos_ca_max_vls), opts_unpack_uint32, NULL, 1 }, + { "qos_ca_high_limit", OPT_OFFSET(qos_ca_high_limit), opts_unpack_int32, NULL, 1 }, + { "qos_ca_vlarb_high", OPT_OFFSET(qos_ca_vlarb_high), opts_unpack_charp, NULL, 1 }, + { "qos_ca_vlarb_low", OPT_OFFSET(qos_ca_vlarb_low), opts_unpack_charp, NULL, 1 }, + { "qos_ca_sl2vl", OPT_OFFSET(qos_ca_sl2vl), opts_unpack_charp, NULL, 1 }, + + { "qos_sw0_max_vls", OPT_OFFSET(qos_sw0_max_vls), opts_unpack_uint32, NULL, 1 }, + { "qos_sw0_high_limit", OPT_OFFSET(qos_sw0_high_limit), opts_unpack_int32, NULL, 1 }, + { "qos_sw0_vlarb_high", OPT_OFFSET(qos_sw0_vlarb_high), opts_unpack_charp, NULL, 1 }, + { "qos_sw0_vlarb_low", OPT_OFFSET(qos_sw0_vlarb_low), opts_unpack_charp, NULL, 1 }, + { "qos_sw0_sl2vl", OPT_OFFSET(qos_sw0_sl2vl), opts_unpack_charp, NULL, 1 }, + + { "qos_swe_max_vls", OPT_OFFSET(qos_swe_max_vls), opts_unpack_uint32, NULL, 1 }, + { "qos_swe_high_limit", OPT_OFFSET(qos_swe_high_limit), opts_unpack_int32, NULL, 1 }, + { "qos_swe_vlarb_high", OPT_OFFSET(qos_swe_vlarb_high), opts_unpack_charp, NULL, 1 }, + { "qos_swe_vlarb_low", OPT_OFFSET(qos_swe_vlarb_low), opts_unpack_charp, NULL, 1 }, + { "qos_swe_sl2vl", OPT_OFFSET(qos_swe_sl2vl), opts_unpack_charp, NULL, 1 }, + + { "qos_rtr_max_vls", OPT_OFFSET(qos_rtr_max_vls), opts_unpack_uint32, NULL, 1 }, + { "qos_rtr_high_limit", OPT_OFFSET(qos_rtr_high_limit), opts_unpack_int32, NULL, 1 }, + { "qos_rtr_vlarb_high", OPT_OFFSET(qos_rtr_vlarb_high), opts_unpack_charp, NULL, 1 }, + { "qos_rtr_vlarb_low", OPT_OFFSET(qos_rtr_vlarb_low), opts_unpack_charp, NULL, 1 }, + { "qos_rtr_sl2vl", OPT_OFFSET(qos_rtr_sl2vl), opts_unpack_charp, NULL, 1 }, + + { "enable_quirks", 
OPT_OFFSET(enable_quirks), opts_unpack_boolean, NULL, 1 }, + { "no_clients_rereg", OPT_OFFSET(no_clients_rereg), opts_unpack_boolean, NULL, 1 }, + { "prefix_routes_file", OPT_OFFSET(prefix_routes_file), opts_unpack_charp, NULL, 1 }, + { "consolidate_ipv6_snm_req", OPT_OFFSET(consolidate_ipv6_snm_req), opts_unpack_boolean, NULL, 1 }, + {0} +}; + +static void opts_update_log_max_size(osm_subn_t *p_subn, void *p_val) +{ + uint32_t log_max_size = *((uint32_t *) p_val); + + if (!p_subn) + return; + + p_subn->opt.log_max_size = log_max_size << 20; /* convert from MB to Bytes */ +} + +static void opts_update_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val) +{ + osm_sm_t *p_sm; + uint32_t sminfo_polling_timeout = *((uint32_t *) p_val); + + if (!p_subn) + return; + + p_sm = &p_subn->p_osm->sm; + cl_timer_stop(&p_sm->polling_timer); + cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout); +} + +static void opts_update_routing_engine(osm_subn_t *p_subn, void *p_val) +{ + char *routing_engine_names = (char *) p_val; + + if (!p_subn) + return; + + destroy_routing_engines(p_subn->p_osm); + setup_routing_engines(p_subn->p_osm, routing_engine_names); +} + +static void opts_update_sm_priority(osm_subn_t *p_subn, void *p_val) +{ + osm_sm_t *p_sm; + uint8_t sm_priority = *((uint8_t *) p_val); + + if (!p_subn) + return; + + p_sm = &p_subn->p_osm->sm; + osm_set_sm_priority(p_sm, sm_priority); +} + /********************************************************************** **********************************************************************/ void osm_subn_construct(IN osm_subn_t * const p_subn) @@ -315,32 +492,30 @@ osm_port_t *osm_get_port_by_guid(IN osm_subn_t const *p_subn, IN ib_net64_t guid **********************************************************************/ static void subn_set_default_qos_options(IN osm_qos_options_t * opt) { - opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS; - opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT; - opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH; - 
opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW; - opt->sl2vl = OSM_DEFAULT_QOS_SL2VL; + opt->qos_max_vls = OSM_DEFAULT_QOS_MAX_VLS; + opt->qos_high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT; + opt->qos_vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH; + opt->qos_vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW; + opt->qos_sl2vl = OSM_DEFAULT_QOS_SL2VL; } -static void subn_init_qos_options(IN osm_qos_options_t * opt) -{ - opt->max_vls = 0; - opt->high_limit = -1; - opt->vlarb_high = NULL; - opt->vlarb_low = NULL; - opt->sl2vl = NULL; +#define subn_init_qos_options(opt) \ +{ \ + opt ## _max_vls = 0; \ + opt ## _high_limit = -1; \ + opt ## _vlarb_high = NULL; \ + opt ## _vlarb_low = NULL; \ + opt ## _sl2vl = NULL; \ } -static void subn_free_qos_options(IN osm_qos_options_t * opt) -{ - if (opt->vlarb_high && opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) - free(opt->vlarb_high); - - if (opt->vlarb_low && opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) - free(opt->vlarb_low); - - if (opt->sl2vl && opt->sl2vl != OSM_DEFAULT_QOS_SL2VL) - free(opt->sl2vl); +#define subn_free_qos_options(opt) \ +{ \ + if (opt ## _vlarb_high && opt ## _vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) \ + free(opt ## _vlarb_high); \ + if (opt ## _vlarb_low && opt ## _vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) \ + free(opt ## _vlarb_low); \ + if (opt ## _sl2vl && opt ## _sl2vl != OSM_DEFAULT_QOS_SL2VL) \ + free(opt ## _sl2vl); \ } /********************************************************************** @@ -431,11 +606,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->no_clients_rereg = FALSE; p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE; p_opt->consolidate_ipv6_snm_req = FALSE; - subn_init_qos_options(&p_opt->qos_options); - subn_init_qos_options(&p_opt->qos_ca_options); - subn_init_qos_options(&p_opt->qos_sw0_options); - subn_init_qos_options(&p_opt->qos_swe_options); - subn_init_qos_options(&p_opt->qos_rtr_options); + subn_init_qos_options(p_opt->qos); + subn_init_qos_options(p_opt->qos_ca); + 
subn_init_qos_options(p_opt->qos_sw0); + subn_init_qos_options(p_opt->qos_swe); + subn_init_qos_options(p_opt->qos_rtr); } /********************************************************************** @@ -470,137 +645,167 @@ static void log_config_value(char *name, const char *fmt, ...) } static void -opts_unpack_net64(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint64_t * p_val) +opts_unpack_net64(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint64_t val = strtoull(p_val_str, NULL, 0); - if (cl_hton64(val) != *p_val) { - log_config_value(p_key, "0x%016" PRIx64, val); - *p_val = cl_ntoh64(val); - } + uint64_t *p_val = (uint64_t *) p_v; + uint64_t val = strtoull(p_val_str, NULL, 0); + + if (cl_hton64(val) != *p_val) { + log_config_value(p_key, "0x%016" PRIx64, val); + if (pfn) + pfn(p_subn, &val); + *p_val = cl_ntoh64(val); } } /********************************************************************** **********************************************************************/ static void -opts_unpack_uint32(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint32_t * p_val) +opts_unpack_uint32(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint32_t val = strtoul(p_val_str, NULL, 0); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - *p_val = val; - } + uint32_t *p_val = (uint32_t *) p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_int32(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN int32_t * p_val) +opts_unpack_int32(IN osm_subn_t *p_subn, + IN char *p_key, IN char 
*p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - int32_t val = strtol(p_val_str, NULL, 0); - if (val != *p_val) { - log_config_value(p_key, "%d", val); - *p_val = val; - } + int32_t *p_val = (int32_t *) p_v; + int32_t val = strtol(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%d", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_uint16(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint16_t * p_val) +opts_unpack_uint16(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - *p_val = val; - } + uint16_t *p_val = (uint16_t *) p_v; + uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_net16(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint16_t * p_val) +opts_unpack_net16(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint32_t val; - val = strtoul(p_val_str, NULL, 0); - CL_ASSERT(val < 0x10000); - if (cl_hton32(val) != *p_val) { - log_config_value(p_key, "0x%04x", val); - *p_val = cl_hton16((uint16_t) val); - } + uint16_t *p_val = (uint16_t *) p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + CL_ASSERT(val < 0x10000); + if (cl_hton32(val) != *p_val) { + log_config_value(p_key, "0x%04x", val); + if (pfn) + pfn(p_subn, &val); + *p_val 
= cl_hton16((uint16_t) val); } } /********************************************************************** **********************************************************************/ static void -opts_unpack_uint8(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint8_t * p_val) +opts_unpack_uint8(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint32_t val; - val = strtoul(p_val_str, NULL, 0); - CL_ASSERT(val < 0x100); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - *p_val = (uint8_t) val; - } + uint8_t *p_val = (uint8_t *) p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + CL_ASSERT(val < 0x100); + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = (uint8_t) val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_boolean(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN boolean_t * p_val) +opts_unpack_boolean(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key) && p_val_str) { - boolean_t val; - if (strcmp("TRUE", p_val_str)) - val = FALSE; - else - val = TRUE; - - if (val != *p_val) { - log_config_value(p_key, "%s", p_val_str); - *p_val = val; - } + boolean_t *p_val = (boolean_t *) p_v; + boolean_t val; + + if (!p_val_str) + return; + + if (strcmp("TRUE", p_val_str)) + val = FALSE; + else + val = TRUE; + + if (val != *p_val) { + log_config_value(p_key, "%s", p_val_str); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_charp(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN char **p_val) +opts_unpack_charp(IN 
osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN update_fn_t pfn) { - if (!strcmp(p_req_key, p_key) && p_val_str) { - const char *current_str = *p_val ? *p_val : null_str ; - if (strcmp(p_val_str, current_str)) { - log_config_value(p_key, "%s", p_val_str); - /* special case the "(null)" string */ - if (strcmp(null_str, p_val_str) == 0) { - *p_val = NULL; - } else { - /* - Ignore the possible memory leak here; - the pointer may be to a static default. - */ - *p_val = strdup(p_val_str); - } + char **p_val = (char **) p_v; + const char *current_str = *p_val ? *p_val : null_str ; + + if (!p_val_str) + return; + + if (strcmp(p_val_str, current_str)) { + log_config_value(p_key, "%s", p_val_str); + /* special case the "(null)" string */ + if (strcmp(null_str, p_val_str) == 0) { + if (pfn) + pfn(p_subn, NULL); + *p_val = NULL; + } else { + if (pfn) + pfn(p_subn, p_val_str); + /* + Ignore the possible memory leak here; + the pointer may be to a static default. + */ + *p_val = strdup(p_val_str); } } } @@ -631,41 +836,20 @@ static char *clean_val(char *val) /********************************************************************** **********************************************************************/ -static void -subn_parse_qos_options(IN const char *prefix, - IN char *p_key, - IN char *p_val_str, IN osm_qos_options_t * opt) -{ - char name[256]; - - snprintf(name, sizeof(name), "%s_max_vls", prefix); - opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls); - snprintf(name, sizeof(name), "%s_high_limit", prefix); - opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit); - snprintf(name, sizeof(name), "%s_vlarb_high", prefix); - opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high); - snprintf(name, sizeof(name), "%s_vlarb_low", prefix); - opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_low); - snprintf(name, sizeof(name), "%s_sl2vl", prefix); - opts_unpack_charp(name, p_key, p_val_str, &opt->sl2vl); -} - -static int 
-subn_dump_qos_options(FILE * file, - const char *set_name, - const char *prefix, osm_qos_options_t * opt) -{ - return fprintf(file, "# %s\n" - "%s_max_vls %u\n" - "%s_high_limit %d\n" - "%s_vlarb_high %s\n" - "%s_vlarb_low %s\n" - "%s_sl2vl %s\n", - set_name, - prefix, opt->max_vls, - prefix, opt->high_limit, - prefix, opt->vlarb_high, - prefix, opt->vlarb_low, prefix, opt->sl2vl); +#define subn_dump_qos_options(file, name, prefix, opt) \ +{ \ + fprintf(file, "# %s\n" \ + "%s_max_vls %u\n" \ + "%s_high_limit %d\n" \ + "%s_vlarb_high %s\n" \ + "%s_vlarb_low %s\n" \ + "%s_sl2vl %s\n", \ + name, \ + prefix, opt ## _max_vls, \ + prefix, opt ## _high_limit, \ + prefix, opt ## _vlarb_high, \ + prefix, opt ## _vlarb_low, \ + prefix, opt ## _sl2vl); \ } /********************************************************************** @@ -904,14 +1088,13 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char *dflt) free(str); } -static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix, - osm_qos_options_t *dflt) -{ - subn_verify_max_vls(&set->max_vls, prefix, dflt->max_vls); - subn_verify_high_limit(&set->high_limit, prefix, dflt->high_limit); - subn_verify_vlarb(&set->vlarb_low, prefix, "low", dflt->vlarb_low); - subn_verify_vlarb(&set->vlarb_high, prefix, "high", dflt->vlarb_high); - subn_verify_sl2vl(&set->sl2vl, prefix, dflt->sl2vl); +#define subn_verify_qos_set(set, prefix, dflt) \ +{ \ + subn_verify_max_vls(&set ## _max_vls, prefix, dflt ## _max_vls); \ + subn_verify_high_limit(&set ## _high_limit, prefix, dflt ## _high_limit); \ + subn_verify_vlarb(&set ## _vlarb_low, prefix, "low", dflt ## _vlarb_low); \ + subn_verify_vlarb(&set ## _vlarb_high, prefix, "high", dflt ## _vlarb_high); \ + subn_verify_sl2vl(&set ## _sl2vl, prefix, dflt ## _sl2vl); \ } int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) @@ -960,15 +1143,11 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) subn_set_default_qos_options(&dflt); - 
subn_verify_qos_set(&p_opts->qos_options, "qos", &dflt); - subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca", - &p_opts->qos_options); - subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0", - &p_opts->qos_options); - subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe", - &p_opts->qos_options); - subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr", - &p_opts->qos_options); + subn_verify_qos_set(p_opts->qos, "qos", dflt.qos); + subn_verify_qos_set(p_opts->qos_ca, "qos_ca", p_opts->qos); + subn_verify_qos_set(p_opts->qos_sw0, "qos_sw0", p_opts->qos); + subn_verify_qos_set(p_opts->qos_swe, "qos_swe", p_opts->qos); + subn_verify_qos_set(p_opts->qos_rtr, "qos_rtr", p_opts->qos); } #ifdef ENABLE_OSM_PERF_MGR @@ -1000,6 +1179,8 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) char line[1024]; FILE *opts_file; char *p_key, *p_val; + const opt_rec_t *r; + char *p_field; opts_file = fopen(file_name, "r"); if (!opts_file) { @@ -1023,231 +1204,14 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) p_val = clean_val(p_val); - opts_unpack_net64("guid", p_key, p_val, &p_opts->guid); - - opts_unpack_net64("m_key", p_key, p_val, &p_opts->m_key); - - opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); - - opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); - - opts_unpack_net64("subnet_prefix", - p_key, p_val, &p_opts->subnet_prefix); - - opts_unpack_net16("m_key_lease_period", - p_key, p_val, &p_opts->m_key_lease_period); - - opts_unpack_uint32("sweep_interval", - p_key, p_val, &p_opts->sweep_interval); - - opts_unpack_uint32("max_wire_smps", - p_key, p_val, &p_opts->max_wire_smps); - - opts_unpack_charp("console", p_key, p_val, &p_opts->console); - - opts_unpack_uint16("console_port", - p_key, p_val, &p_opts->console_port); - - opts_unpack_uint32("transaction_timeout", - p_key, p_val, &p_opts->transaction_timeout); - - opts_unpack_uint32("max_msg_fifo_timeout", - p_key, p_val, 
&p_opts->max_msg_fifo_timeout); - - opts_unpack_uint8("sm_priority", - p_key, p_val, &p_opts->sm_priority); - - opts_unpack_uint8("lmc", p_key, p_val, &p_opts->lmc); - - opts_unpack_boolean("lmc_esp0", - p_key, p_val, &p_opts->lmc_esp0); - - opts_unpack_uint8("max_op_vls", - p_key, p_val, &p_opts->max_op_vls); - - opts_unpack_uint8("force_link_speed", - p_key, p_val, &p_opts->force_link_speed); - - opts_unpack_boolean("reassign_lids", - p_key, p_val, &p_opts->reassign_lids); - - opts_unpack_boolean("ignore_other_sm", - p_key, p_val, &p_opts->ignore_other_sm); - - opts_unpack_boolean("single_thread", - p_key, p_val, &p_opts->single_thread); - - opts_unpack_boolean("disable_multicast", - p_key, p_val, &p_opts->disable_multicast); - - opts_unpack_boolean("force_log_flush", - p_key, p_val, &p_opts->force_log_flush); - - opts_unpack_uint8("subnet_timeout", - p_key, p_val, &p_opts->subnet_timeout); - - opts_unpack_uint8("packet_life_time", - p_key, p_val, &p_opts->packet_life_time); - - opts_unpack_uint8("vl_stall_count", - p_key, p_val, &p_opts->vl_stall_count); - - opts_unpack_uint8("leaf_vl_stall_count", - p_key, p_val, &p_opts->leaf_vl_stall_count); - - opts_unpack_uint8("head_of_queue_lifetime", p_key, p_val, - &p_opts->head_of_queue_lifetime); - - opts_unpack_uint8("leaf_head_of_queue_lifetime", p_key, p_val, - &p_opts->leaf_head_of_queue_lifetime); - - opts_unpack_uint8("local_phy_errors_threshold", p_key, p_val, - &p_opts->local_phy_errors_threshold); - - opts_unpack_uint8("overrun_errors_threshold", p_key, p_val, - &p_opts->overrun_errors_threshold); - - opts_unpack_uint32("sminfo_polling_timeout", p_key, p_val, - &p_opts->sminfo_polling_timeout); - - opts_unpack_uint32("polling_retry_number", - p_key, p_val, &p_opts->polling_retry_number); - - opts_unpack_boolean("force_heavy_sweep", - p_key, p_val, &p_opts->force_heavy_sweep); - - opts_unpack_uint8("log_flags", - p_key, p_val, &p_opts->log_flags); - - opts_unpack_charp("port_prof_ignore_file", p_key, p_val, - 
&p_opts->port_prof_ignore_file); - - opts_unpack_boolean("port_profile_switch_nodes", p_key, p_val, - &p_opts->port_profile_switch_nodes); - - opts_unpack_boolean("sweep_on_trap", - p_key, p_val, &p_opts->sweep_on_trap); - - opts_unpack_charp("routing_engine", - p_key, p_val, &p_opts->routing_engine_names); - - opts_unpack_boolean("connect_roots", - p_key, p_val, &p_opts->connect_roots); - - opts_unpack_boolean("use_ucast_cache", - p_key, p_val, &p_opts->use_ucast_cache); - - opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); - - opts_unpack_uint32("log_max_size", p_key, p_val, - (void *) & p_opts->log_max_size); - p_opts->log_max_size *= 1024 * 1024; /* convert to MB */ - - opts_unpack_charp("partition_config_file", - p_key, p_val, &p_opts->partition_config_file); - - opts_unpack_boolean("no_partition_enforcement", p_key, p_val, - &p_opts->no_partition_enforcement); - - opts_unpack_boolean("qos", p_key, p_val, &p_opts->qos); - - opts_unpack_charp("qos_policy_file", - p_key, p_val, &p_opts->qos_policy_file); - - opts_unpack_boolean("accum_log_file", - p_key, p_val, &p_opts->accum_log_file); - - opts_unpack_charp("dump_files_dir", - p_key, p_val, &p_opts->dump_files_dir); - - opts_unpack_charp("lid_matrix_dump_file", - p_key, p_val, &p_opts->lid_matrix_dump_file); - - opts_unpack_charp("lfts_file", - p_key, p_val, &p_opts->lfts_file); - - opts_unpack_charp("root_guid_file", - p_key, p_val, &p_opts->root_guid_file); - - opts_unpack_charp("cn_guid_file", - p_key, p_val, &p_opts->cn_guid_file); - - opts_unpack_charp("ids_guid_file", - p_key, p_val, &p_opts->ids_guid_file); - - opts_unpack_charp("guid_routing_order_file", p_key, p_val, - &p_opts->guid_routing_order_file); - - opts_unpack_charp("sa_db_file", - p_key, p_val, &p_opts->sa_db_file); - - opts_unpack_boolean("do_mesh_analysis", - p_key, p_val, &p_opts->do_mesh_analysis); - - opts_unpack_boolean("exit_on_fatal", - p_key, p_val, &p_opts->exit_on_fatal); - - opts_unpack_boolean("honor_guid2lid_file", 
- p_key, p_val, &p_opts->honor_guid2lid_file); - - opts_unpack_boolean("daemon", p_key, p_val, &p_opts->daemon); - - opts_unpack_boolean("sm_inactive", - p_key, p_val, &p_opts->sm_inactive); - - opts_unpack_boolean("babbling_port_policy", - p_key, p_val, - &p_opts->babbling_port_policy); - -#ifdef ENABLE_OSM_PERF_MGR - opts_unpack_boolean("perfmgr", p_key, p_val, &p_opts->perfmgr); - - opts_unpack_boolean("perfmgr_redir", - p_key, p_val, &p_opts->perfmgr_redir); - - opts_unpack_uint16("perfmgr_sweep_time_s", - p_key, p_val, &p_opts->perfmgr_sweep_time_s); - - opts_unpack_uint32("perfmgr_max_outstanding_queries", - p_key, p_val, - &p_opts->perfmgr_max_outstanding_queries); - - opts_unpack_charp("event_db_dump_file", - p_key, p_val, &p_opts->event_db_dump_file); -#endif /* ENABLE_OSM_PERF_MGR */ - - opts_unpack_charp("event_plugin_name", - p_key, p_val, &p_opts->event_plugin_name); - - opts_unpack_charp("node_name_map_name", - p_key, p_val, &p_opts->node_name_map_name); - - subn_parse_qos_options("qos", - p_key, p_val, &p_opts->qos_options); - - subn_parse_qos_options("qos_ca", - p_key, p_val, &p_opts->qos_ca_options); - - subn_parse_qos_options("qos_sw0", - p_key, p_val, &p_opts->qos_sw0_options); - - subn_parse_qos_options("qos_swe", - p_key, p_val, &p_opts->qos_swe_options); - - subn_parse_qos_options("qos_rtr", - p_key, p_val, &p_opts->qos_rtr_options); - - opts_unpack_boolean("enable_quirks", - p_key, p_val, &p_opts->enable_quirks); - - opts_unpack_boolean("no_clients_rereg", - p_key, p_val, &p_opts->no_clients_rereg); - - opts_unpack_charp("prefix_routes_file", - p_key, p_val, &p_opts->prefix_routes_file); + for (r = opt_tbl; r->name; r++) { + if (strcmp(r->name, p_key)) + continue; - opts_unpack_boolean("consolidate_ipv6_snm_req", p_key, p_val, - &p_opts->consolidate_ipv6_snm_req); + p_field = (char *)p_opts + r->field_offset; + r->unpack_fn(NULL, p_key, + p_val, p_field, r->update_fn); + } } fclose(opts_file); @@ -1258,61 +1222,58 @@ int 
osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) { + osm_subn_opt_t *p_opts = &p_subn->opt; + const opt_rec_t *r; FILE *opts_file; char line[1024]; - char *p_key, *p_val, *p_last; + char *p_key, *p_val; + char *p_field; - if (!p_subn->opt.config_file) + if (!p_opts->config_file) return 0; - opts_file = fopen(p_subn->opt.config_file, "r"); + opts_file = fopen(p_opts->config_file, "r"); if (!opts_file) { if (errno == ENOENT) return 1; OSM_LOG(&p_subn->p_osm->log, OSM_LOG_ERROR, "cannot open file \'%s\': %s\n", - p_subn->opt.config_file, strerror(errno)); + p_opts->config_file, strerror(errno)); return -1; } - subn_free_qos_options(&p_subn->opt.qos_options); - subn_free_qos_options(&p_subn->opt.qos_ca_options); - subn_free_qos_options(&p_subn->opt.qos_sw0_options); - subn_free_qos_options(&p_subn->opt.qos_swe_options); - subn_free_qos_options(&p_subn->opt.qos_rtr_options); + subn_free_qos_options(p_opts->qos); + subn_free_qos_options(p_opts->qos_ca); + subn_free_qos_options(p_opts->qos_sw0); + subn_free_qos_options(p_opts->qos_swe); + subn_free_qos_options(p_opts->qos_rtr); - subn_init_qos_options(&p_subn->opt.qos_options); - subn_init_qos_options(&p_subn->opt.qos_ca_options); - subn_init_qos_options(&p_subn->opt.qos_sw0_options); - subn_init_qos_options(&p_subn->opt.qos_swe_options); - subn_init_qos_options(&p_subn->opt.qos_rtr_options); + subn_init_qos_options(p_opts->qos); + subn_init_qos_options(p_opts->qos_ca); + subn_init_qos_options(p_opts->qos_sw0); + subn_init_qos_options(p_opts->qos_swe); + subn_init_qos_options(p_opts->qos_rtr); while (fgets(line, 1023, opts_file) != NULL) { /* get the first token */ - p_key = strtok_r(line, " \t\n", &p_last); - if (p_key) { - p_val = strtok_r(NULL, " \t\n", &p_last); - - subn_parse_qos_options("qos", p_key, p_val, - &p_subn->opt.qos_options); - - subn_parse_qos_options("qos_ca", p_key, p_val, - &p_subn->opt.qos_ca_options); - - 
subn_parse_qos_options("qos_sw0", p_key, p_val, - &p_subn->opt.qos_sw0_options); + p_key = strtok_r(line, " \t\n", &p_val); + if (!p_key) + continue; - subn_parse_qos_options("qos_swe", p_key, p_val, - &p_subn->opt.qos_swe_options); + p_val = clean_val(p_val); - subn_parse_qos_options("qos_rtr", p_key, p_val, - &p_subn->opt.qos_rtr_options); + for (r = opt_tbl; r->name; r++) { + if (!r->can_update || strcmp(r->name, p_key)) + continue; + p_field = (char *)p_opts + r->field_offset; + r->unpack_fn(p_subn, p_key, + p_val, p_field, r->update_fn); } } fclose(opts_file); - osm_subn_verify_config(&p_subn->opt); + osm_subn_verify_config(p_opts); osm_parse_prefix_routes_file(p_subn); @@ -1640,23 +1601,23 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) subn_dump_qos_options(out, "QoS default options", "qos", - &p_opts->qos_options); + p_opts->qos); fprintf(out, "\n"); subn_dump_qos_options(out, "QoS CA options", "qos_ca", - &p_opts->qos_ca_options); + p_opts->qos_ca); fprintf(out, "\n"); subn_dump_qos_options(out, "QoS Switch Port 0 options", "qos_sw0", - &p_opts->qos_sw0_options); + p_opts->qos_sw0); fprintf(out, "\n"); subn_dump_qos_options(out, "QoS Switch external ports options", "qos_swe", - &p_opts->qos_swe_options); + p_opts->qos_swe); fprintf(out, "\n"); subn_dump_qos_options(out, "QoS Router ports options", "qos_rtr", - &p_opts->qos_rtr_options); + p_opts->qos_rtr); fprintf(out, "\n"); fprintf(out, -- 1.5.5 From sashak at voltaire.com Tue Jan 20 07:05:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 20 Jan 2009 17:05:53 +0200 Subject: [ofa-general] [PATCH RFC] opensm: sort port order for routing by switch loads Message-ID: <20090120150553.GB28955@sashak.voltaire.com> It follows "port order" routing load balancer improvements (implemented using "--guid_routing_order_file" command line option). 
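A minimal, self-contained sketch of the ordering this patch proposes, assuming each leaf switch is reduced to an id and a count of active endport links. The names demo_switch, demo_cmp_load and demo_sort_by_load are hypothetical and are not OpenSM APIs; the patch itself works on osm_switch_t and counts links by walking the physical ports.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a leaf switch: not an OpenSM type. */
struct demo_switch {
	int id;			/* illustrative identifier */
	int active_endports;	/* number of ACTIVE links to endports */
};

/* qsort() comparator: larger load first (descending order). */
static int demo_cmp_load(const void *a, const void *b)
{
	const struct demo_switch *s1 = a;
	const struct demo_switch *s2 = b;

	/* (x > y) - (x < y) avoids the overflow plain subtraction could hit */
	return (s2->active_endports > s1->active_endports) -
	       (s2->active_endports < s1->active_endports);
}

/* Sort switches so that the most loaded ones come first. */
static void demo_sort_by_load(struct demo_switch *sw, size_t num)
{
	if (sw && num > 1)
		qsort(sw, num, sizeof(*sw), demo_cmp_load);
}
```

In this sketch the endports would then be appended to the order list switch by switch, in the sorted order. Switches with equal load may come out in any relative order, since qsort() is not guaranteed to be stable.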
This patch changes the default behavior: routing paths are balanced in such an
order that the most loaded links enter the balancer first. In most cases this
should provide better performance than random balancing (as is done now by
default).

The implementation is simple: the endport list for the load balancer is sorted
in descending order by the number of active endport links of the leaf switches.

Signed-off-by: Sasha Khapyorsky
---

Comments are appreciated.

Sasha

 opensm/opensm/osm_ucast_mgr.c | 58 ++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 96921a0..58a6714 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -744,6 +744,61 @@ static void clear_prof_ignore_flag(cl_map_item_t * const p_map_item, void *ctx)
 	}
 }
 
+static void add_sw_endports_to_order_list(osm_switch_t *sw, osm_ucast_mgr_t *m)
+{
+	osm_port_t *port;
+	osm_physp_t *p;
+	int i;
+	for (i = 1; i < sw->num_ports; i++) {
+		p = osm_node_get_physp_ptr(sw->p_node, i);
+		if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) {
+			port = osm_get_port_by_guid(m->p_subn,
+						    p->p_remote_physp->port_guid);
+			cl_qlist_insert_tail(&m->port_order_list,
+					     &port->list_item);
+			port->flag = 1;
+		}
+	}
+}
+
+static int sw_count_endport_links(osm_switch_t * const *s)
+{
+	const osm_switch_t *sw = *s;
+	int i, n = 0;
+	for (i = 1; i < sw->num_ports; i++) {
+		osm_physp_t *p = osm_node_get_physp_ptr(sw->p_node, i);
+		if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw &&
+		    ib_port_info_get_port_state(&p->port_info) ==
+		    IB_LINK_ACTIVE)
+			n++;
+	}
+	return n;
+}
+
+static int compar_sw_load(const void *s1, const void *s2)
+{
+	return sw_count_endport_links(s2) - sw_count_endport_links(s1);
+}
+
+static void sort_ports_by_switch_load(osm_ucast_mgr_t *m)
+{
+	int i, num = cl_qmap_count(&m->p_subn->sw_guid_tbl);
+	osm_switch_t **s = malloc(num * sizeof(*s));
+	if (!s) {
+		OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: "
+			"No memory, skip by switch load sorting.\n");
+		return;
+	}
+	s[0] = (osm_switch_t *)cl_qmap_head(&m->p_subn->sw_guid_tbl);
+	for (i = 1; i < num; i++)
+		s[i] = (osm_switch_t *)cl_qmap_next(&s[i-1]->map_item);
+
+	qsort(s, num, sizeof(*s), compar_sw_load);
+
+	for (i = 0; i < num; i++)
+		add_sw_endports_to_order_list(s[i], m);
+}
+
 static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr)
 {
 	cl_qlist_init(&p_mgr->port_order_list);
@@ -758,7 +813,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : "
 				"cannot parse guid routing order file \'%s\'\n",
 				p_mgr->p_subn->opt.guid_routing_order_file);
-	}
+	} else
+		sort_ports_by_switch_load(p_mgr);
 
 	if (p_mgr->p_subn->opt.port_prof_ignore_file) {
 		cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl,
-- 
1.6.0.4.766.g6fc4a

From sashak at voltaire.com Tue Jan 20 09:43:13 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 20 Jan 2009 19:43:13 +0200
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
Message-ID: <20090120174313.GC28955@sashak.voltaire.com>

Hi Arlin,

On 11:50 Sat 17 Jan , Arlin Davis wrote:
>
>
> Ok, here is revision 2 of the libibmad WinOF portability patches. I eliminated all #ifdef _WIN32
> crap and tried to limit the changes by adding os dependent file mad_osd.h. With these changes we
> could share the same code base for both OFED and WinOF. Please review and consider accepting this
> patch set.

The patches (I tried to check only 1 and 2) are malformed: whitespace is
mangled (in 2), and long lines are broken. Please check that they can be
applied with 'git am'.

>
> [PATCH 1/3] libibmad: add os dependent definitions.
> [PATCH 2/3] field.c remove c99 definitions, better portability with WinOF.

Is it possible to preserve c99 stuff?
It improves code maintainability a lot and, as far as I remember, it is what
was considered in previous discussions (about infiniband-diags porting to
WinOF).

> [PATCH 3/3] Minor changes to allow portability to WinOF
>
> infiniband/mad_osd.h added to provide support for os specific defintions
> for portability. With these changes, WinOF can pull directly from OFED
> git tree and share a common code base with minimal changes to mad.h and
> source tree.
>
> mad.h modifications include MAD_EXPORT for export declarations where
> appropriate. Datatype llu changed to ULL for 64bit constants.
>
> makefile.am modified to include new linux version of mad_osd.h
>
> Signed-off-by: Arlin Davis
> ---
> libibmad/Makefile.am | 7 +-
> libibmad/include/infiniband/mad.h | 120 ++++++++++++++++-----------------
> libibmad/include/infiniband/mad_osd.h | 48 +++++++++++++
> 3 files changed, 109 insertions(+), 66 deletions(-)
> create mode 100644 libibmad/include/infiniband/mad_osd.h
>
> diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am
> index beae1a4..8dea157 100644
> --- a/libibmad/Makefile.am
> +++ b/libibmad/Makefile.am
> @@ -1,7 +1,7 @@
>
>  SUBDIRS = .
>
> -INCLUDES = -I$(srcdir)/include/infiniband -I$(includedir)
> +INCLUDES = -I$(srcdir)/include -I$(srcdir)/include/infiniband -I$(includedir)

Something not related to porting... I would rather replace all '#include '
occurrences with '#include ' and then use '-I$(srcdir)/include' in the
INCLUDES definition.
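To make the quoted patch summary concrete, here is a rough sketch of what an OS-dependent header plus an export macro could look like for the Linux case. It is only an illustration: the real contents of mad_osd.h are not shown in this message, demo_subn_prefix_high32() is a hypothetical helper and not a libibmad function, and the remark about a Windows variant is an assumption.

```c
/* --- hypothetical Linux flavour of an OS-dependent header (cf. mad_osd.h) --- */
#include <stdint.h>
#include <stddef.h>

/* On Linux no export decoration is needed, so the macro expands to nothing.
 * A Windows flavour of the same header could define it as
 * __declspec(dllexport) instead; that is an assumption, not taken from
 * the patch. */
#define MAD_EXPORT

/* --- common header part (cf. mad.h) --- */

/* 64-bit constant written with the ULL suffix, as in the patch. */
#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000ULL

/* Illustrative exported function: returns the upper 32 bits of the prefix. */
MAD_EXPORT uint32_t demo_subn_prefix_high32(void)
{
	return (uint32_t)(IB_DEFAULT_SUBN_PREFIX >> 32);
}
```

With this split the common header needs no per-platform conditionals; each platform ships its own variant of the OS-dependent header under the same name.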
>
>  lib_LTLIBRARIES = libibmad.la
>
> @@ -23,9 +23,10 @@ libibmad_la_DEPENDENCIES = $(srcdir)/src/libibmad.map
>
>  libibmadincludedir = $(includedir)/infiniband
>
> -libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h
> +libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h $(srcdir)/include/infiniband/mad_osd.h
>
> -EXTRA_DIST = $(srcdir)/include/infiniband/mad.h libibmad.spec.in libibmad.spec \
> +EXTRA_DIST = $(srcdir)/include/infiniband/mad.h $(srcdir)/include/infiniband/mad_osd.h \
> +	libibmad.spec.in libibmad.spec \
>  	$(srcdir)/src/libibmad.map libibmad.ver autogen.sh

And if a file is listed in library_HEADERS it will be distributed, so there is
no need to list it in EXTRA_DIST (of course mad.h should be removed too).

>
>  dist-hook:
> diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
> index 0a962c0..fe607a7 100644
> --- a/libibmad/include/infiniband/mad.h
> +++ b/libibmad/include/infiniband/mad.h
> @@ -33,13 +33,7 @@
>  #ifndef _MAD_H_
>  #define _MAD_H_
>
> -#include
> -#include
> -#include
> -#include
> -#include
> -#include
> -#include
> +#include

Why should we remove all header files here? Some of them (such as stdio.h) are
not really system dependent.
Sasha > > #ifdef __cplusplus > # define BEGIN_C_DECLS extern "C" { > @@ -52,7 +46,7 @@ > BEGIN_C_DECLS > > #define IB_SUBNET_PATH_HOPS_MAX 64 > -#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000llu > +#define IB_DEFAULT_SUBN_PREFIX 0xfe80000000000000ULL > #define IB_DEFAULT_QP1_QKEY 0x80010000 > > #define IB_MAD_SIZE 256 > @@ -627,10 +621,10 @@ enum { > /******************************************************************************/ > > /* portid.c */ > -char * portid2str(ib_portid_t *portid); > -int portid2portnum(ib_portid_t *portid); > -int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid); > -char * drpath2str(ib_dr_path_t *path, char *dstr, size_t dstr_size); > +MAD_EXPORT char * portid2str(ib_portid_t *portid); > +MAD_EXPORT int portid2portnum(ib_portid_t *portid); > +MAD_EXPORT int str2drpath(ib_dr_path_t *path, char *routepath, int drslid, int drdlid); > +MAD_EXPORT char * drpath2str(ib_dr_path_t *path, char *dstr, size_t dstr_size); > > static inline int > ib_portid_set(ib_portid_t *portid, int lid, int qp, int qkey) > @@ -644,44 +638,44 @@ ib_portid_set(ib_portid_t *portid, int lid, int qp, int qkey) > } > > /* fields.c */ > -uint32_t mad_get_field(void *buf, int base_offs, int field); > -void mad_set_field(void *buf, int base_offs, int field, uint32_t val); > +MAD_EXPORT uint32_t mad_get_field(void *buf, int base_offs, int field); > +MAD_EXPORT void mad_set_field(void *buf, int base_offs, int field, uint32_t val); > /* field must be byte aligned */ > -uint64_t mad_get_field64(void *buf, int base_offs, int field); > -void mad_set_field64(void *buf, int base_offs, int field, uint64_t val); > -void mad_set_array(void *buf, int base_offs, int field, void *val); > -void mad_get_array(void *buf, int base_offs, int field, void *val); > -void mad_decode_field(uint8_t *buf, int field, void *val); > -void mad_encode_field(uint8_t *buf, int field, void *val); > -int mad_print_field(int field, const char *name, void *val); > -char 
*mad_dump_field(int field, char *buf, int bufsz, void *val); > -char *mad_dump_val(int field, char *buf, int bufsz, void *val); > +MAD_EXPORT uint64_t mad_get_field64(void *buf, int base_offs, int field); > +MAD_EXPORT void mad_set_field64(void *buf, int base_offs, int field, uint64_t val); > +MAD_EXPORT void mad_set_array(void *buf, int base_offs, int field, void *val); > +MAD_EXPORT void mad_get_array(void *buf, int base_offs, int field, void *val); > +MAD_EXPORT void mad_decode_field(uint8_t *buf, int field, void *val); > +MAD_EXPORT void mad_encode_field(uint8_t *buf, int field, void *val); > +MAD_EXPORT int mad_print_field(int field, const char *name, void *val); > +MAD_EXPORT char *mad_dump_field(int field, char *buf, int bufsz, void *val); > +MAD_EXPORT char *mad_dump_val(int field, char *buf, int bufsz, void *val); > > /* mad.c */ > -void *mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data); > -uint64_t mad_trid(void); > -int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, void *data); > +MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t *rpc, ib_dr_path_t *drpath, void *data); > +MAD_EXPORT uint64_t mad_trid(void); > +MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > void *data); > > /* register.c */ > -int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); > -int mad_register_client(int mgmt, uint8_t rmpp_version); > -int mad_register_server(int mgmt, uint8_t rmpp_version, > +MAD_EXPORT int mad_register_port_client(int port_id, int mgmt, uint8_t rmpp_version); > +MAD_EXPORT int mad_register_client(int mgmt, uint8_t rmpp_version); > +MAD_EXPORT int mad_register_server(int mgmt, uint8_t rmpp_version, > long method_mask[16/sizeof(long)], > uint32_t class_oui); > -int mad_class_agent(int mgmt); > -int mad_agent_class(int agent); > +MAD_EXPORT int mad_class_agent(int mgmt); > +MAD_EXPORT int mad_agent_class(int agent); > > /* serv.c */ > -int 
mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > +MAD_EXPORT int mad_send(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > void *data); > -void * mad_receive(void *umad, int timeout); > -int mad_respond(void *umad, ib_portid_t *portid, uint32_t rstatus); > -void * mad_alloc(void); > -void mad_free(void *umad); > +MAD_EXPORT void * mad_receive(void *umad, int timeout); > +MAD_EXPORT int mad_respond(void *umad, ib_portid_t *portid, uint32_t rstatus); > +MAD_EXPORT void * mad_alloc(void); > +MAD_EXPORT void mad_free(void *umad); > > /* vendor.c */ > -uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, > - ib_vendor_call_t *call); > +MAD_EXPORT uint8_t *ib_vendor_call(void *data, ib_portid_t *portid, > + ib_vendor_call_t *call); > > static inline int > mad_is_vendor_range1(int mgmt) > @@ -696,29 +690,29 @@ mad_is_vendor_range2(int mgmt) > } > > /* rpc.c */ > -int madrpc_portid(void); > -int madrpc_set_retries(int retries); > -int madrpc_set_timeout(int timeout); > -void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata); > -void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > +MAD_EXPORT int madrpc_portid(void); > +MAD_EXPORT int madrpc_set_retries(int retries); > +MAD_EXPORT int madrpc_set_timeout(int timeout); > +void * madrpc(ib_rpc_t *rpc, ib_portid_t *dport, void *payload, void *rcvdata); > +void * madrpc_rmpp(ib_rpc_t *rpc, ib_portid_t *dport, ib_rmpp_hdr_t *rmpp, > void *data); > -void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > +MAD_EXPORT void madrpc_init(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > void madrpc_save_mad(void *madbuf, int len); > -void madrpc_show_errors(int set); > +MAD_EXPORT void madrpc_show_errors(int set); > > -void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > +void * mad_rpc_open_port(char *dev_name, int dev_port, int *mgmt_classes, > int num_classes); > void mad_rpc_close_port(void 
*ibmad_port); > -void * mad_rpc(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > +void * mad_rpc(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > void *payload, void *rcvdata); > -void * mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > +void * mad_rpc_rmpp(const void *ibmad_port, ib_rpc_t *rpc, ib_portid_t *dport, > ib_rmpp_hdr_t *rmpp, void *data); > > /* smp.c */ > -uint8_t * smp_query(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, > +MAD_EXPORT uint8_t * smp_query(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, > unsigned timeout); > -uint8_t * smp_set(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, > +MAD_EXPORT uint8_t * smp_set(void *buf, ib_portid_t *id, unsigned attrid, unsigned mod, > unsigned timeout); > uint8_t * smp_query_via(void *buf, ib_portid_t *id, unsigned attrid, > unsigned mod, unsigned timeout, const void *srcport); > @@ -730,18 +724,18 @@ uint8_t * sa_call(void *rcvbuf, ib_portid_t *portid, ib_sa_call_t *sa, > unsigned timeout); > uint8_t * sa_rpc_call(const void *ibmad_port, void *rcvbuf, ib_portid_t *portid, > ib_sa_call_t *sa, unsigned timeout); > -int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, > +MAD_EXPORT int ib_path_query(ibmad_gid_t srcgid, ibmad_gid_t destgid, ib_portid_t *sm_id, > void *buf); /* returns lid */ > int ib_path_query_via(const void *srcport, ibmad_gid_t srcgid, > ibmad_gid_t destgid, ib_portid_t *sm_id, void *buf); > > /* resolve.c */ > -int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); > -int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, > +MAD_EXPORT int ib_resolve_smlid(ib_portid_t *sm_id, int timeout); > +MAD_EXPORT int ib_resolve_guid(ib_portid_t *portid, uint64_t *guid, > ib_portid_t *sm_id, int timeout); > -int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, > +MAD_EXPORT int ib_resolve_portid_str(ib_portid_t *portid, char *addr_str, > int dest_type, ib_portid_t *sm_id); > -int 
ib_resolve_self(ib_portid_t *portid, int *portnum, ibmad_gid_t *gid); > +MAD_EXPORT int ib_resolve_self(ib_portid_t *portid, int *portnum, ibmad_gid_t *gid); > > int ib_resolve_smlid_via(ib_portid_t *sm_id, int timeout, > const void *srcport); > @@ -755,19 +749,19 @@ int ib_resolve_self_via(ib_portid_t *portid, int *portnum, ibmad_gid_t > *gid, > const void *srcport); > > /* gs.c */ > -uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *perf_classportinfo_query(void *rcvbuf, ib_portid_t *dest, int port, > unsigned timeout); > -uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *port_performance_query(void *rcvbuf, ib_portid_t *dest, int port, > unsigned timeout); > -uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *port_performance_reset(void *rcvbuf, ib_portid_t *dest, int port, > unsigned mask, unsigned timeout); > -uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *port_performance_ext_query(void *rcvbuf, ib_portid_t *dest, int port, > unsigned timeout); > -uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *port_performance_ext_reset(void *rcvbuf, ib_portid_t *dest, int port, > unsigned mask, unsigned timeout); > -uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *port_samples_control_query(void *rcvbuf, ib_portid_t *dest, int port, > unsigned timeout); > -uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, int port, > +MAD_EXPORT uint8_t *port_samples_result_query(void *rcvbuf, ib_portid_t *dest, int port, > unsigned timeout); > > uint8_t *perf_classportinfo_query_via(void *rcvbuf, ib_portid_t *dest, int port, > @@ -785,7 +779,7 @@ uint8_t *port_samples_control_query_via(void *rcvbuf, ib_portid_t *dest, int por > uint8_t *port_samples_result_query_via(void 
*rcvbuf, ib_portid_t *dest, int port, > unsigned timeout, const void *srcport); > /* dump.c */ > -ib_mad_dump_fn > +MAD_EXPORT ib_mad_dump_fn > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > mad_dump_bitfield, mad_dump_array, mad_dump_string, > mad_dump_linkwidth, mad_dump_linkwidthsup, mad_dump_linkwidthen, > diff --git a/libibmad/include/infiniband/mad_osd.h b/libibmad/include/infiniband/mad_osd.h > new file mode 100644 > index 0000000..45741c5 > --- /dev/null > +++ b/libibmad/include/infiniband/mad_osd.h > @@ -0,0 +1,49 @@ > +/* > + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2009 Intel Corporation All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > +#ifndef _MAD_OSD_H_ > +#define _MAD_OSD_H_ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define MAD_EXPORT > + > +#endif /* _MAD_OSD_H_ */ > -- > 1.5.2.5 > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sean.hefty at intel.com Tue Jan 20 09:52:29 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 Jan 2009 09:52:29 -0800 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. In-Reply-To: <20090120174313.GC28955@sashak.voltaire.com> References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> Message-ID: <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com> >> [PATCH 1/3] libibmad: add os dependent definitions. >> [PATCH 2/3] field.c remove c99 definitions, better portability with WinOF. > >Is it possible to preserve c99 stuff? It improves code maintainability >a lot and as far as I remember it is what was considered in previous >discussions (about infiniband-diags porting to WinOF). It is not possible to keep this unless OFA changes their Windows build environment or process. - Sean From celine.bourde at ext.bull.net Tue Jan 20 09:47:45 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Tue, 20 Jan 2009 18:47:45 +0100 Subject: [ofa-general]SRP target Message-ID: <20090120174745.GA6366@frecb000697.frec.bull.fr> I've trouble running srp tools. I try to setup the following configuration. 
The first machine, ``the initiator'', has a 2-port ConnectX IB card. Each port
is connected to a second machine, ``the target'', which has a 2-port ConnectX
IB card too.

+-----------------+                     +---------------+
|                 |- - - - - - - - - - -|               |
|    initiator    |                     |    target     |
|                 |- - - - - - - - - - -|               |
+-----------------+                     +---------------+

The ibstat utility shows that both links are up.

My goal is to set up a failover connection. On the target machine, both ports
expose the same storage device.

On my "target machine", I see 2 targets:

# cat /sys/class/infiniband_srpt/srpt-mlx4_0/login_info
tid_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,pkey=ffff,dgid=fe800000000000000002c90300001d2d,service_id=0002c90300001d2c
tid_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,pkey=ffff,dgid=fe800000000000000002c90300001d2e,service_id=0002c90300001d2c

But when I launch ibsrpdm -c on the "initiator machine", the tool discovers
only one target, corresponding to the first port of my remote HCA card.

# ibsrpdm -c
id_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,dgid=fe800000000000000002c90300001d2d,pkey=ffff,service_id=0002c90300001d2c

# srp_daemon
IO Unit Info:
    port LID: 0002
    port GID: fe800000000000000002c90300001d2d
    change ID: 0100
    max controllers: 0x10

    controller[ 1]
        GUID: 0002c90300001d2c
        vendor ID: 000002
        device ID: 00634a
        IO class : 0100
        ID: Mellanox OFED SRP target
        service entries: 1
            service[ 0]: 0002c90300001d2c / SRP.T10:0002c90300001d2c

So I see only one of the targets; any hints why I don't see both targets?

Thanks,
Céline Bourde.

From eli at mellanox.co.il Tue Jan 20 10:04:06 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Tue, 20 Jan 2009 20:04:06 +0200
Subject: [ofa-general] [PATCH] mlx4_ib: Optimize hugetlab pages support
Message-ID: <20090120180406.GA9991@mtls03>

Since Linux does not merge adjacent pages into a single scatter entry through
calls to dma_map_sg(), we do this for huge pages which are guaranteed to be
comprised of adjacent natural pages of size PAGE_SIZE.
This will result in a significantly lower number of MTT segments used for
registering hugetlb memory regions.

Signed-off-by: Eli Cohen
---
I tried this patch and it improves memory scalability. Without it, increasing
the amount of memory used caused a decrease in throughput. With this patch
there was no drop at all - I used a test based on MPI with 2 processes.

drivers/infiniband/hw/mlx4/mr.c | 67 ++++++++++++++++++++++++++++++++++---- 1 files changed, 60 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c index 8e4d26d..641ea96 100644 --- a/drivers/infiniband/hw/mlx4/mr.c +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -119,6 +119,38 @@ out: return err; } +int mlx4_ib_umem_write_huge_mtt(struct mlx4_ib_dev *dev, struct mlx4_mtt *mtt, + struct ib_umem *umem, int nhpages) +{ + struct ib_umem_chunk *chunk; + int j, i, k; + dma_addr_t *arr; + int err; + + arr = kmalloc(nhpages * sizeof *arr, GFP_KERNEL); + if (!arr) + return -ENOMEM; + + i = 0; + k = 0; + list_for_each_entry(chunk, &umem->chunk_list, list) + for (j = 0; j < chunk->nmap; ++j, ++k) { + if (!(k & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1))) { + if (sg_dma_len(&chunk->page_list[j]) != PAGE_SIZE) { + err = -EAGAIN; + goto out; + } + arr[i++] = sg_dma_address(&chunk->page_list[j]); + } + } + + err = mlx4_write_mtt(dev->dev, mtt, 0, nhpages, arr); + +out: + kfree(arr); + return err; +} + struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata) @@ -128,6 +160,8 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, int shift; int err; int n; + int nhuge; + int shift_huge; mr = kmalloc(sizeof *mr, GFP_KERNEL); if (!mr) @@ -142,15 +176,34 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, n = ib_umem_page_count(mr->umem); shift = ilog2(mr->umem->page_size); - - err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, + if
(mr->umem->hugetlb) { + nhuge = ALIGN(n << shift, HPAGE_SIZE) >> HPAGE_SHIFT; + shift_huge = HPAGE_SHIFT; + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, + convert_access(access_flags), nhuge, shift_huge, &mr->mmr); + if (err) + goto err_umem; + + err = mlx4_ib_umem_write_huge_mtt(dev, &mr->mmr.mtt, mr->umem, nhuge); + if (err) { + if (err != -EAGAIN) + goto err_mr; + else { + mlx4_mr_free(to_mdev(pd->device)->dev, &mr->mmr); + goto regular_pages; + } + } + } else { +regular_pages: + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, convert_access(access_flags), n, shift, &mr->mmr); - if (err) - goto err_umem; + if (err) + goto err_umem; - err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); - if (err) - goto err_mr; + err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); + if (err) + goto err_mr; + } err = mlx4_mr_enable(dev->dev, &mr->mmr); if (err) -- 1.6.0.5 From celine.bourde at bull.net Tue Jan 20 09:43:14 2009 From: celine.bourde at bull.net (Celine Bourde) Date: Tue, 20 Jan 2009 18:43:14 +0100 Subject: [ofa-general]SRP target Message-ID: <20090120174314.GA6233@frecb000697.frec.bull.fr> I've trouble running srp tools. I try to setup the following configuration. The first machine,``the initiator'', has a 2-port Connectx IB card. Each port is connected to the a second machine, ``the target'', which has a 2-port Connectx IB card too. +-----------------+ +---------------+ | |- - - - - - - - - - -| | | initiator | | target | | |- - - - - - - - - - -| | +-----------------+ +---------------+ the ibstat utility shows that both link are up. My goal is to setup a failover connection. On the target machine, both ports expose the the same storage device. 
On my "target machine", I see 2 targets:

# cat /sys/class/infiniband_srpt/srpt-mlx4_0/login_info
tid_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,pkey=ffff,dgid=fe800000000000000002c90300001d2d,service_id=0002c90300001d2c
tid_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,pkey=ffff,dgid=fe800000000000000002c90300001d2e,service_id=0002c90300001d2c

But when I launch ibsrpdm -c on the "initiator machine", the tool discovers
only one target, corresponding to the first port of my remote HCA card.

# ibsrpdm -c
id_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,dgid=fe800000000000000002c90300001d2d,pkey=ffff,service_id=0002c90300001d2c

# srp_daemon
IO Unit Info:
    port LID: 0002
    port GID: fe800000000000000002c90300001d2d
    change ID: 0100
    max controllers: 0x10

    controller[ 1]
        GUID: 0002c90300001d2c
        vendor ID: 000002
        device ID: 00634a
        IO class : 0100
        ID: Mellanox OFED SRP target
        service entries: 1
            service[ 0]: 0002c90300001d2c / SRP.T10:0002c90300001d2c

So I see only one of the targets; any hints why I don't see both targets?

Thanks,
Céline Bourde.

From mschlining at ddn.com Tue Jan 20 10:20:03 2009
From: mschlining at ddn.com (Marty Schlining)
Date: Tue, 20 Jan 2009 10:20:03 -0800
Subject: [ofa-general]SRP target
In-Reply-To: <20090120174314.GA6233@frecb000697.frec.bull.fr>
References: <20090120174314.GA6233@frecb000697.frec.bull.fr>
Message-ID: <60BA2AA14940C9429038D4E2BC53008D16C772B8B1@MAILBOXCLUSTER.datadirect.datadirectnet.com>

Try running srp_daemon with the -n option. This adds a unique initiator_ext
identifier to each SRP target port. Also, opensm should be running on each
port of the initiator.
Example: For each active umad device:

    srp_daemon -o -e -n -d /dev/infiniband/$umad

Example script to reload srp drivers and add targets:

#!/bin/bash

declare -a UMAD
declare -a HCA_DEVICE
declare -a PORTS
declare -a SPEEDS
declare -a WIDTHS
declare -a STATE
declare -a PORT_GUID
let PORT_COUNT=0

#---------------------------------------------------------------
# Reload all important infiniband drivers
#---------------------------------------------------------------
echo
echo "Unloading ib_srp"
modprobe -r ib_srp
echo "Loading ib_srp"
modprobe ib_srp

#---------------------------------------------------------------
# Kill all instances of opensm
#---------------------------------------------------------------
echo "Stopping opensm"
killall opensm
sleep 2
echo "REALLY stopping opensm"
killall -9 opensm

#---------------------------------------------------------------
# Run opensm on each port (even if there is no link)
#---------------------------------------------------------------
echo
ibstat | grep "Port GUID" | awk '{print $3}' | \
while read a
do
    echo "Starting OpenSM on port $a"
    opensm -g $a > /dev/null 2> /dev/null &
done

#--------------------------------------------------------------
# Delay before opensm settles down
#--------------------------------------------------------------
echo
echo "Waiting 20 seconds for everything to stabilize"
sleep 20

#--------------------------------------------------------------
# Gather and print information about IB ports
#--------------------------------------------------------------
printf "\n";
echo "UMAD Dev:Port Speed STATE Port GUID"
echo "------------------------------------------------------"
INDEX=0
for a in /sys/class/infiniband_mad/umad*
do
    UMAD[$INDEX]=`basename $a`
    HCA_DEVICE[$INDEX]=`cat $a/ibdev`
    PORTS[$INDEX]=`cat $a/port`
    SPEEDS[$INDEX]=`cat /sys/class/infiniband/${HCA_DEVICE[$INDEX]}/ports/${PORTS[$INDEX]}/rate | awk '{print $1 $3 $4}'`
    STATE[$INDEX]=`cat /sys/class/infiniband/${HCA_DEVICE[$INDEX]}/ports/${PORTS[$INDEX]}/state | awk '{print $2}'`
    PORT_GUID[$INDEX]=`ibstat ${HCA_DEVICE[$INDEX]} ${PORTS[$INDEX]} | grep "Port GUID" | awk '{print $3}'`
    printf "%s %s:%d %-9s %-7s %s\n" ${UMAD[$INDEX]} \
        ${HCA_DEVICE[$INDEX]} \
        ${PORTS[$INDEX]} \
        ${SPEEDS[$INDEX]} \
        ${STATE[$INDEX]} \
        ${PORT_GUID[$INDEX]} ;
    let INDEX=INDEX+1
done
let PORT_COUNT=INDEX+1
echo

#---------------------------------------------------------------
# for each umad device, assign all srp targets detected on that
# device to its port. Assumes there will be an SRP target on
# each umad device that has an ACTIVE status. ibsrpdm hangs if
# no SRP targets are on the
# specified mad device.
#---------------------------------------------------------------
INDEX=0
for ((i=1; i < ${PORT_COUNT} ; i++))
do
    umad=${UMAD[$INDEX]}
    ibdev=${HCA_DEVICE[$INDEX]}
    port=${PORTS[$INDEX]}
    state=${STATE[$INDEX]}
    if [ "$state" = "ACTIVE" ]; then
        echo "Looking for devices on $ibdev port $port ($umad)"
        # This is the preferred method for adding SRP targets, Uses the contents
        # of the default config file /etc/srp_daemon.conf
        srp_daemon -o -e -n -d /dev/infiniband/$umad
        #--------------------------------------------------------------
        # Old (but still used) methods for adding SRP targets.
        #--------------------------------------------------------------
        # ibsrpdm -d /dev/infiniband/$umad -c | awk '{ORS="";print $1",max_sect=65535"}' > /sys/class/infiniband_srp/srp-$ibdev-$port/add_target
        # ibsrpdm -d /dev/infiniband/$umad -c | awk '{print $1",max_sect=4096"}'
        # ibsrpdm -d /dev/infiniband/$umad -c | awk '{ORS="";print $1",max_sect=65535"}' > /sys/class/infiniband_srp/srp-$ibdev-$port/add_target
    fi
    #--------------------------------------------------------------
    let INDEX=INDEX+1
done

#--------------------------------------------------------------
# List all LUN targets
#--------------------------------------------------------------
echo
echo "Target LUN listing"
lsscsi -gl
echo

#--------------------------------------------------------------
# List all SCSI hosts
#--------------------------------------------------------------
echo "Host listing"
lsscsi -Hl
echo

-----Original Message-----
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Celine Bourde
Sent: Tuesday, January 20, 2009 12:43 PM
To: general at lists.openfabrics.org; celine.bourde at bull.net
Subject: [ofa-general]SRP target

I've trouble running srp tools. I try to setup the following configuration.
The first machine,``the initiator'', has a 2-port Connectx IB card. Each port
is connected to the a second machine, ``the target'', which has a 2-port
Connectx IB card too.

+-----------------+                     +---------------+
|                 |- - - - - - - - - - -|               |
|    initiator    |                     |    target     |
|                 |- - - - - - - - - - -|               |
+-----------------+                     +---------------+

the ibstat utility shows that both link are up.

My goal is to setup a failover connection. On the target machine, both ports
expose the the same storage device.
on my "target machine", I see 2 targets : # cat /sys/class/infiniband_srpt/srpt-mlx4_0/login_info tid_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,pkey=ffff,dgid=fe800000000000000002c90300001d2d,service_id=0002c90300001d2c tid_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,pkey=ffff,dgid=fe800000000000000002c90300001d2e,service_id=0002c90300001d2c But when I launch ibsrpdm -c on "initiator machine", the tool discovers only one target corresponding to the first port of my distant HCA card. # ibsrpdm -c id_ext=0002c90300001d2c,ioc_guid=0002c90300001d2c,dgid=fe800000000000000002c90300001d2d,pkey=ffff,service_id=0002c90300001d2c # srp_daemon IO Unit Info: port LID: 0002 port GID: fe800000000000000002c90300001d2d change ID: 0100 max controllers: 0x10 controller[ 1] GUID: 0002c90300001d2c vendor ID: 000002 device ID: 00634a IO class : 0100 ID: Mellanox OFED SRP target service entries: 1 service[ 0]: 0002c90300001d2c / SRP.T10:0002c90300001d2c so i see only one of the targets; any hints why i don't see both targets? Thanks, Céline Bourde. _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Tue Jan 20 10:42:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 20 Jan 2009 20:42:53 +0200 Subject: [ofa-general] [PATCH] libibmad: cleanup mad.h include path In-Reply-To: <20090120174313.GC28955@sashak.voltaire.com> References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> Message-ID: <20090120184253.GD28955@sashak.voltaire.com> Cleanup mad.h include path. Also improve some automake INCLUDES and EXTRA_DIST definition - remove not needed stuff. 
Signed-off-by: Sasha Khapyorsky --- libibmad/Makefile.am | 4 ++-- libibmad/src/dump.c | 2 +- libibmad/src/fields.c | 2 +- libibmad/src/gs.c | 2 +- libibmad/src/mad.c | 2 +- libibmad/src/portid.c | 2 +- libibmad/src/register.c | 2 +- libibmad/src/resolve.c | 2 +- libibmad/src/rpc.c | 2 +- libibmad/src/sa.c | 2 +- libibmad/src/serv.c | 2 +- libibmad/src/smp.c | 2 +- libibmad/src/vendor.c | 2 +- 13 files changed, 14 insertions(+), 14 deletions(-) diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am index beae1a4..74a8d10 100644 --- a/libibmad/Makefile.am +++ b/libibmad/Makefile.am @@ -1,7 +1,7 @@ SUBDIRS = . -INCLUDES = -I$(srcdir)/include/infiniband -I$(includedir) +INCLUDES = -I$(srcdir)/include -I$(includedir) lib_LTLIBRARIES = libibmad.la @@ -25,7 +25,7 @@ libibmadincludedir = $(includedir)/infiniband libibmadinclude_HEADERS = $(srcdir)/include/infiniband/mad.h -EXTRA_DIST = $(srcdir)/include/infiniband/mad.h libibmad.spec.in libibmad.spec \ +EXTRA_DIST = libibmad.spec.in libibmad.spec \ $(srcdir)/src/libibmad.map libibmad.ver autogen.sh dist-hook: diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c index 38a2254..1b94104 100644 --- a/libibmad/src/dump.c +++ b/libibmad/src/dump.c @@ -43,7 +43,7 @@ #include #include -#include +#include void mad_dump_int(char *buf, int bufsz, void *val, int valsz) diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 5cebd01..37a2b06 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -40,7 +40,7 @@ #include #include -#include +#include /* * BITSOFFS and BE_OFFS are required due the fact that the bit offsets are inconsistently diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index 9a89af2..629be3e 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -41,7 +41,7 @@ #include #include -#include "mad.h" +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c index be27c09..442343e 100644 --- a/libibmad/src/mad.c +++ b/libibmad/src/mad.c @@ -42,7 
+42,7 @@ #include #include -#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c index 61e6be0..76611fb 100644 --- a/libibmad/src/portid.c +++ b/libibmad/src/portid.c @@ -42,7 +42,7 @@ #include #include -#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/register.c b/libibmad/src/register.c index 045f840..bddab6a 100644 --- a/libibmad/src/register.c +++ b/libibmad/src/register.c @@ -42,7 +42,7 @@ #include #include -#include "mad.h" +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 906b28d..c641334 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -41,7 +41,7 @@ #include #include -#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index 670a936..535e3d0 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -42,7 +42,7 @@ #include #include -#include "mad.h" +#include #define MAX_CLASS 256 diff --git a/libibmad/src/sa.c b/libibmad/src/sa.c index c601254..baa6da7 100644 --- a/libibmad/src/sa.c +++ b/libibmad/src/sa.c @@ -40,7 +40,7 @@ #include #include -#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/serv.c b/libibmad/src/serv.c index b329352..6865b81 100644 --- a/libibmad/src/serv.c +++ b/libibmad/src/serv.c @@ -42,7 +42,7 @@ #include #include -#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/smp.c b/libibmad/src/smp.c index ad6b066..b3314fb 100644 --- a/libibmad/src/smp.c +++ b/libibmad/src/smp.c @@ -40,7 +40,7 @@ #include #include -#include +#include #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/vendor.c b/libibmad/src/vendor.c index eb703f6..eb6eaf5 100644 --- a/libibmad/src/vendor.c +++ b/libibmad/src/vendor.c @@ -40,7 +40,7 @@ #include #include -#include +#include #undef DEBUG 
#define DEBUG if (ibdebug) IBWARN
--
1.6.0.4.766.g6fc4a

From sashak at voltaire.com Tue Jan 20 11:01:11 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 20 Jan 2009 21:01:11 +0200
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To: <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com>
Message-ID: <20090120190111.GE28955@sashak.voltaire.com>

On 09:52 Tue 20 Jan , Sean Hefty wrote:
> >> [PATCH 1/3] libibmad: add os dependent definitions.
> >> [PATCH 2/3] field.c remove c99 definitions, better portability with WinOF.
> >
> >Is it possible to preserve c99 stuff? It improves code maintainability
> >a lot and as far as I remember it is what was considered in previous
> >discussions (about infiniband-diags porting to WinOF).
>
> It is not possible to keep this unless OFA changes their Windows build
> environment or process.

What about to use c99_to_whatever preprocessor
(sed -e 's/\[.*\] *= *//')?

Sasha

From weiny2 at llnl.gov Tue Jan 20 10:59:47 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 20 Jan 2009 10:59:47 -0800
Subject: [ofa-general] Question about perfmgr on OpenSM
In-Reply-To: <4970818E.8070602@ext.bull.net>
References: <4970818E.8070602@ext.bull.net>
Message-ID: <20090120105947.25b66481.weiny2@llnl.gov>

On Fri, 16 Jan 2009 13:46:06 +0100 Nicolas Morey Chaisemartin wrote:

> Hi,
>
> I'm working on a perf manager plugin for OpenSM.
> It will have quite advanced features which will do more than simply
> treating received events and extract some informations from OpenSM.
> I have one question though: how are perf manager plugins managed when
> there are multiple OpenSM running on the same subnet?
> -Is only the one on the MASTER SM started?
> -Are they all started but only the one on the MASTER SM received events
> (counters/trap)
> -They all start and received events but coming from different par of the
> subnet
>

I am not quite sure what you are asking. Do you have multiple OpenSMs running
on the same subnet, perhaps a master and a standby?

The plugins are managed via the OpenSM process. Each OpenSM which is started
will load the plugins specified in the opensm.conf file. If those plugins are
the same and would conflict, due to a common database for example, you will
have to synchronize them yourself. Each SM will report events to all of its
plugins.

If you are using the new plugin interface, I think the easiest implementation
might be to have your plugin query the SM and, if it is in standby, ignore the
events being reported. If the standby becomes master, then the plugin could
"turn itself on". Is this what you are trying to do?

>
> Due to the data I'm going to use, I need to have only one thread doing
> its job at a time.
> If there is only one plugin running on the Subnet it'll be fine, if
> there are several, I'll have to monitor the SM status to ensure only one
> of my threads is active.

I think this is what you will have to do. The SM does not communicate the
plugin information to the other SMs on the subnet.

Hope this helps,
Ira

From arlin.r.davis at intel.com Tue Jan 20 11:02:55 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Tue, 20 Jan 2009 11:02:55 -0800
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To: <20090120174313.GC28955@sashak.voltaire.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com>
Message-ID:

>The patches (I tried to check only 1 and 2) are malformed - whitespaces
>are mangled (in 2), also long lines are broken. Please check that it is
>appliable with 'git am'.

Sorry, I was just upgraded to Exchange Server 2007. Need I say more.
Patches attached as separate files.

>Is it possible to preserve c99 stuff? It improves code maintainability
>a lot and as far as I remember it is what was considered in previous
>discussions (about infiniband-diags porting to WinOF).

Like Sean said, until WinOF changes the build environment we have no choice.
The majority of the changes went into field.c, and given that the structure
includes a character field with the mad field name, I would think
maintainability is preserved. I added your recent PortXmitWait and
CounterSelect2 changes with no problem.

>
>Something not related to porting...
>
>I would rather replace all '#include ' occurrences to
>'#include ' and then use '-I$(srcdir)/include' in
>INCLUDES definition.
>

No problem.

>And if file is listed in library_HEADERS it will be distributed, so no
>need to list it in EXTRA_DIST (of course mad.h should be remove too).
>

Ok.

>> -#include
>> -#include
>> -#include
>> -#include
>> -#include
>> -#include
>> -#include
>> +#include
>
>Why should we remove all header files here? Some of them (such as
>stdio.h) are not really system dependent.

We can keep these in mad.h. Just thinking beyond windows/linux.

-arlin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Minor-changes-to-source-to-allow-portability-to-WinO.patch
Type: application/octet-stream
Size: 9465 bytes
Desc: 0003-Minor-changes-to-source-to-allow-portability-to-WinO.patch
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-libibmad-add-os-dependent-definitions.patch
Type: application/octet-stream
Size: 14485 bytes
Desc: 0001-libibmad-add-os-dependent-definitions.patch
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-remove-c99-definitions-within-the-ib_mad_f-structure.patch Type: application/octet-stream Size: 27448 bytes Desc: 0002-remove-c99-definitions-within-the-ib_mad_f-structure.patch URL: From stan.smith at intel.com Tue Jan 20 11:04:59 2009 From: stan.smith at intel.com (Smith, Stan) Date: Tue, 20 Jan 2009 11:04:59 -0800 Subject: [ofw] Re: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. In-Reply-To: <20090120190111.GE28955@sashak.voltaire.com> References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com> <20090120190111.GE28955@sashak.voltaire.com> Message-ID: <3F6F638B8D880340AB536D29CD4C1E192FA9E22F@orsmsx501.amr.corp.intel.com> Sasha Khapyorsky wrote: > On 09:52 Tue 20 Jan , Sean Hefty wrote: >>>> [PATCH 1/3] libibmad: add os dependent definitions. >>>> [PATCH 2/3] field.c remove c99 definitions, better portability >>>> with WinOF. >>> >>> Is it possible to preserve c99 stuff? It improves code >>> maintainability a lot and as far as I remember it is what was >>> considered in previous discussions (about infiniband-diags porting >>> to WinOF). >> >> It is not possible to keep this unless OFA changes their Windows >> build environment or process. > > What about to use c99_to_whatever preprocessor > (sed -e 's/\[.*\] *= *//')? Why are '#ifdef _WIN32' such, as you stated, crap? Portability has a price. Cosmetically they are not pretty, although useful if used sparingly. What about placing the offending definitions in mad_osd.h as you had proposed? > > Sasha > _______________________________________________ > ofw mailing list > ofw at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw From sean.hefty at intel.com Tue Jan 20 11:12:55 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 20 Jan 2009 11:12:55 -0800 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. 
In-Reply-To: <20090120190111.GE28955@sashak.voltaire.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com> <20090120190111.GE28955@sashak.voltaire.com>
Message-ID: <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com>

>What about to use c99_to_whatever preprocessor
>(sed -e 's/\[.*\] *= *//')?

I don't think this helps. Unless the definitions are in the correct location,
the Windows build is still wrong. It falsely implies that the ordering doesn't
matter.

- Sean

From sashak at voltaire.com Tue Jan 20 11:24:25 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 20 Jan 2009 21:24:25 +0200
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To: <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com> <20090120190111.GE28955@sashak.voltaire.com> <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com>
Message-ID: <20090120192413.GA2037@sashak.voltaire.com>

On 11:12 Tue 20 Jan , Sean Hefty wrote:
> >What about to use c99_to_whatever preprocessor
> >(sed -e 's/\[.*\] *= *//')?
>
> I don't think this helps. Unless the definitions are in the correct location,

Sure. And obviously I meant a more sophisticated preprocessor ([] constants
should be evaluated and the array sorted properly, etc).

Sasha

> the Windows build is still wrong. It falsely implies that the ordering doesn't
> matter.
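
[Editor's illustration] The ordering concern in this exchange can be sketched with a small table. This is only a minimal sketch, assuming a C99 compiler; the enum and array names are hypothetical and are not taken from libibmad's field.c. With designated initializers the entries may be written in any order, but once the "[...] =" prefixes are stripped (as the sed one-liner would do) the same entries become positional and land in the wrong slots unless they are also re-sorted.

```c
#include <string.h>

/* Hypothetical field indices, for illustration only. */
enum field_id { FIELD_LID, FIELD_SMLID, FIELD_STATE, FIELD_COUNT };

/* C99 designated initializers: each entry names its slot, so the
 * order in which the entries are written does not matter. */
static const char *const names_c99[FIELD_COUNT] = {
	[FIELD_STATE] = "LinkState",
	[FIELD_LID]   = "Lid",
	[FIELD_SMLID] = "SMLid",
};

/* The same entries with the "[...] =" prefixes stripped: the
 * initializer is now positional, so "LinkState" ends up in slot 0. */
static const char *const names_stripped[FIELD_COUNT] = {
	"LinkState",
	"Lid",
	"SMLid",
};
```

Looking up FIELD_LID in the first table gives "Lid"; in the stripped table it gives "LinkState", which is the silent mismatch being discussed.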
> > - Sean > From devel at morey-chaisemartin.com Tue Jan 20 12:57:18 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Tue, 20 Jan 2009 21:57:18 +0100 Subject: [ofa-general] Question about perfmgr on OpenSM In-Reply-To: <20090120105947.25b66481.weiny2@llnl.gov> References: <4970818E.8070602@ext.bull.net> <20090120105947.25b66481.weiny2@llnl.gov> Message-ID: <49763AAE.1010605@morey-chaisemartin.com> Ira Weiny a écrit : > On Fri, 16 Jan 2009 13:46:06 +0100 > Nicolas Morey Chaisemartin wrote: > > >> Hi, >> >> I'm working on a perf manager plugin for OpenSM. >> It will have quite advanced features which will do more than simply >> treating received events and extract some informations from OpenSM. >> I have one question though: how are perf manager plugins managed when >> there are multiple OpenSM running on the same subnet? >> -Is only the one on the MASTER SM started? >> -Are they all started but only the one on the MASTER SM received events >> (counters/trap) >> -They all start and received events but coming from different par of the >> subnet >> >> > > I am not quite sure what you are asking. Do you have multiple OpenSM's running > on the same subnet perhaps a master and a standby? > > Yes. For HA reasons, it is necessary for us to have a couple of OpenSM running on the subnet. > The plugins are managed via the OpenSM process. Each OpenSM which is started > will load the plugins specified in the opensm.conf file. If those plugins are > the same and would conflict, due to a common database for example, you will > have to synchronize them yourself. Each SM will report events to all > of it's plugins. > That's what I expected from reading the code. So even in STANDBY mode, openSM will report all the counters (data, errors,...) ? Or only traps? > If you are using the new plugin interface I think the easiest implimentation > might be to have your plugin query the SM and if it is in standby ignore the > events being reported. 
If the standby becomes master then the plugin could
> "turn itself on".
>
> Is this what you are trying to do?
>
>

The new plugin interface?
My plugin is based on libopensmskummeeplugin (though only tiny pieces of the
original code remain). However I'm still providing the same functions, and it
has the same overall behaviour.
But what I did is basically a state machine where the openSM state triggers
state changes in my plugin, with some synchronization in the DB to ensure the
former active plugin has stopped working.
However I only use this on one part of my plugin (which is quite independent
from the perf manager and reads data in opensm structs). For the perf manager
part, I wasn't expecting openSM to report the perf counters when not in master
mode (I didn't see anything like this in the original code from libskummee so
I didn't quite think about it before now). I'll run some tests tomorrow.

>> Due to the data I'm going to use, I need to have only one thread doing
>> its job at a time.
>> If there is only one plugin running on the Subnet it'll be fine, if
>> there are several, I'll have to monitor the SM status to ensure only one
>> of my threads is active.
>>
>
> I think this is what you will have to do. The SM does not communicate the
> plugin information to the other SM's on the Subnet.
>
> Hope this helps,
> Ira
>

A lot. Thanks for the info!

Best Regards

Nicolas

From PHF at zurich.ibm.com Wed Jan 21 00:27:28 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Wed, 21 Jan 2009 09:27:28 +0100
Subject: [ofa-general] QP needs a lot of memory
In-Reply-To: <4974C986.1080502@opengridcomputing.com>
References: <000001c97790$acad6310$82cd180a@amr.corp.intel.com> <4974C986.1080502@opengridcomputing.com>
Message-ID:

> >> I am running OFED 1.4 on a Chelsio T3 RNIC.
> >> When I was trying to connect a large number of clients (several hundred),
> >> I noticed that the server was running out of memory.
One instance of a
> >> whenever I do an rdma_create_qp(), I lose about 7MB of main memory.
> >> This severely limits the scalability of my application.
> >>
> >> Is there a reason for that?
> >>
> >
> > The amount of memory needed per QP is based on the send and receive queue sizes,
> > plus the number of SGEs. I don't know specific details about the Chelsio
> > adapter itself to know if 7MB is high or not.
> >
> >
> That seems very high. A T3 max depth QP should only use around 256KB of
> dma coherent memory. The max depth CQ should be around 256KB too. So
> something whacked if its consuming 7MB per QP...
>
> How are you measuring this memory usage?

I measured it using the vmstat command. The 7MB is only a rough estimate.

Since I was not aware that the size of a QP object depends on the max QP init
attrs, I assigned the highest possible values for each (16384 send WRs,
1023 recv WRs, 4 SGEs for send and recv each).

When reducing these numbers to what is actually needed (2 send WRs, 2 recv WRs
and 1 SGE each), the QP size shrank significantly and I was able to create
many more QPs.

Thanks,
Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From sashak at voltaire.com Wed Jan 21 02:33:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 21 Jan 2009 12:33:59 +0200
Subject: ***SPAM*** Re: [ofa-general] [PATCH V2 1/3] Create a new library libibnetdisc
In-Reply-To: <20081223164141.241dd3f0.weiny2@llnl.gov>
References: <20081211162031.0c591f54.weiny2@llnl.gov> <1230056943.23747.21.camel@auk31.llnl.gov> <20081223184331.GL31213@obsidianresearch.com> <20081223164141.241dd3f0.weiny2@llnl.gov>
Message-ID: <20090121103359.GA3479@sashak.voltaire.com>

On 16:41 Tue 23 Dec , Ira Weiny wrote:
>
> +#define IBND_ERROR(...) \
> +    { \
> +        fprintf(stderr, "%s:%d; ", __FILE__, __LINE__); \
> +        fprintf(stderr, __VA_ARGS__); \
> +    }

As far as I know, a macro like this (using '##' for var args and without
breaking this in two) will work fine with both gcc and VC:

#define IBND_ERROR(fmt, ...) \
    fprintf(stderr, "%s:%d: " fmt, __FILE__, __LINE__, ## __VA_ARGS__)

Sasha

From vlad at lists.openfabrics.org Wed Jan 21 03:10:12 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 21 Jan 2009 03:10:12 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090121-0200 daily build status
Message-ID: <20090121111012.D650DE61131@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on i686 with linux-2.6.16 Build failed on i686 with linux-2.6.18 Build failed on i686 with linux-2.6.17 From sashak at voltaire.com Wed Jan 21 04:07:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jan 2009 14:07:27 +0200 Subject: [ofa-general] Re: [PATCH V2 1/3] Create a new library libibnetdisc In-Reply-To: <20081223113449.7d6c629b.weiny2@llnl.gov> References: <20081211162031.0c591f54.weiny2@llnl.gov> <20081221152100.GN25208@sashak.voltaire.com> <20081223113449.7d6c629b.weiny2@llnl.gov> Message-ID: <20090121120727.GB3479@sashak.voltaire.com> Hi Ira, On 11:34 Tue 23 Dec , Ira Weiny wrote: > > > > What is the reason to redeclear custom NodeInfo and PortInfo structures? > > The original are defined by IBA and there are lot of utilities to work > > with them. Wouldn't it be better to use it as is? > > > > [DISCLAIMER] First I want to answer your questions directly. However, after > writing this information, discussing with Al, and thinking it > over. I think I see where you are coming from and I _may_ agree > with you. So after reading these responses please read my > thoughts regarding libibmad and this new lib. > > > I have 3 reasons I did it this way: > > 1) This is pretty much the way that ibnetdiscover did things > (By using mad_decode_field into these single fields) > > 2) This makes libibnetdisc only dependent on libibmad rather than the OpenSM > libs. 
We have had some people complain that the diags require opensm-*
> things to be installed. (This assumes you want to use ib_port_info_t from
> ib_types.h)

I thought about using libibmad rather than OpenSM stuff (in libibnetdisc),
so PortInfo, NodeInfo and other SM attributes will be represented just as
raw data, and libibmad (or any other MAD interpreter) will be used for
decoding by the diag/whatever tool which uses libibnetdisc.

> 3) This structure is in host byte order and calls out each field
> independently rather than having to have intimate knowledge of the
> PortInfo wire packet.
>
> For example this is the code used in iblinkinfo with the above structure.
>
>
> snprintf(link_str, 256,
>     "(%3s %s %6s/%8s)",
>     ibnd_linkwidth_str(port->info.link_width_active),
>     ibnd_linkspeed_str(port->info.link_speed_active, 1),
>     ibnd_linkstate_str(port->info.link_state),
>     ibnd_physstate_str(port->info.phys_state)
>     );
>
> Here is the code if you use the ib_port_info_t from ib_types.h
>
> snprintf(link_str, 256,
>     "(%3s %s %6s/%8s)",
>     ibnd_linkwidth_str(port->info.link_width_active),
>     ibnd_linkspeed_str(IB_PORT_LINK_SPEED_ACTIVE_MASK(port->info.link_speed), 1),
>     ibnd_linkstate_str(IB_PORT_STATE_MASK(port->info.state_info1)),
>     ibnd_physstate_str(IB_PORT_PHYS_STATE_MASK(port->info.state_info2) >> IB_PORT_PHYS_STATE_SHIFT)
>     // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     // This is particularly nasty compared to the above.
>     );
>
> I no longer agree with reason 1 and 2.

Ok.

> However, reason 3 I believe is enough
> justification to declare a new type.
>
> [DISCLAIMER] item 3 might be a moot point as well if you redefine what
> libibnetdisc is supposed to be. See below.

Ok. See below.
> > > + * Str convert functions > > > + */ > > > +char *ibnd_linkwidth_str(int link_width); > > > +char *ibnd_linkstate_str(int link_state); > > > +char *ibnd_physstate_str(int phys_state); > > > +const char *ibnd_node_type_str(ibnd_node_t *node); > > > +const char *ibnd_node_type_str_short(ibnd_node_t *node); > > > +char *ibnd_linkspeed_str(int link_speed, int data_rate); > > > + /* if data_rate == 0 use "SDR", "DDR", etc. */ > > > + /* if data_rate == 1 use "2.5 Gbps", "5.0 Gbps", etc. */ > > > > Similar functions exist in libibmad. Why do we need another set? > > 2 reasons. > > 1) The strings returned are not compatible with the current output of > ibnetdiscover and iblinkinfo... I was trying to make sure that the > library returned string which were backwards compatible. That is > actually the reason for the extra "data_rate" parameter of linkspeed. > iblinkinfo and ibnetdiscover print this differently. :-( > > 2) But more importantly this is an ease of use issue. > > This: > > snprintf(link_str, 256, > "(%3s %s %6s/%8s)", > ibnd_linkwidth_str(port->info.link_width_active), > ibnd_linkspeed_str(port->info.link_speed_active, 1), > ibnd_linkstate_str(port->info.link_state), > ibnd_physstate_str(port->info.phys_state) > ); > > Becomes this: > > char buf[256]; > ... > snprintf(link_str, 256, > "(%3s %s %6s/%8s)", > mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, buf, 256, &port->info.link_width_active); > mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, buf, 256, &port->info.link_speed_active); > // ^^^^^^^^^^^^^^^^^ > // Not backwards compatible with the current ibnetdiscover > // as it prints the data as "2.5 Gbps" rather than "SDR" > mad_dump_val(IB_PORT_STATE_F, buf, 256, &port->info.link_state); > mad_dump_val(IB_PORT_PHYS_STATE_F, buf, 256, > &port->info.phys_state);); > > Users don't need to go look up in mad.h for the field enum to print > something they already have; "link_width_active". 
>
>
> Anyway, I think I am starting to see the difference in what we are thinking...
>
> The ibnd_*_str functions and the ibnd_port_info_t were designed based on
> libibnetdisc being a "one stop shop" for this data. I envisioned this library
> being a wrapper around lower level libraries which would abstract away some
> details, something like this.
>
>   +----------+  +----------+
>   |  diag1   |  |  diag2   |
>   +----------+  +----------+
>        |             |
>     +-----------------+
>     |  libibnetdisc   |
>     +-----------------+
>              |
>     +-----------------+
>     |    libibmad     |
>     +-----------------+
>
> I think what you had in mind was something like:
>
>                      +--------+
>                     -| diag 1 |-
>                    / +--------+ \
> +-----------------+  +--------+  +-----------------+
> |  libibnetdisc   | -| diag 2 |--|    libibmad     |
> +-----------------+  +--------+  +-----------------+
>             \                          /
>              --------------------------
>
> In this case users of libibnetdisc might get back something like:
>
> typedef struct port {
>     uint64_t guid;
>     int portnum;
>     int ext_portnum; /* optional if != 0 external port num */
>     ibnd_node_t *node; /* node this port belongs to */
>     struct port *remoteport; /* null if SMA, or does not exist */
>     void *port_info; /* or uint8_t port_info[port_info_size] */
> } ibnd_port_t;
>
> and decode port_info like this:
>
> uint32_t lid = mad_get_field(port->port_info, 0, IB_PORT_LID_F);
> mad_dump_val(IB_PORT_LID_F, port->port_info, &lid);
>
> Is that what you are thinking?

Yes.

> If this is the case I don't think I object. I
> think it makes the end user of libibnetdisc work harder but it does offer some
> advantages, namely less redefinition and a bit more flexibility.

More flexibility is even more important IMO.

> That said, I would like to clean up the mad interface at the same time.

Which is a good thing by itself :). I absolutely agree that we can improve
our helpers.

> Just
> figuring out the examples to write in this email have taken a lot of time. I
> don't think this is a good thing.
>
> Here are some examples:
>
> add something like:
>
> static inline char *
> mad_snprint_field(uint8_t *buf, int base_offs, int field,
>     char *print_buf, int print_buf_size)
>
> Therefore the above could be used in a print statement like:
>
> char tmp[256];
> printf("lid %s\n", mad_snprint_field(port->port_info, 0, IB_PORT_LID_F, tmp,
>     256));
>
> [Although lid is a bad example since it could be done with "%d"... But you
> get what I mean.]

Agree.

>
> And along those lines the difference between mad_dump_field and mad_dump_val
> needs to be made more clear. They have the same signature but one has a lot of
> formatting added to it which I don't think is appropriate at this level.
>
> "LinkState:.......................Active"
> vs.
> "Active"

Why not? I think they were designed to be just different levels. Maybe you
mean the naming is unclear?

> Also, I don't think that the following declarations need to be public.
>
> /* dump.c */
> ib_mad_dump_fn
> mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex,
> mad_dump_bitfield, mad_dump_array, mad_dump_string,
> mad_dump_linkwidth, mad_dump_linkwidthsup, mad_dump_linkwidthen,
> mad_dump_linkdowndefstate,
> mad_dump_linkspeed, mad_dump_linkspeedsup, mad_dump_linkspeeden,
> mad_dump_portstate, mad_dump_portstates,
> mad_dump_physportstate, mad_dump_portcapmask,
> mad_dump_mtu, mad_dump_vlcap, mad_dump_opervls,
> mad_dump_node_type,
> mad_dump_sltovl, mad_dump_vlarbitration,
> mad_dump_nodedesc, mad_dump_nodeinfo, mad_dump_portinfo, mad_dump_switchinfo,
> mad_dump_perfcounters, mad_dump_perfcounters_ext;

Again, why not? Those helpers are widely used now in infiniband-diags.

>
> int _mad_dump(ib_mad_dump_fn *fn, char *name, void *val, int valsz);
> char * _mad_dump_field(ib_field_t *f, char *name, char *buf, int bufsz,
>     void *val);
> int _mad_print_field(ib_field_t *f, char *name, void *val, int valsz);
> char * _mad_dump_val(ib_field_t *f, char *buf, int bufsz, void *val);
>
> They confuse the ibmad layer.
I think there are two (at least) levels - lower is for more flexibility and higher is for easy to use. Of course I agree that some improvements could be very useful - for example doing 'ib_mad_dump_fn' to return number of printed characters (similar to proposed mad_snprint_field()). > If this is what you would like I will rework the library. If we are agreed. > Perhaps starting to > clean up libibmad along the way? In parallel... :) Sasha From sashak at voltaire.com Wed Jan 21 06:47:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jan 2009 16:47:37 +0200 Subject: [ofa-general] Re: [PATCH 0/3 - no ibcommon] Resubmit libibnetdiscover patches without libibcommon dependancy In-Reply-To: <20090109154742.07016fa3.weiny2@llnl.gov> References: <20090109154742.07016fa3.weiny2@llnl.gov> Message-ID: <20090121144737.GC3479@sashak.voltaire.com> Hi Ira, On 15:47 Fri 09 Jan , Ira Weiny wrote: > I wanted to ping you about the status of these patches but then realized that > they would have to be regenerated without libibcommon. Sorry about long delay. I answered your original email. Sasha From sashak at voltaire.com Wed Jan 21 06:49:23 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jan 2009 16:49:23 +0200 Subject: [ofa-general] ***SPAM*** Re: [PATCH 1/3 - no ibcommon] Create a new library libibnetdisc In-Reply-To: <20090109154749.4b19c8bf.weiny2@llnl.gov> References: <20090109154749.4b19c8bf.weiny2@llnl.gov> Message-ID: <20090121144923.GD3479@sashak.voltaire.com> On 15:47 Fri 09 Jan , Ira Weiny wrote: > > Signed-off-by: weiny2 at llnl.gov Some comments... 
> diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
> index c22ba5e..8e8c3c1 100644
> --- a/infiniband-diags/Makefile.am
> +++ b/infiniband-diags/Makefile.am
> @@ -1,3 +1,4 @@
> +SUBDIRS = libibnetdisc
>
> INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband
>
> diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in
> index 58eea0a..c5b437d 100644
> --- a/infiniband-diags/configure.in
> +++ b/infiniband-diags/configure.in
> @@ -48,7 +48,7 @@ fi
>
> dnl Checks for header files.
> AC_HEADER_STDC
> -AC_CHECK_HEADERS([stdlib.h string.h unistd.h fcntl.h inttypes.h netinet/in.h sys/ioctl.h syslog.h])
> +AC_CHECK_HEADERS([stdlib.h string.h unistd.h fcntl.h inttypes.h netinet/in.h sys/ioctl.h])
> if test "$disable_libcheck" != "yes"
> then
> AC_CHECK_HEADER(infiniband/umad.h, [],
> @@ -70,7 +70,7 @@ AC_C_CONST
> dnl Check if we should include test utilities
> AC_MSG_CHECKING(for --enable-test-utils)
> AC_ARG_ENABLE(test-utils,
> -[ --enable-test-utils build additional test utilities],
> +[ --enable-test-utils build additional test utilities (default=no)],
> [case "${enableval}" in
> yes) tutils=yes ;;
> no) tutils=no ;;
> @@ -140,6 +140,23 @@ IBSCRIPTPATH_TMP2="`echo $IBSCRIPTPATH_TMP1 | sed 's/^NONE/$ac_default_prefix/'`
> IBSCRIPTPATH="`eval echo $IBSCRIPTPATH_TMP2`"
> AC_SUBST(IBSCRIPTPATH)
>
> +dnl Begin libibnetdisc stuff
> +AC_CHECK_HEADERS([stdint.h])

BTW (not related to the patch), in the current grouping.c code I see

#include #include

inttypes.h by itself includes stdint.h and in addition declares the related
printf formats such as PRIx64, etc. So it is not really necessary to include
both.
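
[Editor's illustration] The remark about inttypes.h can be checked with a few lines. This is a minimal sketch, assuming a C99 toolchain; the helper name format_guid is hypothetical and is not part of infiniband-diags. Including only inttypes.h is enough to get both uint64_t and the PRIx64 format macro.

```c
#include <inttypes.h>	/* C99: also provides everything in <stdint.h> */
#include <stdio.h>

/* Hypothetical helper: format a 64-bit GUID as 0x-prefixed hex.
 * Returns what snprintf returns, i.e. the length of the full string. */
static int format_guid(char *buf, size_t size, uint64_t guid)
{
	return snprintf(buf, size, "0x%016" PRIx64, guid);
}
```

PRIx64 expands to the right length modifier for the platform, so the same format string works whether uint64_t is unsigned long or unsigned long long.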
> +AC_CHECK_FUNCS([strtoull]) > + > +ibnetdisc_api_version=`grep LIBVERSION $srcdir/libibnetdisc/libibnetdisc.ver | sed 's/LIBVERSION=//'` > +if test -z $ibnetdisc_api_version; then > + echo "FAILED to find $srcdir/libibnetdisc/libibnetdisc.ver" > + exit 1 > +fi > +AC_SUBST(ibnetdisc_api_version) > +AC_DEFINE_UNQUOTED(API_VERSION, > + ["$ibnetdisc_api_version"], > + [The API version of this library]) Where do you plan to use it (API_VERSION)? > + > +dnl End libibnetdisc stuff > + > + > AC_CONFIG_FILES([\ > Makefile \ > infiniband-diags.spec \ > @@ -160,6 +177,7 @@ AC_CONFIG_FILES([\ > scripts/ibhosts \ > scripts/ibnodes \ > scripts/ibswitches \ > - scripts/ibrouters > + scripts/ibrouters \ > + libibnetdisc/Makefile > ]) > AC_OUTPUT > diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am > new file mode 100644 > index 0000000..8e0e16b > --- /dev/null > +++ b/infiniband-diags/libibnetdisc/Makefile.am > @@ -0,0 +1,62 @@ > + > +#SUBDIRS = . > + > +INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband > + > +lib_LTLIBRARIES = libibnetdisc.la > +sbin_PROGRAMS = > + > +if ENABLE_TEST_UTILS > +sbin_PROGRAMS += test/ibnetdisctest \ > + test/iblinkinfotest \ > + test/testleaks > +endif > + > +DBGFLAGS = -g > + > +if HAVE_LD_VERSION_SCRIPT > +libibnetdisc_version_script = -Wl,--version-script=$(srcdir)/src/libibnetdisc.map > +else > +libibnetdisc_version_script = > +endif > + > +libibnetdisc_la_SOURCES = src/ibnetdisc.c src/chassis.c src/chassis.h > +libibnetdisc_la_CFLAGS = -Wall $(DBGFLAGS) > +libibnetdisc_la_LDFLAGS = -version-info $(ibnetdisc_api_version) \ > + -export-dynamic $(libibnetdisc_version_script) \ > + -losmcomp -libmad > +libibnetdisc_la_DEPENDENCIES = $(srcdir)/src/libibnetdisc.map > + > +libibnetdiscincludedir = $(includedir)/infiniband > + > +test_ibnetdisctest_SOURCES = test/ibnetdisctest.c > +test_ibnetdisctest_CFLAGS = -Wall $(DBGFLAGS) > +test_ibnetdisctest_LDFLAGS = -libnetdisc > + > 
+test_iblinkinfotest_SOURCES = test/iblinkinfotest.c > +test_iblinkinfotest_CFLAGS = -Wall $(DBGFLAGS) > +test_iblinkinfotest_LDFLAGS = -libnetdisc > + > +test_testleaks_SOURCES = test/testleaks.c > +test_testleaks_CFLAGS = -Wall $(DBGFLAGS) > +test_testleaks_LDFLAGS = -libnetdisc > + > +libibnetdiscinclude_HEADERS = $(srcdir)/include/infiniband/ibnetdisc.h > + > +man_MANS = man/ibnd_debug.3 \ > + man/ibnd_destroy_fabric.3 \ > + man/ibnd_discover_fabric.3 \ > + man/ibnd_find_node_dr.3 \ > + man/ibnd_find_node_guid.3 \ > + man/ibnd_iter_nodes.3 \ > + man/ibnd_iter_nodes_type.3 \ > + man/ibnd_linkspeed_str.3 \ > + man/ibnd_linkstate_str.3 \ > + man/ibnd_linkwidth_str.3 \ > + man/ibnd_node_type_str.3 \ > + man/ibnd_physstate_str.3 \ > + man/ibnd_update_node.3 \ > + man/ibnd_show_progress.3 > + > +EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver autogen.sh > + > diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > new file mode 100644 > index 0000000..773c64b > --- /dev/null > +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > @@ -0,0 +1,282 @@ > +/* > + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. 
> + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#ifndef _IBNETDISC_H_ > +#define _IBNETDISC_H_ > + > +#include > +#include > + > +#define MAXHOPS 63 > + > +/* HASH table defines */ > +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) > +#define HTSZ 137 This is used only in ./libibnetdisc/src/ibnetdisc.c, if so should it be in public header file? > + > +#define IBND_DEBUG(...) \ > + if (ibdebug) { \ > + printf("%s:%d; ", __FILE__, __LINE__); \ > + printf(__VA_ARGS__); \ > + } > +#define IBND_ERROR(...) \ > + { \ > + fprintf(stderr, "%s:%d; ", __FILE__, __LINE__); \ > + fprintf(stderr, __VA_ARGS__); \ > + } This can be done as #define IBND_ERROR(fmt, ...) \ fprintf(stderr, "%s:%u: " fmt, __FILE__, __LINE__, ## __VA_ARGS__) > + > +/** ========================================================================= > + * ENUM definitions > + */ > +typedef enum { > + IBND_CA_NODE = 1, > + IBND_SWITCH_NODE = 2, > + IBND_ROUTER_NODE = 3 > +} ibnd_node_type_t; infiniband/mad.h has IB_NODE_CA, IB_NODE_SWITCH, IB_NODE_ROUTER definitions already. 
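
[Editor's illustration] The single-call form of IBND_ERROR suggested earlier in this review can be sketched as below. This is only a sketch, assuming a compiler that accepts the GNU-style "## __VA_ARGS__" extension (gcc does; the thread reports VC does too, which is not verified here). The caller report_bad_port is hypothetical. Because the macro expands to one expression rather than a braced block, it can be followed by a semicolon inside an unbraced if/else, which the two-statement block form cannot.

```c
#include <stdio.h>

/* Single-expression variant: prefix and message go out in one fprintf
 * call. "## __VA_ARGS__" (a GNU extension) drops the preceding comma
 * when no variadic arguments are passed. */
#define IBND_ERROR(fmt, ...) \
	fprintf(stderr, "%s:%d: " fmt, __FILE__, __LINE__, ## __VA_ARGS__)

/* Hypothetical caller, for illustration only: returns 0 when the port
 * number is in range, otherwise the number of characters reported. */
static int report_bad_port(int portnum, int numports)
{
	if (portnum > numports)
		return IBND_ERROR("port %d out of range (max %d)\n",
				  portnum, numports);
	else
		return 0;
}
```

The format argument must be a string literal, since the macro relies on adjacent-literal concatenation to prepend the file and line prefix.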
> +
> +typedef enum {
> + IBND_LINK_DOWN = 1,
> + IBND_LINK_INIT = 2,
> + IBND_LINK_ARMED = 3,
> + IBND_LINK_ACTIVE = 4
> +} ibnd_link_state_t;
> +
> +/** =========================================================================
> + * Node
> + */
> +typedef struct switch_info {
> + int smaenhsp0;
> +} ibnd_switch_info_t;
> +
> +typedef struct node_info {
> + int base_ver;
> + int class_ver;
> + int type;
> + int numports;
> + uint64_t sysimgguid;
> + uint64_t nodeguid;
> + uint64_t nodeportguid;
> + uint16_t partition_cap;
> + uint32_t devid;
> + uint32_t revision;
> + int localport;
> + uint32_t vendid;
> +} ibnd_node_info_t;

Discussed in another thread about those definitions...

> +
> +struct ib_fabric; /* forward declare */
> +struct chassis; /* forward declare */
> +struct port; /* forward declare */
> +
> +typedef struct node {
> + struct node *next; /* all node list in fabric */
> + struct ib_fabric *fabric; /* the fabric node belongs to */
> +
> + ib_portid_t path_portid; /* path from "from_node" */
> + int dist; /* num of hops from "from_node" */
> + int smalid;
> + int smalmc;
> + ibnd_switch_info_t sw_info;
> + ibnd_node_info_t info;
> + char nodedesc[64];
> + struct port **ports; /* in order array of port pointers */
> + /* the size of this array is info.numports + 1 */
> + /* items MAY BE NULL! (ie 0 == switches only) */
> +
> + /* chassis info */
> + struct node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */
> + struct chassis *chassis; /* if != NULL the chassis this node belongs to */
> + unsigned char ch_type;
> + unsigned char ch_anafanum;
> + unsigned char ch_slotnum;
> + unsigned char ch_slot;
> +} ibnd_node_t;
> +
> +/** =========================================================================
> + * Port
> + */
> +typedef struct port_info {
> + int base_lid;
> + int smlid;
> + int link_speed_supported;
> + int link_speed_enabled;
> + int link_speed_active;
> + int port_state;
> + int phys_state;
> + int link_down_def_state;
> + int mkey_prot_bits;
> + int lmc;
> + int neighbor_mtu;
> + int smsl;
> + int init_type;
> + int vl_capability;
> + int vl_high_limit;
> + int vl_arb_high_cap;
> + int vl_arb_low_cap;
> + int init_reply;
> + int mtu_cap;
> + int vl_stall_count;
> + int hoq_lifetime;
> + int oper_vls;
> + int partition_enforce_in;
> + int partition_enforce_out;
> + int filter_raw_in;
> + int filter_raw_out;
> + int mkey_violations;
> + int pkey_violations;
> + int qkey_violations;
> + int guid_capabilities;
> + int client_rereg;
> + int subnet_timeout;
> + int response_time_val;
> + int local_phys_error;
> + int overrun_error;
> + int max_credit_hint;
> + uint32_t link_round_trip;
> + int local_port;
> + int link_width_supported;
> + int link_width_enabled;
> + int link_width_active;
> + int diag_code;
> + int mkey_lease;
> + uint32_t capability_mask;
> + uint64_t mkey;
> + uint64_t gid_prefix;
> +} ibnd_port_info_t;
> +
> +typedef struct port {
> + uint64_t guid;
> + int portnum;
> + int ext_portnum; /* optional if != 0 external port num */
> + ibnd_node_t *node; /* node this port belongs to */
> + ibnd_port_info_t info;
> + struct port *remoteport; /* null if SMA, or does not exist */
> +} ibnd_port_t;
> +
> +
> +/** =========================================================================
> + * Chassis data
> + */
> +typedef struct chassis {
> + struct chassis *next;
> + uint64_t chassisguid;
> + int chassisnum;
> +
> + /* generic grouping by SystemImageGUID */
> + int nodecount;
> + ibnd_node_t *nodes;
> +
> + /* specific to voltaire type nodes */
> +#define SPINES_MAX_NUM 12
> +#define LINES_MAX_NUM 36
> + ibnd_node_t *spinenode[SPINES_MAX_NUM + 1];
> + ibnd_node_t *linenode[LINES_MAX_NUM + 1];
> +} ibnd_chassis_t;
> +
> +/** =========================================================================
> + * Fabric
> + * Main fabric object which is returned and represents the data discovered
> + */
> +typedef struct ib_fabric {
> + /* the node the discover was initiated from
> + * "from" parameter in ibnd_discover_fabric
> + * or by default the node you ar running on
> + */
> + ibnd_node_t *from_node;
> + /* NULL term list of all nodes in the fabric */
> + ibnd_node_t *nodes;
> + /* NULL terminated list of all chassis found in the fabric */
> + ibnd_chassis_t *chassis;
> + int maxhops_discovered;
> +} ibnd_fabric_t;
> +
> +
> +/** =========================================================================
> + * Initialization (fabric operations)
> + */
> +void ibnd_debug(int i);
> +void ibnd_show_progress(int i);
> +
> +ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port,
> + int timeout_ms, ib_portid_t *from, int hops);
> + /**
> + * dev_name: (required) local device name to use to access the fabric
> + * dev_port: (required) local device port to use to access the fabric
> + * timeout_ms: (required) gives the timeout for a _SINGLE_ query on
> + * the fabric. So if there are mutiple nodes not
> + * responding this may result in a lengthy delay.
> + * from: (optional) specify the node to start scanning from.
> + * If NULL start from the node we are running on.
> + * hops: (optional) Specify how much of the fabric to traverse.
> + * negative value == scan entire fabric
> + */
> +void ibnd_destroy_fabric(ibnd_fabric_t *fabric);
> +
> +/** =========================================================================
> + * Node operations
> + */
> +ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid);
> +ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str);
> +ibnd_node_t *ibnd_update_node(ibnd_node_t *node);
> +
> +typedef void (*ibnd_iter_node_func_t)(ibnd_node_t *node, void *user_data);
> +void ibnd_iter_nodes(ibnd_fabric_t *fabric,
> + ibnd_iter_node_func_t func,
> + void *user_data);
> +void ibnd_iter_nodes_type(ibnd_fabric_t *fabric,
> + ibnd_iter_node_func_t func,
> + ibnd_node_type_t node_type,
> + void *user_data);
> +
> +/** =========================================================================
> + * Str convert functions
> + */
> +char *ibnd_linkwidth_str(int link_width);
> +char *ibnd_linkstate_str(int link_state);
> +char *ibnd_physstate_str(int phys_state);
> +const char *ibnd_node_type_str(ibnd_node_t *node);
> +const char *ibnd_node_type_str_short(ibnd_node_t *node);
> +char *ibnd_linkspeed_str(int link_speed, int data_rate);
> + /* if data_rate == 0 use "SDR", "DDR", etc. */
> + /* if data_rate == 1 use "2.5 Gbps", "5.0 Gbps", etc. */
> +
> +/** =========================================================================
> + * Chassis queries
> + */
> +uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum);
> +char *ibnd_get_chassis_type(ibnd_node_t *node);
> +char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size);
> +
> +int ibnd_is_xsigo_guid(uint64_t guid);
> +int ibnd_is_xsigo_tca(uint64_t guid);
> +int ibnd_is_xsigo_hca(uint64_t guid);
> +
> +#endif /* _IBNETDISC_H_ */

[snip...]
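
The callback-plus-user_data shape declared for ibnd_iter_nodes() in the header above can be tried without the library. What follows is only a sketch under assumptions: demo_node_t, demo_iter_nodes(), struct demo_count and the counting callback are hypothetical stand-ins that mirror the quoted prototypes; none of them is part of libibnetdisc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ibnd_node_t; only the fields used here. */
typedef struct demo_node {
	struct demo_node *next;	/* "all nodes" list link, as in the header */
	int type;
	uint64_t guid;
} demo_node_t;

/* Same shape as ibnd_iter_node_func_t. */
typedef void (*demo_iter_func_t)(demo_node_t *node, void *user_data);

/* Same shape as ibnd_iter_nodes(): call func once for every node. */
void demo_iter_nodes(demo_node_t *head, demo_iter_func_t func, void *user_data)
{
	demo_node_t *cur;

	for (cur = head; cur; cur = cur->next)
		func(cur, user_data);
}

/* State carried through user_data by the example callback. */
struct demo_count {
	int type;	/* node type to look for */
	int count;	/* number of matching nodes seen so far */
};

/* Example callback: count the nodes whose type matches. */
void demo_count_type(demo_node_t *node, void *user_data)
{
	struct demo_count *c = user_data;

	if (node->type == c->type)
		c->count++;
}

/* Build a three-node list (types 2, 1, 2) and count one type. */
int demo_count_nodes_of_type(int type)
{
	demo_node_t n3 = { NULL, 2, 0x3 };
	demo_node_t n2 = { &n3, 1, 0x2 };
	demo_node_t n1 = { &n2, 2, 0x1 };
	struct demo_count c = { type, 0 };

	demo_iter_nodes(&n1, demo_count_type, &c);
	return c.count;
}
```

With the real library one would presumably pass the ibnd_fabric_t returned by ibnd_discover_fabric() and a function of type ibnd_iter_node_func_t in the same way; that part is not exercised here.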
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> new file mode 100644
> index 0000000..29f691c
> --- /dev/null
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -0,0 +1,871 @@
> +/*
> + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved.
> + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved.
> + * Copyright (c) 2008 Lawrence Livermore National Laboratory
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +#if HAVE_CONFIG_H
> +# include
> +#endif /* HAVE_CONFIG_H */
> +
> +#define _GNU_SOURCE
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#include
> +#include
> +
> +#include
> +#include
> +
> +#include "internal.h"
> +#include "chassis.h"
> +
> +static int timeout_ms = 2000;
> +static int show_progress = 0;
> +
> +static char *linkwidth_str[] = {
> + "??",
> + "1x",
> + "4x",
> + "??",
> + "8x",
> + "??",
> + "??",
> + "??",
> + "12x"
> +};
> +
> +static char *linkspeed_str[] = {
> + "???",
> + "SDR",
> + "DDR",
> + "???",
> + "QDR"
> +};
> +
> +static char *linkspeed_datarate_str[] = {
> + "???",
> + "2.5 Gbps",
> + "5.0 Gbps",
> + "???",
> + "10.0 Gbps"
> +};
> +
> +static char *linkstate_str[] = {
> + "No State",
> + "Down",
> + "Init",
> + "Armed",
> + "Active"
> +};
> +
> +static char *physstate_str[] = {
> + "No State",
> + "Sleep",
> + "Polling",
> + "Disabled",
> + "PortConfigTraining",
> + "LinkUp",
> + "LinkErrorRecovery",
> + "Phy Test"
> +};
> +
> +char *
> +ibnd_linkwidth_str(int link_width)
> +{
> + if (link_width > 8)
> + return linkwidth_str[0];
> + else
> + return linkwidth_str[link_width];
> +}
> +
> +char *
> +ibnd_linkspeed_str(int link_speed, int data_rate)
> +{
> + if (link_speed > 4)
> + return linkspeed_str[0];
> + else if (data_rate)
> + return linkspeed_datarate_str[link_speed];
> + else
> + return linkspeed_str[link_speed];
> +}
> +char *
> +ibnd_linkstate_str(int link_state)
> +{
> + if (link_state > 4)
> + return linkstate_str[0];
> + else
> + return linkstate_str[link_state];
> +}
> +
> +char *
> +ibnd_physstate_str(int phys_state)
> +{
> + if (phys_state > 7)
> + return physstate_str[0];
> + else
> + return physstate_str[phys_state];
> +}
> +
> +void
> +decode_port_info(void * rcv_buf, ibnd_port_info_t *pi)
> +{
> + mad_decode_field(rcv_buf, IB_PORT_LID_F, &pi->base_lid);
> + mad_decode_field(rcv_buf, IB_PORT_SMLID_F, &pi->smlid);
> +
> + mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_SUPPORTED_F, &pi->link_speed_supported);
> + mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_ENABLED_F, &pi->link_speed_enabled);
> + mad_decode_field(rcv_buf, IB_PORT_LINK_SPEED_ACTIVE_F, &pi->link_speed_active);
> +
> + mad_decode_field(rcv_buf, IB_PORT_LOCAL_PORT_F, &pi->local_port);
> + mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_SUPPORTED_F, &pi->link_width_supported);
> + mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_ENABLED_F, &pi->link_width_enabled);
> +
> + mad_decode_field(rcv_buf, IB_PORT_LINK_WIDTH_ACTIVE_F, &pi->link_width_active);
> +
> + mad_decode_field(rcv_buf, IB_PORT_DIAG_F, &pi->diag_code);
> + mad_decode_field(rcv_buf, IB_PORT_MKEY_LEASE_F, &pi->mkey_lease);
> + mad_decode_field(rcv_buf, IB_PORT_CAPMASK_F, &pi->capability_mask);
> + mad_decode_field(rcv_buf, IB_PORT_MKEY_F, &pi->mkey);
> + mad_decode_field(rcv_buf, IB_PORT_GID_PREFIX_F, &pi->gid_prefix);
> +
> + mad_decode_field(rcv_buf, IB_PORT_STATE_F, &pi->port_state);
> + mad_decode_field(rcv_buf, IB_PORT_PHYS_STATE_F, &pi->phys_state);
> +
> + mad_decode_field(rcv_buf, IB_PORT_LINK_DOWN_DEF_F, &pi->link_down_def_state);
> + mad_decode_field(rcv_buf, IB_PORT_MKEY_PROT_BITS_F, &pi->mkey_prot_bits);
> +
> + mad_decode_field(rcv_buf, IB_PORT_LMC_F, &pi->lmc);
> + mad_decode_field(rcv_buf, IB_PORT_NEIGHBOR_MTU_F, &pi->neighbor_mtu);
> + mad_decode_field(rcv_buf, IB_PORT_SMSL_F, &pi->smsl);
> + mad_decode_field(rcv_buf, IB_PORT_INIT_TYPE_F, &pi->init_type);
> +
> + mad_decode_field(rcv_buf, IB_PORT_VL_CAP_F, &pi->vl_capability);
> + mad_decode_field(rcv_buf, IB_PORT_VL_HIGH_LIMIT_F, &pi->vl_high_limit);
> + mad_decode_field(rcv_buf, IB_PORT_VL_ARBITRATION_HIGH_CAP_F, &pi->vl_arb_high_cap);
> + mad_decode_field(rcv_buf, IB_PORT_VL_ARBITRATION_LOW_CAP_F, &pi->vl_arb_low_cap);
> +
> + mad_decode_field(rcv_buf, IB_PORT_INIT_TYPE_REPLY_F, &pi->init_reply);
> + mad_decode_field(rcv_buf, IB_PORT_MTU_CAP_F, &pi->mtu_cap);
> + mad_decode_field(rcv_buf, IB_PORT_VL_STALL_COUNT_F, &pi->vl_stall_count);
> + mad_decode_field(rcv_buf, IB_PORT_HOQ_LIFE_F, &pi->hoq_lifetime);
> + mad_decode_field(rcv_buf, IB_PORT_OPER_VLS_F, &pi->oper_vls);
> + mad_decode_field(rcv_buf, IB_PORT_PART_EN_INB_F, &pi->partition_enforce_in);
> + mad_decode_field(rcv_buf, IB_PORT_PART_EN_OUTB_F, &pi->partition_enforce_out);
> + mad_decode_field(rcv_buf, IB_PORT_FILTER_RAW_INB_F, &pi->filter_raw_in);
> + mad_decode_field(rcv_buf, IB_PORT_FILTER_RAW_OUTB_F, &pi->filter_raw_out);
> + mad_decode_field(rcv_buf, IB_PORT_MKEY_VIOL_F, &pi->mkey_violations);
> + mad_decode_field(rcv_buf, IB_PORT_PKEY_VIOL_F, &pi->pkey_violations);
> + mad_decode_field(rcv_buf, IB_PORT_QKEY_VIOL_F, &pi->qkey_violations);
> +
> + mad_decode_field(rcv_buf, IB_PORT_GUID_CAP_F, &pi->guid_capabilities);
> +
> + mad_decode_field(rcv_buf, IB_PORT_CLIENT_REREG_F, &pi->client_rereg);
> + mad_decode_field(rcv_buf, IB_PORT_SUBN_TIMEOUT_F, &pi->subnet_timeout);
> + mad_decode_field(rcv_buf, IB_PORT_RESP_TIME_VAL_F, &pi->response_time_val);
> + mad_decode_field(rcv_buf, IB_PORT_LOCAL_PHYS_ERR_F, &pi->local_phys_error);
> + mad_decode_field(rcv_buf, IB_PORT_OVERRUN_ERR_F, &pi->overrun_error);
> + mad_decode_field(rcv_buf, IB_PORT_MAX_CREDIT_HINT_F, &pi->max_credit_hint);
> + mad_decode_field(rcv_buf, IB_PORT_LINK_ROUND_TRIP_F, &pi->link_round_trip);
> +}
> +
> +static int
> +get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port,
> + int portnum, ib_portid_t *portid)
> +{
> + char portinfo[64];
> + void *pi = portinfo;
> +
> + port->port.portnum = portnum;
> +
> + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms,
> + fabric->ibmad_port))
> + return -1;
> +
> + decode_port_info(pi, &port->port.info);
> +
> + IBND_DEBUG("portid %s portnum %d: base lid %d state %d physstate %d %s %s\n",
> + portid2str(portid), portnum, port->port.info.base_lid, port->port.info.port_state,
> + port->port.info.phys_state, ibnd_linkwidth_str(port->port.info.link_width_active),
> + ibnd_linkspeed_str(port->port.info.link_speed_active, 0));
> + return 1;

Maybe better to use '0' as the successful status (similar to other functions like query_node_info())?

> +}
> +
> +static void
> +decode_node_info(void * rcv_buf, ibnd_node_info_t *ni)
> +{
> + mad_decode_field(rcv_buf, IB_NODE_BASE_VERS_F, &ni->base_ver);
> + mad_decode_field(rcv_buf, IB_NODE_CLASS_VERS_F, &ni->class_ver);
> + mad_decode_field(rcv_buf, IB_NODE_TYPE_F, &ni->type);
> + mad_decode_field(rcv_buf, IB_NODE_NPORTS_F, &ni->numports);
> + mad_decode_field(rcv_buf, IB_NODE_SYSTEM_GUID_F, &ni->sysimgguid);
> + mad_decode_field(rcv_buf, IB_NODE_GUID_F, &ni->nodeguid);
> + mad_decode_field(rcv_buf, IB_NODE_PORT_GUID_F, &ni->nodeportguid);
> + mad_decode_field(rcv_buf, IB_NODE_PARTITION_CAP_F, &ni->partition_cap);
> + mad_decode_field(rcv_buf, IB_NODE_DEVID_F, &ni->devid);
> + mad_decode_field(rcv_buf, IB_NODE_REVISION_F, &ni->revision);
> + mad_decode_field(rcv_buf, IB_NODE_LOCAL_PORT_F, &ni->localport);
> + mad_decode_field(rcv_buf, IB_NODE_VENDORID_F, &ni->vendid);
> +}
> +
> +/*
> + * Returns -1 if error.
> + */
> +static int
> +query_node_info(struct ibnd_fabric *fabric, struct ibnd_node *node, ib_portid_t *portid)
> +{
> + char nodeinfo[64];
> + void *ni = nodeinfo;
> + if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout_ms,
> + fabric->ibmad_port))
> + return -1;
> + decode_node_info(ni, &(node->node.info));
> + return (0);
> +}
> +
> +/*
> + * Returns 0 if non switch node is found, 1 if switch is found, -1 if error.
> + */
> +static int
> +query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode,
> + struct ibnd_port *iport, ib_portid_t *portid)
> +{
> + char portinfo[64];
> + void *pi = portinfo;
> + char switchinfo[64];
> + void *si = switchinfo;
> + ibnd_node_t *node = &(inode->node);
> + ibnd_port_t *port = &(iport->port);
> + void *nd = inode->node.nodedesc;
> +
> + if (query_node_info(fabric, inode, portid))
> + return -1;
> +
> + port->portnum = node->info.localport;
> + port->guid = node->info.nodeportguid;
> +
> + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms,
> + fabric->ibmad_port))
> + return -1;
> +
> + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout_ms,
> + fabric->ibmad_port))
> + return -1;
> + decode_port_info(pi, &port->info);
> +
> + if (node->info.type != IBND_SWITCH_NODE)
> + return 0;
> +
> + node->smalid = port->info.base_lid;
> + node->smalmc = port->info.lmc;
> +
> + /* after we have the sma information find out the real PortInfo for this port */
> + if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->info.localport, timeout_ms,
> + fabric->ibmad_port))
> + return -1;
> + decode_port_info(pi, &port->info);
> +
> + if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms,
> + fabric->ibmad_port))
> + node->sw_info.smaenhsp0 = 0; /* assume base SP0 */
> + else
> + mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->sw_info.smaenhsp0);

Somehow indentation is broken here.

> +
> + IBND_DEBUG("portid %s: got switch node %" PRIx64 " '%s'\n",
> + portid2str(portid), node->info.nodeguid, node->nodedesc);
> + return 1;

Same about return status.
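
One possible reading of the return-status remarks, sketched under assumptions: the query returns 0 on success and -1 on error, and the caller reads the node type from the filled-in structure instead of from the return value. Everything below (struct demo_node, DEMO_SWITCH_NODE, demo_query_node(), demo_count_switches()) is hypothetical and stands in for the patch's types; a reply_type <= 0 merely simulates a failed SMP query.

```c
#include <stddef.h>

#define DEMO_SWITCH_NODE 2	/* hypothetical stand-in for IBND_SWITCH_NODE */

struct demo_node {
	int type;	/* filled in by a successful query */
};

/*
 * Returns 0 on success, -1 on error.
 * The node type is stored in *node and never encoded in the status.
 */
int demo_query_node(struct demo_node *node, int reply_type)
{
	if (!node || reply_type <= 0)
		return -1;	/* simulated query failure */
	node->type = reply_type;
	return 0;
}

/* Counts switches among the replies, skipping failed queries. */
size_t demo_count_switches(const int *reply_types, size_t n)
{
	size_t i, switches = 0;

	for (i = 0; i < n; i++) {
		struct demo_node node = { 0 };

		if (demo_query_node(&node, reply_types[i]) < 0)
			continue;	/* error: nothing to inspect */
		if (node.type == DEMO_SWITCH_NODE)
			switches++;
	}
	return switches;
}
```

Callers then test the status uniformly with "< 0" (or "!= 0") and no longer need to know that a positive value meant "switch".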
> +} > + > +static int > +add_port_to_dpath(ib_dr_path_t *path, int nextport) > +{ > + if (path->cnt+2 >= sizeof(path->p)) > + return -1; > + ++path->cnt; > + path->p[path->cnt] = nextport; > + return path->cnt; > +} > + > +static int > +extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) > +{ > + int rc = add_port_to_dpath(path, nextport); > + if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) > + f->fabric.maxhops_discovered = path->cnt; > + return (rc); > +} > + > +static void > +dump_endnode(ib_portid_t *path, char *prompt, > + struct ibnd_node *node, struct ibnd_port *port) > +{ > + if (!show_progress) > + return; > + > + printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", > + portid2str(path), prompt, > + ibnd_node_type_str((ibnd_node_t *)node), > + node->node.info.nodeguid, > + node->node.info.type == IBND_SWITCH_NODE ? 0 : port->port.portnum, > + port->port.info.base_lid, port->port.info.base_lid + (1 << port->port.info.lmc) - 1, > + node->node.nodedesc); > +} > + > +static struct ibnd_node * > +find_existing_node(struct ibnd_fabric *fabric, struct ibnd_node *new) > +{ > + int hash = HASHGUID(new->node.info.nodeguid) % HTSZ; > + struct ibnd_node *node; > + > + for (node = fabric->nodestbl[hash]; node; node = node->htnext) > + if (node->node.info.nodeguid == new->node.info.nodeguid) > + return node; > + > + return NULL; > +} > + > +ibnd_node_t * > +ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid) > +{ > + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > + int hash = HASHGUID(guid) % HTSZ; > + struct ibnd_node *node; > + > + for (node = f->nodestbl[hash]; node; node = node->htnext) > + if (node->node.info.nodeguid == guid) > + return (ibnd_node_t *)node; > + > + return NULL; > +} > + > +ibnd_node_t * > +ibnd_update_node(ibnd_node_t *node) > +{ > + char portinfo[64]; > + void *pi = portinfo; > + ibnd_port_info_t port0_info; > + char switchinfo[64]; > + void *si = switchinfo; > + void *nd = 
node->nodedesc; > + int p = 0; > + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); > + struct ibnd_node *n = CONV_NODE_INTERNAL(node); > + > + if (query_node_info(f, n, &(n->node.path_portid))) > + return (NULL); > + > + if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms, > + f->ibmad_port)) > + return (NULL); > + > + /* update all the port info's */ > + for (p = 1; p >= n->node.info.numports; p++) { > + get_port_info(f, CONV_PORT_INTERNAL(n->node.ports[p]), p, &(n->node.path_portid)); > + } > + > + if (n->node.info.type != IBND_SWITCH_NODE) > + goto done; > + > + if (!smp_query_via(pi, &(n->node.path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms, > + f->ibmad_port)) > + return (NULL); > + decode_port_info(pi, &port0_info); > + > + n->node.smalid = port0_info.base_lid; > + n->node.smalmc = port0_info.lmc; > + > + if (!smp_query_via(si, &(n->node.path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms, > + f->ibmad_port)) > + node->sw_info.smaenhsp0 = 0; /* assume base SP0 */ > + else > + mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &n->node.sw_info.smaenhsp0); > + > +done: > + return (node); > +} > + > +ibnd_node_t * > +ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str) > +{ > + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > + int i = 0; > + ibnd_node_t *rc = f->fabric.from_node; > + ib_dr_path_t path; > + > + if (str2drpath(&path, dr_str, 0, 0) == -1) { > + return (NULL); > + } > + > + for (i = 0; i <= path.cnt; i++) { > + ibnd_port_t *remote_port = NULL; > + if (path.p[i] == 0) > + continue; > + if (!rc->ports) > + return (NULL); > + > + remote_port = rc->ports[path.p[i]]->remoteport; > + if (!remote_port) > + return (NULL); > + > + rc = remote_port->node; > + } > + > + return (rc); > +} > + > +static void > +add_to_nodeguid_hash(struct ibnd_node *node, struct ibnd_node *hash[]) > +{ > + int hash_idx = HASHGUID(node->node.info.nodeguid) % HTSZ; > + > + node->htnext = hash[hash_idx]; > + hash[hash_idx] = node; > +} > + > 
+static void > +add_to_portguid_hash(struct ibnd_port *port, struct ibnd_port *hash[]) > +{ > + int hash_idx = HASHGUID(port->port.guid) % HTSZ; > + > + port->htnext = hash[hash_idx]; > + hash[hash_idx] = port; > +} > + > +static void > +add_to_type_list(struct ibnd_node*node, struct ibnd_fabric *fabric) > +{ > + switch (node->node.info.type) { > + case IBND_CA_NODE: > + node->type_next = fabric->ch_adapters; > + fabric->ch_adapters = node; > + break; > + case IBND_SWITCH_NODE: > + node->type_next = fabric->switches; > + fabric->switches = node; > + break; > + case IBND_ROUTER_NODE: > + node->type_next = fabric->routers; > + fabric->routers = node; > + break; > + } > +} > + > +static void > +add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric) > +{ > + int dist = node->node.dist; > + if (node->node.info.type != IBND_SWITCH_NODE) > + dist = MAXHOPS; /* special Ca list */ > + > + node->dnext = fabric->nodesdist[dist]; > + fabric->nodesdist[dist] = node; > +} > + > + > +static struct ibnd_node * > +create_node(struct ibnd_fabric *fabric, struct ibnd_node *temp, ib_portid_t *path, int dist) > +{ > + struct ibnd_node *node; > + > + node = malloc(sizeof(*node)); > + if (!node) { > + IBPANIC("OOM: node creation failed\n"); > + return NULL; > + } > + > + memcpy(node, temp, sizeof(*node)); > + node->node.dist = dist; > + node->node.path_portid = *path; > + node->node.fabric = (ibnd_fabric_t *)fabric; > + > + add_to_nodeguid_hash(node, fabric->nodestbl); > + > + /* add this to the all nodes list */ > + node->node.next = fabric->fabric.nodes; > + fabric->fabric.nodes = (ibnd_node_t *)node; > + > + add_to_type_list(node, fabric); > + add_to_nodedist(node, fabric); > + > + return node; > +} > + > +static struct ibnd_port * > +find_existing_port_node(struct ibnd_node *node, struct ibnd_port *port) > +{ > + if (port->port.portnum > node->node.info.numports || node->node.ports == NULL ) > + return (NULL); > + > + return 
(CONV_PORT_INTERNAL(node->node.ports[port->port.portnum])); > +} > + > +static struct ibnd_port * > +add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_port *temp) > +{ > + struct ibnd_port *port; > + > + port = malloc(sizeof(*port)); > + if (!port) > + return NULL; > + > + memcpy(port, temp, sizeof(*port)); > + port->port.node = (ibnd_node_t *)node; > + port->port.ext_portnum = 0; > + > + if (node->node.ports == NULL) { > + node->node.ports = calloc(sizeof(*node->node.ports), node->node.info.numports + 1); > + if (!node->node.ports) { > + IBND_ERROR("Failed to allocate the ports array\n"); > + return (NULL); > + } > + } > + > + node->node.ports[temp->port.portnum] = (ibnd_port_t *)port; > + > + add_to_portguid_hash(port, fabric->portstbl); > + return port; > +} > + > +static void > +link_ports(struct ibnd_node *node, struct ibnd_port *port, > + struct ibnd_node *remotenode, struct ibnd_port *remoteport) > +{ > + IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u\n", > + node->node.info.nodeguid, node, port, port->port.portnum, > + remotenode->node.info.nodeguid, remotenode, > + remoteport, remoteport->port.portnum); > + if (port->port.remoteport) > + port->port.remoteport->remoteport = NULL; > + if (remoteport->port.remoteport) > + remoteport->port.remoteport->remoteport = NULL; > + port->port.remoteport = (ibnd_port_t *)remoteport; > + remoteport->port.remoteport = (ibnd_port_t *)port; > +} > + > +static int > +get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_port *port, ib_portid_t *path, > + int portnum, int dist) > +{ > + struct ibnd_node node_buf; > + struct ibnd_port port_buf; > + struct ibnd_node *remotenode, *oldnode; > + struct ibnd_port *remoteport, *oldport; > + > + memset(&node_buf, 0, sizeof(node_buf)); > + memset(&port_buf, 0, sizeof(port_buf)); > + > + IBND_DEBUG("handle node %p port %p:%d dist %d\n", node, port, portnum, dist); > + if (port->port.info.phys_state != 
5) /* LinkUp */ > + return -1; > + > + if (extend_dpath(fabric, &path->drpath, portnum) < 0) > + return -1; > + > + if (query_node(fabric, &node_buf, &port_buf, path) < 0) { > + IBWARN("NodeInfo on %s failed, skipping port", > + portid2str(path)); > + path->drpath.cnt--; /* restore path */ > + return -1; > + } > + > + oldnode = find_existing_node(fabric, &node_buf); > + if (oldnode) > + remotenode = oldnode; > + else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) > + IBPANIC("no memory"); > + > + oldport = find_existing_port_node(remotenode, &port_buf); > + if (oldport) { > + remoteport = oldport; > + } else if (!(remoteport = add_port_to_node(fabric, remotenode, &port_buf))) > + IBPANIC("no memory"); > + > + dump_endnode(path, oldnode ? "known remote" : "new remote", > + remotenode, remoteport); > + > + link_ports(node, port, remotenode, remoteport); > + > + path->drpath.cnt--; /* restore path */ > + return 0; > +} > + > +static void * > +ibnd_init_port(char *dev_name, int dev_port) > +{ > + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; > + > + /* Crank up the mad lib */ > + return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2)); > +} > + > +ibnd_fabric_t * > +ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, > + ib_portid_t *from, int hops) > +{ > + struct ibnd_fabric *fabric = NULL; > + ib_portid_t my_portid = {0}; > + struct ibnd_node node_buf; > + struct ibnd_port port_buf; > + struct ibnd_node *node; > + struct ibnd_port *port; > + int i; > + int dist = 0; > + ib_portid_t *path; > + int max_hops = MAXHOPS-1; /* default find everything */ > + > + /* if not everything how much? 
*/ > + if (hops >= 0) { > + max_hops = hops; > + } > + > + /* If not specified start from "my" port */ > + if (!from) { > + from = &my_portid; > + } > + > + fabric = malloc(sizeof(*fabric)); > + > + if (!fabric) { > + IBPANIC("OOM: failed to malloc ibnd_fabric_t\n"); > + return (NULL); > + } > + > + memset(fabric, 0, sizeof(*fabric)); > + > + fabric->ibmad_port = ibnd_init_port(dev_name, dev_port); > + if (!fabric->ibmad_port) { > + IBPANIC("OOM: failed to open \"%s\" port %d\n", > + dev_name, dev_port); > + goto error; > + } > + > + IBND_DEBUG("from %s\n", portid2str(from)); > + > + memset(&node_buf, 0, sizeof(node_buf)); > + memset(&port_buf, 0, sizeof(port_buf)); > + > + if (query_node(fabric, &node_buf, &port_buf, from) < 0) { > + IBWARN("can't reach node %s\n", portid2str(from)); > + goto error; > + } > + > + node = create_node(fabric, &node_buf, from, 0); > + if (!node) > + goto error; > + > + fabric->fabric.from_node = (ibnd_node_t *)node; > + > + port = add_port_to_node(fabric, node, &port_buf); > + if (!port) > + IBPANIC("out of memory"); > + > + if (node->node.info.type != IBND_SWITCH_NODE && > + get_remote_node(fabric, node, port, from, node->node.info.localport, 0) < 0) > + return ((ibnd_fabric_t *)fabric); > + > + for (dist = 0; dist <= max_hops; dist++) { > + > + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { > + > + path = &node->node.path_portid; > + > + IBND_DEBUG("dist %d node %p\n", dist, node); > + dump_endnode(path, "processing", node, port); > + > + for (i = 1; i <= node->node.info.numports; i++) { > + if (i == node->node.info.localport) > + continue; > + > + if (get_port_info(fabric, &port_buf, i, path) < 0) { > + IBWARN("can't reach node %s port %d", portid2str(path), i); > + continue; > + } > + > + port = find_existing_port_node(node, &port_buf); > + if (port) > + continue; > + > + port = add_port_to_node(fabric, node, &port_buf); > + if (!port) > + IBPANIC("out of memory"); > + > + /* If switch, set port GUID to node port 
GUID */ > + if (node->node.info.type == IBND_SWITCH_NODE) > + port->port.guid = node->node.info.nodeportguid; > + > + get_remote_node(fabric, node, port, path, i, dist); > + } > + } > + } > + > + fabric->fabric.chassis = group_nodes(fabric); > + > + return ((ibnd_fabric_t *)fabric); > +error: > + free(fabric); > + return (NULL); > +} > + > +static void > +destroy_node(struct ibnd_node *node) > +{ > + int p = 0; > + > + for (p = 0; p <= node->node.info.numports; p++) { > + free(node->node.ports[p]); > + } > + free(node->node.ports); > + free(node); > +} > + > +void > +ibnd_destroy_fabric(ibnd_fabric_t *fabric) > +{ > + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > + int dist = 0; > + struct ibnd_node *node = NULL; > + struct ibnd_node *next = NULL; > + ibnd_chassis_t *ch, *ch_next; > + > + ch = f->first_chassis; > + while (ch) { > + ch_next = ch->next; > + free(ch); > + ch = ch_next; > + } > + for (dist = 0; dist <= MAXHOPS; dist++) { > + node = f->nodesdist[dist]; > + while (node) { > + next = node->dnext; > + destroy_node(node); > + node = next; > + } > + } > + if (f->ibmad_port) > + mad_rpc_close_port(f->ibmad_port); > + free(f); > +} > + > +void > +ibnd_debug(int i) > +{ > + if (i) { > + ibdebug++; > + madrpc_show_errors(1); > + umad_debug(i); > + } else { > + ibdebug = 0; > + madrpc_show_errors(0); > + umad_debug(0); > + } > +} > + > +void > +ibnd_show_progress(int i) > +{ > + show_progress = i; > +} > + > +const char* > +ibnd_node_type_str(ibnd_node_t *node) > +{ > + switch(node->info.type) { > + case IBND_CA_NODE: return "Ca"; > + case IBND_SWITCH_NODE: return "Switch"; > + case IBND_ROUTER_NODE: return "Router"; > + } > + return "??"; > +} > + > +const char* > +ibnd_node_type_str_short(ibnd_node_t *node) > +{ > + switch(node->info.type) { > + case IBND_SWITCH_NODE: return "SW"; > + case IBND_CA_NODE: return "CA"; > + case IBND_ROUTER_NODE: return "RT"; > + } > + return "??"; > +} > + > + > +void > +ibnd_iter_nodes(ibnd_fabric_t *fabric, > + 
ibnd_iter_node_func_t func, > + void *user_data) > +{ > + ibnd_node_t *cur = NULL; > + > + for (cur = fabric->nodes; cur; cur = cur->next) { > + func(cur, user_data); > + } > +} > + > + > +void > +ibnd_iter_nodes_type(ibnd_fabric_t *fabric, > + ibnd_iter_node_func_t func, > + ibnd_node_type_t node_type, > + void *user_data) > +{ > + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); > + struct ibnd_node *list = NULL; > + struct ibnd_node *cur = NULL; > + > + switch (node_type) { > + case IBND_SWITCH_NODE: > + list = f->switches; > + break; > + case IBND_CA_NODE: > + list = f->ch_adapters; > + break; > + case IBND_ROUTER_NODE: > + list = f->routers; > + break; > + default: > + IBND_DEBUG("Invalid node_type specified %d\n", node_type); > + break; > + } > + > + for (cur = list; cur; cur = cur->type_next) { > + func((ibnd_node_t *)cur, user_data); > + } > +} > + > diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h > new file mode 100644 > index 0000000..89f238f > --- /dev/null > +++ b/infiniband-diags/libibnetdisc/src/internal.h > @@ -0,0 +1,82 @@ > +/* > + * Copyright (c) 2008 Lawrence Livermore National Laboratory > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. 
> + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +/** ========================================================================= > + * Define the internal data structures. > + */ > + > +#ifndef _INTERNAL_H_ > +#define _INTERNAL_H_ > + > +#include > + > +struct ibnd_node { > + /* This member MUST BE FIRST */ > + ibnd_node_t node; > + > + /* internal use only */ > + unsigned char ch_found; > + struct ibnd_node *htnext; /* hash table list */ > + struct ibnd_node *dnext; /* nodesdist next */ > + struct ibnd_node *type_next; /* next based on type */ > +}; > +#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node) > + > +struct ibnd_port { > + /* This member MUST BE FIRST */ > + ibnd_port_t port; > + > + /* internal use only */ > + struct ibnd_port *htnext; > +}; > +#define CONV_PORT_INTERNAL(port) ((struct ibnd_port *)port) > + > +struct ibnd_fabric { > + /* This member MUST BE FIRST */ > + ibnd_fabric_t fabric; > + > + /* internal use only */ > + void *ibmad_port; > + struct ibnd_node *nodestbl[HTSZ]; > + struct ibnd_port *portstbl[HTSZ]; > + struct ibnd_node *nodesdist[MAXHOPS+1]; > + ibnd_chassis_t *first_chassis; > + ibnd_chassis_t *current_chassis; > + ibnd_chassis_t *last_chassis; > + struct ibnd_node *switches; > + struct ibnd_node *ch_adapters; > + 
struct ibnd_node *routers;
> +};
> +#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric)

Why should we hide this internal data so thoroughly? Maybe we could use a single structure and mark those fields (in a comment) as "for internal use" - somebody may want to use them.

> +
> +#endif /* _INTERNAL_H_ */
> diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map
> new file mode 100644
> index 0000000..5e8c315
> --- /dev/null
> +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map
> @@ -0,0 +1,27 @@
> +IBNETDISC_1.0 {
> +	global:
> +		ibnd_debug;
> +		ibnd_show_progress;
> +		ibnd_discover_fabric;
> +		ibnd_cache_fabric;
> +		ibnd_read_fabric;
> +		ibnd_destroy_fabric;
> +		ibnd_find_node_guid;
> +		ibnd_update_node;

Where is ibnd_update_node() useful (in the public API)?

> +		ibnd_find_node_dr;
> +		ibnd_linkwidth_str;
> +		ibnd_linkspeed_str;
> +		ibnd_node_type_str;
> +		ibnd_node_type_str_short;
> +		ibnd_is_xsigo_guid;
> +		ibnd_is_xsigo_tca;
> +		ibnd_is_xsigo_hca;
> +		ibnd_get_chassis_guid;
> +		ibnd_get_chassis_type;
> +		ibnd_get_chassis_slot_str;
> +		ibnd_linkstate_str;
> +		ibnd_physstate_str;
> +		ibnd_iter_nodes;
> +		ibnd_iter_nodes_type;
> +	local: *;
> +};

Sasha

From alexxy at gentoo.org Wed Jan 21 06:55:21 2009
From: alexxy at gentoo.org (Alexey Shvetsov)
Date: Wed, 21 Jan 2009 17:55:21 +0300
Subject: [ofa-general] ***SPAM*** OFED-1.4 and Gentoo
Message-ID: <200901211755.28192.alexxy@gentoo.org>

Hi all!

I want to add OFED-1.4 to the Gentoo main tree, but it's a little bit annoying since OFED-1.4.tgz contains source RPMs with regular tarballs inside. Can the OpenFabrics developers make split tarballs for OFED-1.4, or for the next OFED release, publicly available?

--
Alexey 'Alexxy' Shvetsov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part.
URL: From vlad at lists.openfabrics.org Wed Jan 21 07:22:39 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 21 Jan 2009 07:22:39 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090121-0558 daily build status Message-ID: <20090121152239.D0B3AE60F24@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with 
linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:
Build failed on i686 with linux-2.6.16
Build failed on i686 with linux-2.6.18
Build failed on i686 with linux-2.6.17

From sashak at voltaire.com Wed Jan 21 07:31:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 21 Jan 2009 17:31:59 +0200
Subject: [ofa-general] ***SPAM*** OFED-1.4 and Gentoo
In-Reply-To: <200901211755.28192.alexxy@gentoo.org>
References: <200901211755.28192.alexxy@gentoo.org>
Message-ID: <20090121153159.GE3479@sashak.voltaire.com>

On 17:55 Wed 21 Jan , Alexey Shvetsov wrote:
>
> I want to add OFED-1.4 to the Gentoo main tree, but it's a little bit annoying since
> OFED-1.4.tgz contains source RPMs with regular tarballs inside.
> Can the OpenFabrics developers make split tarballs for OFED-1.4, or for the next OFED release,
> publicly available?

IB management (libibumad, OpenSM, infiniband-diags) and some other tarballs are publicly available at:

http://www.openfabrics.org/downloads/

(see also http://www.openfabrics.org/download_linux.htm)

Sasha

From dorfman.eli at gmail.com Wed Jan 21 07:48:35 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Wed, 21 Jan 2009 17:48:35 +0200
Subject: [ofa-general] [PATCH RFC] opensm: sort port order for routing by switch loads
In-Reply-To: <20090120150553.GB28955@sashak.voltaire.com>
References: <20090120150553.GB28955@sashak.voltaire.com>
Message-ID: <497743D3.70604@gmail.com>

Sasha Khapyorsky wrote:
> It follows "port order" routing load balancer improvements
> (implemented using "--guid_routing_order_file" command line option).
>
> The idea of this patch is about default behavior and it is to balance
> routing paths in such order that most loaded links enter balancer first
> - in most cases it should provide a better performance than just
> random balancing (as it is done now by default).
> > The implementation is simple - endport list for load balancer is reverse > sorted by number of active endport links of leaf switches. > > Signed-off-by: Sasha Khapyorsky > --- > > Comments are appreciated. > > Sasha > > opensm/opensm/osm_ucast_mgr.c | 58 ++++++++++++++++++++++++++++++++++++++++- > 1 files changed, 57 insertions(+), 1 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 96921a0..58a6714 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -744,6 +744,61 @@ static void clear_prof_ignore_flag(cl_map_item_t * const p_map_item, void *ctx) > } > } > > +static void add_sw_endports_to_order_list(osm_switch_t *sw, osm_ucast_mgr_t *m) > +{ > + osm_port_t *port; > + osm_physp_t *p; > + int i; > + for (i = 1; i < sw->num_ports; i++) { > + p = osm_node_get_physp_ptr(sw->p_node, i); > + if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) { > + port = osm_get_port_by_guid(m->p_subn, > + p->p_remote_physp->port_guid); > + cl_qlist_insert_tail(&m->port_order_list, > + &port->list_item); > + port->flag = 1; > + } > + } > +} > + > +static int sw_count_endport_links(osm_switch_t * const *s) isn't it better calculate this count only once before sort_ports_by_switch_load() > +{ > + const osm_switch_t *sw = *s; > + int i, n = 0; > + for (i = 1; i < sw->num_ports; i++) { > + osm_physp_t *p = osm_node_get_physp_ptr(sw->p_node, i); > + if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw && > + ib_port_info_get_port_state(&p->port_info) == > + IB_LINK_ACTIVE) > + n++; > + } > + return n; > +} > + > +static int compar_sw_load(const void *s1, const void *s2) > +{ > + return sw_count_endport_links(s2) - sw_count_endport_links(s1); > +} > + > +static void sort_ports_by_switch_load(osm_ucast_mgr_t *m) > +{ > + int i, num = cl_qmap_count(&m->p_subn->sw_guid_tbl); > + osm_switch_t **s = malloc(num * sizeof(*s)); > + if (!s) { > + OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: " > + "No 
memory, skip by switch load sorting.\n"); > + return; > + } > + s[0] = (osm_switch_t *)cl_qmap_head(&m->p_subn->sw_guid_tbl); > + for (i = 1; i < num; i++) > + s[i] = (osm_switch_t *)cl_qmap_next(&s[i-1]->map_item); > + > + qsort(s, num, sizeof(*s), compar_sw_load); > + > + for (i = 0; i < num; i++) > + add_sw_endports_to_order_list(s[i], m); > +} > + > static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > { > cl_qlist_init(&p_mgr->port_order_list); > @@ -758,7 +813,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : " > "cannot parse guid routing order file \'%s\'\n", > p_mgr->p_subn->opt.guid_routing_order_file); > - } > + } else > + sort_ports_by_switch_load(p_mgr); > > if (p_mgr->p_subn->opt.port_prof_ignore_file) { > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, From sashak at voltaire.com Wed Jan 21 07:50:07 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jan 2009 17:50:07 +0200 Subject: [ofa-general] Re: [PATCH 1/3 - no ibcommon] Create a new library libibnetdisc In-Reply-To: <20090109154749.4b19c8bf.weiny2@llnl.gov> References: <20090109154749.4b19c8bf.weiny2@llnl.gov> Message-ID: <20090121154959.GF3479@sashak.voltaire.com> On 15:47 Fri 09 Jan , Ira Weiny wrote: > From 677ca6d7ead4b720b2ba260cb35aca429190b6e8 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Wed, 26 Nov 2008 12:54:47 -0800 > Subject: [PATCH] Create a new library libibnetdisc Some more... [snip...] > diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am > new file mode 100644 > index 0000000..8e0e16b > --- /dev/null > +++ b/infiniband-diags/libibnetdisc/Makefile.am > @@ -0,0 +1,62 @@ > + > +#SUBDIRS = . 
> + > +INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband > + > +lib_LTLIBRARIES = libibnetdisc.la > +sbin_PROGRAMS = > + > +if ENABLE_TEST_UTILS > +sbin_PROGRAMS += test/ibnetdisctest \ > + test/iblinkinfotest \ > + test/testleaks > +endif > + > +DBGFLAGS = -g > + > +if HAVE_LD_VERSION_SCRIPT > +libibnetdisc_version_script = -Wl,--version-script=$(srcdir)/src/libibnetdisc.map > +else > +libibnetdisc_version_script = > +endif > + > +libibnetdisc_la_SOURCES = src/ibnetdisc.c src/chassis.c src/chassis.h > +libibnetdisc_la_CFLAGS = -Wall $(DBGFLAGS) > +libibnetdisc_la_LDFLAGS = -version-info $(ibnetdisc_api_version) \ > + -export-dynamic $(libibnetdisc_version_script) \ > + -losmcomp -libmad I don't really see where libosmcomp is used in libibnetdisc, likely this flag (-losmcomp) should go to test tools. > +libibnetdisc_la_DEPENDENCIES = $(srcdir)/src/libibnetdisc.map > + > +libibnetdiscincludedir = $(includedir)/infiniband > + > +test_ibnetdisctest_SOURCES = test/ibnetdisctest.c > +test_ibnetdisctest_CFLAGS = -Wall $(DBGFLAGS) > +test_ibnetdisctest_LDFLAGS = -libnetdisc > + > +test_iblinkinfotest_SOURCES = test/iblinkinfotest.c > +test_iblinkinfotest_CFLAGS = -Wall $(DBGFLAGS) > +test_iblinkinfotest_LDFLAGS = -libnetdisc > + > +test_testleaks_SOURCES = test/testleaks.c > +test_testleaks_CFLAGS = -Wall $(DBGFLAGS) > +test_testleaks_LDFLAGS = -libnetdisc > + > +libibnetdiscinclude_HEADERS = $(srcdir)/include/infiniband/ibnetdisc.h > + > +man_MANS = man/ibnd_debug.3 \ > + man/ibnd_destroy_fabric.3 \ > + man/ibnd_discover_fabric.3 \ > + man/ibnd_find_node_dr.3 \ > + man/ibnd_find_node_guid.3 \ > + man/ibnd_iter_nodes.3 \ > + man/ibnd_iter_nodes_type.3 \ > + man/ibnd_linkspeed_str.3 \ > + man/ibnd_linkstate_str.3 \ > + man/ibnd_linkwidth_str.3 \ > + man/ibnd_node_type_str.3 \ > + man/ibnd_physstate_str.3 \ > + man/ibnd_update_node.3 \ > + man/ibnd_show_progress.3 > + > +EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver autogen.sh > + 
There is no 'autogen.sh' in libibnetdisc.

Sasha

From sashak at voltaire.com Wed Jan 21 07:54:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 21 Jan 2009 17:54:24 +0200
Subject: [ofa-general] [PATCH RFC] opensm: sort port order for routing by switch loads
In-Reply-To: <497743D3.70604@gmail.com>
References: <20090120150553.GB28955@sashak.voltaire.com> <497743D3.70604@gmail.com>
Message-ID: <20090121155416.GG3479@sashak.voltaire.com>

On 17:48 Wed 21 Jan , Eli Dorfman (Voltaire) wrote:
> >
> > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
> > index 96921a0..58a6714 100644
> > --- a/opensm/opensm/osm_ucast_mgr.c
> > +++ b/opensm/opensm/osm_ucast_mgr.c
> > @@ -744,6 +744,61 @@ static void clear_prof_ignore_flag(cl_map_item_t * const p_map_item, void *ctx)
> >  	}
> >  }
> >
> > +static void add_sw_endports_to_order_list(osm_switch_t *sw, osm_ucast_mgr_t *m)
> > +{
> > +	osm_port_t *port;
> > +	osm_physp_t *p;
> > +	int i;
> > +	for (i = 1; i < sw->num_ports; i++) {
> > +		p = osm_node_get_physp_ptr(sw->p_node, i);
> > +		if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) {
> > +			port = osm_get_port_by_guid(m->p_subn,
> > +						p->p_remote_physp->port_guid);
> > +			cl_qlist_insert_tail(&m->port_order_list,
> > +					     &port->list_item);
> > +			port->flag = 1;
> > +		}
> > +	}
> > +}
> > +
> > +static int sw_count_endport_links(osm_switch_t * const *s)
>
> Isn't it better to calculate this count only once, before sort_ports_by_switch_load()?

And to store it somewhere in the osm_switch structure? Yes, this could be a nice micro-optimization.
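
The idea discussed here (count once, then sort on the stored value) could look roughly like the following. This is only a minimal sketch, not OpenSM code: `struct sw_entry`, its `endport_links` field and `sort_by_load()` are hypothetical stand-ins for osm_switch_t and the patch's sort routine, and the count is assumed to have been filled in once per switch before the sort is called.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for osm_switch_t: only what the sort needs. */
struct sw_entry {
	const char *name;
	int endport_links;	/* cached count, computed once before sorting */
};

/*
 * Descending order by the cached count. The comparator reads a stored
 * field instead of walking the switch ports on every comparison.
 */
static int compar_sw_load(const void *a, const void *b)
{
	const struct sw_entry *s1 = *(const struct sw_entry * const *)a;
	const struct sw_entry *s2 = *(const struct sw_entry * const *)b;

	return (s2->endport_links > s1->endport_links) -
	       (s2->endport_links < s1->endport_links);
}

/* Sorts an array of pointers, as the patch does with switch pointers. */
static void sort_by_load(struct sw_entry **s, size_t num)
{
	assert(s != NULL || num == 0);
	qsort(s, num, sizeof(*s), compar_sw_load);
}
```

qsort() may call the comparator on the order of n log n times, so with the cached field each port is examined once per switch rather than once per comparison. The comparator here uses two comparisons instead of a subtraction; for small link counts the subtraction in the patch gives the same ordering.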
Sasha > > > +{ > > + const osm_switch_t *sw = *s; > > + int i, n = 0; > > + for (i = 1; i < sw->num_ports; i++) { > > + osm_physp_t *p = osm_node_get_physp_ptr(sw->p_node, i); > > + if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw && > > + ib_port_info_get_port_state(&p->port_info) == > > + IB_LINK_ACTIVE) > > + n++; > > + } > > + return n; > > +} > > + > > +static int compar_sw_load(const void *s1, const void *s2) > > +{ > > + return sw_count_endport_links(s2) - sw_count_endport_links(s1); > > +} > > + > > +static void sort_ports_by_switch_load(osm_ucast_mgr_t *m) > > +{ > > + int i, num = cl_qmap_count(&m->p_subn->sw_guid_tbl); > > + osm_switch_t **s = malloc(num * sizeof(*s)); > > + if (!s) { > > + OSM_LOG(m->p_log, OSM_LOG_ERROR, "ERR: " > > + "No memory, skip by switch load sorting.\n"); > > + return; > > + } > > + s[0] = (osm_switch_t *)cl_qmap_head(&m->p_subn->sw_guid_tbl); > > + for (i = 1; i < num; i++) > > + s[i] = (osm_switch_t *)cl_qmap_next(&s[i-1]->map_item); > > + > > + qsort(s, num, sizeof(*s), compar_sw_load); > > + > > + for (i = 0; i < num; i++) > > + add_sw_endports_to_order_list(s[i], m); > > +} > > + > > static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > > { > > cl_qlist_init(&p_mgr->port_order_list); > > @@ -758,7 +813,8 @@ static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > > OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR : " > > "cannot parse guid routing order file \'%s\'\n", > > p_mgr->p_subn->opt.guid_routing_order_file); > > - } > > + } else > > + sort_ports_by_switch_load(p_mgr); > > > > if (p_mgr->p_subn->opt.port_prof_ignore_file) { > > cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, > From sean.hefty at intel.com Wed Jan 21 07:51:29 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 21 Jan 2009 07:51:29 -0800 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. 
In-Reply-To: <20090120192413.GA2037@sashak.voltaire.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com> <20090120190111.GE28955@sashak.voltaire.com> <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com> <20090120192413.GA2037@sashak.voltaire.com>
Message-ID:

>> >What about using a c99_to_whatever preprocessor
>> >(sed -e 's/\[.*\] *= *//')?
>>
>> I don't think this helps. Unless the definitions are in the correct
>> location,
>
>Sure. And obviously I was talking about a more sophisticated preprocessor ([]
>constants should be evaluated and the array sorted properly, etc.).

I think all of this is overkill. Is it really that difficult to ensure that new fields are placed into the array correctly, especially compared to the cost, say, of ensuring that the field offsets are correct or maintaining an additional pre-processing script? We're reducing the broader maintenance costs overall by having a single code base.

Is this the only open issue in integrating the ib-mgmt WinOF support changes?

- Sean

From sashak at voltaire.com Wed Jan 21 08:19:46 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 21 Jan 2009 18:19:46 +0200
Subject: [ofa-general] Re: [PATCH 2/3 - no ibcommon] Convert iblinkinfo.pl to C and use new ibnetdisc library.
In-Reply-To: <20090109154754.1d526572.weiny2@llnl.gov>
References: <20090109154754.1d526572.weiny2@llnl.gov>
Message-ID: <20090121161946.GH3479@sashak.voltaire.com>

On 15:47 Fri 09 Jan , Ira Weiny wrote:
> From 139d0ad5ecffecd5c325d865e96cdf038e03a3e5 Mon Sep 17 00:00:00 2001
> From: Ira Weiny
> Date: Mon, 1 Dec 2008 14:55:10 -0800
> Subject: [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library.
> > > Signed-off-by: weiny2 at llnl.gov > --- > infiniband-diags/Makefile.am | 12 +- > infiniband-diags/scripts/iblinkinfo.pl | 327 ------------------------ > infiniband-diags/src/iblinkinfo.c | 423 ++++++++++++++++++++++++++++++++ > 3 files changed, 432 insertions(+), 330 deletions(-) > delete mode 100755 infiniband-diags/scripts/iblinkinfo.pl > create mode 100644 infiniband-diags/src/iblinkinfo.c > > diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am > index 8e8c3c1..d127a4d 100644 > --- a/infiniband-diags/Makefile.am > +++ b/infiniband-diags/Makefile.am > @@ -1,6 +1,7 @@ > SUBDIRS = libibnetdisc > > -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband > +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband \ > + -I$(top_builddir)/libibnetdisc/include > > if DEBUG > DBGFLAGS = -ggdb -D_DEBUG_ > @@ -11,7 +12,7 @@ endif > sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ > src/ibroute src/ibstat src/ibsysstat src/ibtracert \ > src/perfquery src/sminfo src/smpdump src/smpquery \ > - src/saquery src/vendstat > + src/saquery src/vendstat src/iblinkinfo.pl Is it a good idea to make binary file with .pl extension? Probably you like to preserve backward compatibility, but in this case I think it would be better to have real tiny perl script which executes iblinkinfo program. 
> > if ENABLE_TEST_UTILS > sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test > @@ -28,7 +29,7 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \ > scripts/dump_lfts.sh scripts/dump_mfts.sh \ > scripts/set_nodedesc.sh \ > scripts/ibqueryerrors.pl scripts/ibswportwatch.pl \ > - scripts/iblinkinfo.pl scripts/ibprintswitch.pl \ > + scripts/ibprintswitch.pl \ > scripts/ibprintca.pl scripts/ibprintrt.pl \ > scripts/ibfindnodesusing.pl scripts/ibidsverify.pl \ > scripts/check_lft_balance.pl > @@ -40,6 +41,11 @@ src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common > src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS) > src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) > > +src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c > +src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS) > +src_iblinkinfo_pl_LDFLAGS = -Wl,--rpath -Wl,$(libdir) \ > + -L$(srcdir)/libibnetdisc -libnetdisc > + > src_ibping_SOURCES = src/ibping.c src/ibdiag_common.c > src_ibping_CFLAGS = -Wall $(DBGFLAGS) > [snip..] > diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c > new file mode 100644 > index 0000000..c8f224e > --- /dev/null > +++ b/infiniband-diags/src/iblinkinfo.c > @@ -0,0 +1,423 @@ > +/* > + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. 
> + * > + */ > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > + > +char *argv0 = "iblinkinfotest"; > +static FILE *f; > + > +static char *node_name_map_file = NULL; > +static nn_map_t *node_name_map = NULL; > + > +static int timeout_ms = 500; > + > +static int down_links_only = 0; > +static int line_mode = 0; > +static int add_sw_settings = 0; > +static int print_port_guids = 0; > +static int old_output = 0; > + > +static unsigned int > +get_max(unsigned int num) > +{ > + unsigned int v = num; // 32-bit word to find the log base 2 of > + unsigned r = 0; // r will be lg(v) > + > + while (v >>= 1) // unroll for more speed... > + { > + r++; > + } > + > + return (1 << r); > +} > + > +static char * > +get_linkspeed_str(int link_speed) > +{ > + return (ibnd_linkspeed_str(link_speed, old_output)); > +} > + > +void > +get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port) > +{ > + int max_speed = 0; > + > + int max_width = get_max(port->info.link_width_supported > + & port->remoteport->info.link_width_supported); > + if ((max_width & port->info.link_width_active) == 0) { > + // we are not at the max supported width > + // print what we could be at. > + snprintf(width_msg, msg_size, "Could be %s", > + ibnd_linkwidth_str(max_width)); > + } > + > + max_speed = get_max(port->info.link_speed_supported > + & port->remoteport->info.link_speed_supported); > + if ((max_speed & port->info.link_speed_active) == 0) { > + // we are not at the max supported speed > + // print what we could be at. 
> + snprintf(speed_msg, msg_size, "Could be %s", > + get_linkspeed_str(max_speed)); > + } > +} > + > +void > +print_port(ibnd_node_t *node, ibnd_port_t *port) > +{ > + static char remote_guid_str[256]; > + static char remote_str[256]; > + static char link_str[256]; > + static char width_msg[256]; > + static char speed_msg[256]; > + static char ext_port_str[256]; > + static char loc_sma_lid[16]; > + > + if (!port) > + return; > + > + remote_guid_str[0] = '\0'; > + remote_str[0] = '\0'; > + link_str[0] = '\0'; > + width_msg[0] = '\0'; > + speed_msg[0] = '\0'; > + > + snprintf(loc_sma_lid, 16, "%d", node->smalid); > + if (port->remoteport) { > + static char remote_name_buf[256]; Should it be static? > + strncpy(remote_name_buf, port->remoteport->node->nodedesc, 256); > + > + if (port->remoteport->ext_portnum) > + snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum); > + else > + ext_port_str[0] = '\0'; > + > + get_msg(width_msg, speed_msg, 256, port); > + if (line_mode) { > + if (print_port_guids) { > + snprintf(remote_guid_str, 256, > + "0x%016"PRIx64" ", > + port->remoteport->guid); > + } else { > + snprintf(remote_guid_str, 256, > + "0x%016"PRIx64" ", > + port->remoteport->node->info.nodeguid); > + } Really nit (pretty optional - just saves few lines)... snprintf(remote_guid_str, sizeof(remote_guid_str), "0x%016"PRIx64" ", print_port_guids ? port->remoteport->guid : port->remoteport->node->info.nodeguid); > + } > + > + snprintf(remote_str, 256, > + "%s%6d %4d[%2s] \"%s\" ( %s %s)\n", > + remote_guid_str, > + port->remoteport->info.base_lid ? 
> + port->remoteport->info.base_lid : > + port->remoteport->node->smalid, > + port->remoteport->portnum, > + ext_port_str, > + remap_node_name(node_name_map, > + port->remoteport->node->info.nodeguid, > + remote_name_buf), > + width_msg, > + speed_msg > + ); > + } else { > + snprintf(remote_str, 256, > + "%19s%6s %4s[%2s] \"\" ( )\n", "", "", "", ""); > + if (old_output) { > + loc_sma_lid[0] = '\0'; > + } > + } > + > + > + if (add_sw_settings) { > + snprintf(link_str, 256, > + "(%3s %s %6s / %8s) (HOQ:%d VL_Stall:%d)", > + ibnd_linkwidth_str(port->info.link_width_active), > + get_linkspeed_str(port->info.link_speed_active), > + ibnd_linkstate_str(port->info.port_state), > + ibnd_physstate_str(port->info.phys_state), > + port->info.hoq_lifetime, > + port->info.vl_stall_count > + ); > + } else { > + snprintf(link_str, 256, > + "(%3s %s %6s / %8s)", > + ibnd_linkwidth_str(port->info.link_width_active), > + get_linkspeed_str(port->info.link_speed_active), > + ibnd_linkstate_str(port->info.port_state), > + ibnd_physstate_str(port->info.phys_state) > + ); > + } similar... 
n = snprintf(link_str, sizeof(link_str), "(%3s %s %6s / %8s)", ibnd_linkwidth_str(port->info.link_width_active), get_linkspeed_str(port->info.link_speed_active), ibnd_linkstate_str(port->info.port_state), ibnd_physstate_str(port->info.phys_state)); if (add_sw_settings && n < sizeof(link_str)) snprintf(link_str + n, sizeof(link_str) - n, " (HOQ:%d VL_Stall:%d)", port->info.hoq_lifetime, port->info.vl_stall_count); > + > + if (port->ext_portnum) > + snprintf(ext_port_str, 256, "%d", port->ext_portnum); > + else > + ext_port_str[0] = '\0'; > + > + if (line_mode) { > + static char name_buf[256]; > + char *node_name = ""; > + > + if (old_output && (!port->remoteport)) { > + node_name = ""; > + } else { > + strncpy(name_buf, node->nodedesc, 256); > + node_name = remap_node_name(node_name_map, > + node->info.nodeguid, > + name_buf); > + } > + > + printf("0x%016"PRIx64" \"%30s\" %6s %4d[%2s] ==%s==> %s", > + node->info.nodeguid, > + node_name, > + loc_sma_lid, port->portnum, > + ext_port_str, > + link_str, > + remote_str > + ); > + } else { > + printf(" %6s %4d[%2s] ==%s==> %s", > + loc_sma_lid, port->portnum, > + ext_port_str, > + link_str, > + remote_str > + ); > + } > +} > + > +void > +print_switch(ibnd_node_t *node, void *user_data) > +{ > + int i = 0; > + > + if (!line_mode) { > + char name_buf[256]; > + strncpy(name_buf, node->nodedesc, 256); > + printf("Switch 0x%016"PRIx64" %s:\n", > + node->info.nodeguid, > + remap_node_name(node_name_map, > + node->info.nodeguid, > + name_buf)); > + } > + > + for (i = 1; i <= node->info.numports; i++) { > + ibnd_port_t *port = node->ports[i]; > + if (!port) > + continue; > + if (!down_links_only || port->info.port_state == IBND_LINK_DOWN) { > + print_port(node, port); > + } > + } > +} > + > +void > +usage(void) > +{ > + fprintf(stderr, > + "Usage: %s [-hclp -S -D -C -P ]\n" > + " Report link speed and connection for each port of each switch which is active\n" > + " -h This help message\n" > + " -S output only the node specified 
by guid\n" > + " -D print only node specified by \n" > + " -f specify node to start \"from\"\n" > + " -n Number of hops to include away from specified node\n" > + " -d print only down links\n" > + " -l (line mode) print all information for each link on each line\n" > + " -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n" > + > + > + " -t timeout for any single fabric query\n" > + " -s show progress during scan\n" > + " --node-name-map use specified node name map\n" > + > + " -C use selected Channel Adaptor name for queries\n" > + " -P use selected channel adaptor port for queries\n" > + " -g print port guids instead of node guids\n" > + " --debug print debug messages\n" > + " -R (this option is obsolete and does nothing)\n" > + , > + argv0); > + exit(-1); > +} > + > +int > +main(int argc, char **argv) > +{ > + char *ca = 0; > + int ca_port = 0; > + ibnd_fabric_t *fabric = NULL; > + uint64_t guid = 0; > + char *dr_path = NULL; > + char *from = NULL; > + int hops = 0; > + ib_portid_t port_id; > + > + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; > + static const struct option long_opts[] = { > + { "S", 1, 0, 'S'}, > + { "D", 1, 0, 'D'}, > + { "num-hops", 1, 0, 'n'}, > + { "down-links-only", 0, 0, 'd'}, > + { "line-mode", 0, 0, 'l'}, > + { "ca-name", 1, 0, 'C'}, > + { "ca-port", 1, 0, 'P'}, > + { "timeout", 1, 0, 't'}, > + { "show", 0, 0, 's'}, > + { "print-port-guids", 0, 0, 'g'}, > + { "print-additional", 0, 0, 'p'}, > + { "help", 0, 0, 'h'}, > + { "usage", 0, 0, 'u'}, > + { "node-name-map", 1, 0, 1}, > + { "debug", 0, 0, 2}, > + { "compat", 0, 0, 3}, > + { "from", 1, 0, 'f'}, > + { "R", 0, 0, 'R'}, > + { } > + }; > + > + f = stdout; Where is 'f' used? > + > + argv0 = argv[0]; > + > + while (1) { > + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); > + if ( ch == -1 ) > + break; > + switch(ch) { > + case 1: > + node_name_map_file = strdup(optarg); Should we use strdup() here and below? If yes, I don't see any free(). 
Sasha > + break; > + case 2: > + ibnd_debug(1); > + break; > + case 3: > + old_output = 1; > + break; > + case 'f': > + from = strdup(optarg); > + break; > + case 'C': > + ca = strdup(optarg); > + break; > + case 'P': > + ca_port = strtoul(optarg, 0, 0); > + break; > + case 'D': > + dr_path = strdup(optarg); > + break; > + case 'n': > + hops = (int)strtol(optarg, NULL, 0); > + break; > + case 'd': > + down_links_only = 1; > + break; > + case 'l': > + line_mode = 1; > + break; > + case 't': > + timeout_ms = strtoul(optarg, 0, 0); > + break; > + case 's': > + ibnd_show_progress(1); > + break; > + case 'g': > + print_port_guids = 1; > + break; > + case 'S': > + guid = (uint64_t)strtoull(optarg, 0, 0); > + break; > + case 'p': > + add_sw_settings = 1; > + break; > + case 'R': > + /* GNDN */ > + break; > + default: > + usage(); > + break; > + } > + } > + argc -= optind; > + argv += optind; > + > + if (argc && !(f = fopen(argv[0], "w"))) > + fprintf(stderr, "can't open file %s for writing", argv[0]); > + > + node_name_map = open_node_name_map(node_name_map_file); > + > + if (from) { > + /* only scan part of the fabric */ > + str2drpath(&(port_id.drpath), from, 0, 0); > + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) { > + fprintf(stderr, "discover failed\n"); > + exit(1); > + } > + guid = 0; > + } else { > + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { > + fprintf(stderr, "discover failed\n"); > + exit(1); > + } > + } > + > + if (guid) { > + ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid); > + print_switch(sw, NULL); > + } else if (dr_path) { > + ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path); > + print_switch(sw, NULL); > + } else { > + ibnd_iter_nodes_type(fabric, print_switch, IBND_SWITCH_NODE, NULL); > + } > + > + ibnd_destroy_fabric(fabric); > + > + close_node_name_map(node_name_map); > + exit(0); > +} > -- > 1.5.4.5 > From chien.tin.tung at intel.com Wed Jan 21 09:05:54 2009 
From: chien.tin.tung at intel.com (Chien Tung) Date: Wed, 21 Jan 2009 11:11:08 -0600 Subject: [ofa-general] [PATCH] RDMA/nes: Improved use of pbls Message-ID: <20090121171108.GA672@ctung-MOBL> From: Don Wood Two level 256 byte pbls was not implemented so the driver could report out of memory when in fact there were pbls still available. The solution prefers to use 4KB pbls over two level 256B pbls until the number of 4KB pbls falls below a threshold.
At this point the 4KB pbl structure is converted to use 256B pbls which prevents the driver from running out of 4KB pbls too quickly. Also, fixed two places where the software pbl counts were changed before the hardware was updated. This bug allowed another thread to overallocate the hardware resources. Signed-off-by: Don Wood --- Sorry, the subject line in the last email was garbled. Chien drivers/infiniband/hw/nes/nes_verbs.c | 247 ++++++++++++++++++++++---------- 1 files changed, 170 insertions(+), 77 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 4cfb4d9..488e981 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -551,6 +551,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) struct nes_device *nesdev = nesvnic->nesdev; struct nes_adapter *nesadapter = nesdev->nesadapter; int i = 0; + int rc; /* free the resources */ if (nesfmr->leaf_pbl_cnt == 0) { @@ -572,6 +573,8 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) nesmr->ibmw.rkey = ibfmr->rkey; nesmr->ibmw.uobject = NULL; + rc = nes_dealloc_mw(&nesmr->ibmw); + if (nesfmr->nesmr.pbls_used != 0) { spin_lock_irqsave(&nesadapter->pbl_lock, flags); if (nesfmr->nesmr.pbl_4k) { @@ -584,7 +587,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); } - return nes_dealloc_mw(&nesmr->ibmw); + return rc; } @@ -1884,21 +1887,75 @@ static int nes_destroy_cq(struct ib_cq *ib_cq) return ret; } +/** + * root_256 + */ +static u32 root_256(struct nes_device *nesdev, + struct nes_root_vpbl *root_vpbl, + struct nes_root_vpbl *new_root, + u16 pbl_count_4k, + u16 pbl_count_256) +{ + u64 leaf_pbl; + int i, j, k; + + if (pbl_count_4k == 1) { + new_root->pbl_vbase = pci_alloc_consistent(nesdev->pcidev, + 512, &new_root->pbl_pbase); + + if (new_root->pbl_vbase == NULL) + return 0; + + leaf_pbl = (u64)root_vpbl->pbl_pbase; + for (i = 0; i < 16; i++) { + 
new_root->pbl_vbase[i].pa_low = + cpu_to_le32((u32)leaf_pbl); + new_root->pbl_vbase[i].pa_high = + cpu_to_le32((u32)((((u64)leaf_pbl) >> 32))); + leaf_pbl += 256; + } + } else { + for (i = 3; i >= 0; i--) { + j = i * 16; + root_vpbl->pbl_vbase[j] = root_vpbl->pbl_vbase[i]; + leaf_pbl = le32_to_cpu(root_vpbl->pbl_vbase[j].pa_low) + + (((u64)le32_to_cpu(root_vpbl->pbl_vbase[j].pa_high)) + << 32); + for (k = 1; k < 16; k++) { + leaf_pbl += 256; + root_vpbl->pbl_vbase[j + k].pa_low = + cpu_to_le32((u32)leaf_pbl); + root_vpbl->pbl_vbase[j + k].pa_high = + cpu_to_le32((u32)((((u64)leaf_pbl) >> 32))); + } + } + } + + return 1; +} + /** * nes_reg_mr */ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, u32 stag, u64 region_length, struct nes_root_vpbl *root_vpbl, - dma_addr_t single_buffer, u16 pbl_count, u16 residual_page_count, - int acc, u64 *iova_start) + dma_addr_t single_buffer, u16 pbl_count_4k, + u16 residual_page_count_4k, int acc, u64 *iova_start, + u16 *actual_pbl_cnt, u8 *used_4k_pbls) { struct nes_hw_cqp_wqe *cqp_wqe; struct nes_cqp_request *cqp_request; unsigned long flags; int ret; struct nes_adapter *nesadapter = nesdev->nesadapter; - /* int count; */ + uint pg_cnt = 0; + u16 pbl_count_256; + u16 pbl_count = 0; + u8 use_256_pbls = 0; + u8 use_4k_pbls = 0; + u16 use_two_level = (pbl_count_4k > 1) ? 
1 : 0; + struct nes_root_vpbl new_root = {0, 0, 0}; u32 opcode = 0; u16 major_code; @@ -1911,41 +1968,70 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, cqp_request->waiting = 1; cqp_wqe = &cqp_request->cqp_wqe; - spin_lock_irqsave(&nesadapter->pbl_lock, flags); - /* track PBL resources */ - if (pbl_count != 0) { - if (pbl_count > 1) { - /* Two level PBL */ - if ((pbl_count+1) > nesadapter->free_4kpbl) { - nes_debug(NES_DBG_MR, "Out of 4KB Pbls for two level request.\n"); - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); - nes_free_cqp_request(nesdev, cqp_request); - return -ENOMEM; - } else { - nesadapter->free_4kpbl -= pbl_count+1; - } - } else if (residual_page_count > 32) { - if (pbl_count > nesadapter->free_4kpbl) { - nes_debug(NES_DBG_MR, "Out of 4KB Pbls.\n"); - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); - nes_free_cqp_request(nesdev, cqp_request); - return -ENOMEM; - } else { - nesadapter->free_4kpbl -= pbl_count; + if (pbl_count_4k) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + + pg_cnt = ((pbl_count_4k - 1) * 512) + residual_page_count_4k; + pbl_count_256 = (pg_cnt + 31) / 32; + if (pg_cnt <= 32) { + if (pbl_count_256 <= nesadapter->free_256pbl) + use_256_pbls = 1; + else if (pbl_count_4k <= nesadapter->free_4kpbl) + use_4k_pbls = 1; + } else if (pg_cnt <= 2048) { + if (((pbl_count_4k + use_two_level) <= nesadapter->free_4kpbl) && + (nesadapter->free_4kpbl > (nesadapter->max_4kpbl >> 1))) { + use_4k_pbls = 1; + } else if ((pbl_count_256 + 1) <= nesadapter->free_256pbl) { + use_256_pbls = 1; + use_two_level = 1; + } else if ((pbl_count_4k + use_two_level) <= nesadapter->free_4kpbl) { + use_4k_pbls = 1; } } else { - if (pbl_count > nesadapter->free_256pbl) { - nes_debug(NES_DBG_MR, "Out of 256B Pbls.\n"); - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); - nes_free_cqp_request(nesdev, cqp_request); - return -ENOMEM; - } else { - nesadapter->free_256pbl -= pbl_count; - } + if ((pbl_count_4k + 1) <= 
nesadapter->free_4kpbl) + use_4k_pbls = 1; + } + + if (use_256_pbls) { + pbl_count = pbl_count_256; + nesadapter->free_256pbl -= pbl_count + use_two_level; + } else if (use_4k_pbls) { + pbl_count = pbl_count_4k; + nesadapter->free_4kpbl -= pbl_count + use_two_level; + } else { + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + nes_debug(NES_DBG_MR, "Out of Pbls\n"); + nes_free_cqp_request(nesdev, cqp_request); + return -ENOMEM; } + + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); } - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + if (use_256_pbls && use_two_level) { + if (root_256(nesdev, root_vpbl, &new_root, pbl_count_4k, pbl_count_256) == 1) { + if (new_root.pbl_pbase != 0) + root_vpbl = &new_root; + } else { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + nesadapter->free_256pbl += pbl_count_256 + use_two_level; + use_256_pbls = 0; + + if (pbl_count_4k == 1) + use_two_level = 0; + pbl_count = pbl_count_4k; + + if ((pbl_count_4k + use_two_level) <= nesadapter->free_4kpbl) { + nesadapter->free_4kpbl -= pbl_count + use_two_level; + use_4k_pbls = 1; + } + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + + if (use_4k_pbls == 0) + return -ENOMEM; + } + } opcode = NES_CQP_REGISTER_STAG | NES_CQP_STAG_RIGHTS_LOCAL_READ | NES_CQP_STAG_VA_TO | NES_CQP_STAG_MR; @@ -1974,10 +2060,9 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, } else { set_wqe_64bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PA_LOW_IDX, root_vpbl->pbl_pbase); set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX, pbl_count); - set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PBL_LEN_IDX, - (((pbl_count - 1) * 4096) + (residual_page_count*8))); + set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PBL_LEN_IDX, (pg_cnt * 8)); - if ((pbl_count > 1) || (residual_page_count > 32)) + if (use_4k_pbls) cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_STAG_PBL_BLK_SIZE); } barrier(); @@ -1994,13 
+2079,25 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, major_code = cqp_request->major_code; nes_put_cqp_request(nesdev, cqp_request); + if ((!ret || major_code) && pbl_count != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if (use_256_pbls) + nesadapter->free_256pbl += pbl_count + use_two_level; + else if (use_4k_pbls) + nesadapter->free_4kpbl += pbl_count + use_two_level; + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } + if (new_root.pbl_pbase) + pci_free_consistent(nesdev->pcidev, 512, new_root.pbl_vbase, + new_root.pbl_pbase); + if (!ret) return -ETIME; else if (major_code) return -EIO; - else - return 0; + *actual_pbl_cnt = pbl_count + use_two_level; + *used_4k_pbls = use_4k_pbls; return 0; } @@ -2165,18 +2262,14 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, pbl_count = root_pbl_index; } ret = nes_reg_mr(nesdev, nespd, stag, region_length, &root_vpbl, - buffer_list[0].addr, pbl_count, (u16)cur_pbl_index, acc, iova_start); + buffer_list[0].addr, pbl_count, (u16)cur_pbl_index, acc, iova_start, + &nesmr->pbls_used, &nesmr->pbl_4k); if (ret == 0) { nesmr->ibmr.rkey = stag; nesmr->ibmr.lkey = stag; nesmr->mode = IWNES_MEMREG_TYPE_MEM; ibmr = &nesmr->ibmr; - nesmr->pbl_4k = ((pbl_count > 1) || (cur_pbl_index > 32)) ? 
1 : 0; - nesmr->pbls_used = pbl_count; - if (pbl_count > 1) { - nesmr->pbls_used++; - } } else { kfree(nesmr); ibmr = ERR_PTR(-ENOMEM); @@ -2454,8 +2547,9 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, stag, (unsigned int)iova_start, (unsigned int)region_length, stag_index, (unsigned long long)region->length, pbl_count); - ret = nes_reg_mr( nesdev, nespd, stag, region->length, &root_vpbl, - first_dma_addr, pbl_count, (u16)cur_pbl_index, acc, &iova_start); + ret = nes_reg_mr(nesdev, nespd, stag, region->length, &root_vpbl, + first_dma_addr, pbl_count, (u16)cur_pbl_index, acc, + &iova_start, &nesmr->pbls_used, &nesmr->pbl_4k); nes_debug(NES_DBG_MR, "ret=%d\n", ret); @@ -2464,11 +2558,6 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, nesmr->ibmr.lkey = stag; nesmr->mode = IWNES_MEMREG_TYPE_MEM; ibmr = &nesmr->ibmr; - nesmr->pbl_4k = ((pbl_count > 1) || (cur_pbl_index > 32)) ? 1 : 0; - nesmr->pbls_used = pbl_count; - if (pbl_count > 1) { - nesmr->pbls_used++; - } } else { ib_umem_release(region); kfree(nesmr); @@ -2607,24 +2696,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr) cqp_request->waiting = 1; cqp_wqe = &cqp_request->cqp_wqe; - spin_lock_irqsave(&nesadapter->pbl_lock, flags); - if (nesmr->pbls_used != 0) { - if (nesmr->pbl_4k) { - nesadapter->free_4kpbl += nesmr->pbls_used; - if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) { - printk(KERN_ERR PFX "free 4KB PBLs(%u) has exceeded the max(%u)\n", - nesadapter->free_4kpbl, nesadapter->max_4kpbl); - } - } else { - nesadapter->free_256pbl += nesmr->pbls_used; - if (nesadapter->free_256pbl > nesadapter->max_256pbl) { - printk(KERN_ERR PFX "free 256B PBLs(%u) has exceeded the max(%u)\n", - nesadapter->free_256pbl, nesadapter->max_256pbl); - } - } - } - - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); nes_fill_init_cqp_wqe(cqp_wqe, nesdev); set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_OPCODE_IDX, NES_CQP_DEALLOCATE_STAG | 
NES_CQP_STAG_VA_TO | @@ -2642,11 +2713,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr) " CQP Major:Minor codes = 0x%04X:0x%04X\n", ib_mr->rkey, ret, cqp_request->major_code, cqp_request->minor_code); - nes_free_resource(nesadapter, nesadapter->allocated_mrs, - (ib_mr->rkey & 0x0fffff00) >> 8); - - kfree(nesmr); - major_code = cqp_request->major_code; minor_code = cqp_request->minor_code; @@ -2662,8 +2728,35 @@ static int nes_dereg_mr(struct ib_mr *ib_mr) " to destroy STag, ib_mr=%p, rkey = 0x%08X\n", major_code, minor_code, ib_mr, ib_mr->rkey); return -EIO; - } else - return 0; + } + + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if (nesmr->pbls_used != 0) { + if (nesmr->pbl_4k) { + nesadapter->free_4kpbl += nesmr->pbls_used; + if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) + printk(KERN_ERR PFX "free 4KB PBLs(%u) has " + "exceeded the max(%u)\n", + nesadapter->free_4kpbl, + nesadapter->max_4kpbl); + } else { + nesadapter->free_256pbl += nesmr->pbls_used; + if (nesadapter->free_256pbl > nesadapter->max_256pbl) + printk(KERN_ERR PFX "free 256B PBLs(%u) has " + "exceeded the max(%u)\n", + nesadapter->free_256pbl, + nesadapter->max_256pbl); + } + } + + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + (ib_mr->rkey & 0x0fffff00) >> 8); + + kfree(nesmr); + + return 0; } -- 1.5.3.3
From alexxy at gentoo.org Wed Jan 21 10:04:30 2009
From: alexxy at gentoo.org (Alexey Shvetsov)
Date: Wed, 21 Jan 2009 21:04:30 +0300
Subject: Re: [ofa-general] OFED-1.4 and Gentoo
In-Reply-To: <20090121153159.GE3479@sashak.voltaire.com>
References: <200901211755.28192.alexxy@gentoo.org> <20090121153159.GE3479@sashak.voltaire.com>
Message-ID: <200901212104.37112.alexxy@gentoo.org>

On Wednesday 21 January 2009 18:31:59 Sasha Khapyorsky wrote:
> On 17:55 Wed 21 Jan , Alexey Shvetsov wrote:
> > I want to add OFED-1.4 to the Gentoo main tree, but it's a little bit annoying
> > since OFED-1.4.tgz contains src rpms with
regular tarballs inside
> > Can OpenFabrics org devs make split OFED-1.4 or next OFED release
> > tarballs publicly available?
>
> IB management (libibumad, OpenSM, infiniband-diags) and some other
> tarballs are publicly available at:
>
> http://www.openfabrics.org/downloads/
>
> (see also http://www.openfabrics.org/download_linux.htm)
>
> Sasha

I know, but not all of the components included in OFED are available there.

-- 
Alexey 'Alexxy' Shvetsov
Gentoo Team Ru

From sashak at voltaire.com Wed Jan 21 10:24:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 21 Jan 2009 20:24:33 +0200
Subject: [ofa-general] Re: [PATCH 3/3 - no ibcommon] Convert ibnetdiscover to use new ibnetdisc library.
In-Reply-To: <20090109154759.3f5d97b2.weiny2@llnl.gov>
References: <20090109154759.3f5d97b2.weiny2@llnl.gov>
Message-ID: <20090121182433.GI3479@sashak.voltaire.com>

On 15:47 Fri 09 Jan , Ira Weiny wrote:
> From 71940caa935c4757b0b4d4368ecafd34eb1d6dc1 Mon Sep 17 00:00:00 2001
> From: Ira Weiny
> Date: Tue, 2 Dec 2008 16:29:29 -0800
> Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library.
>
> Removed -e and -v since they were somewhat redundant with the -d option.

I'm not so happy with those removals. Why are they redundant? '-v' sets the ibnetdiscover verbosity, '-d' sets the madrpc and umad debug level, and '-e' shows madrpc errors. Also, those options are common across many infiniband-diags tools.
>
> All other functionality is preserved
>
> Signed-off-by: weiny2 at llnl.gov
> ---
> infiniband-diags/Makefile.am | 4 +-
> infiniband-diags/include/grouping.h | 113 ----
> infiniband-diags/man/ibnetdiscover.8 | 10 +-
> infiniband-diags/scripts/dump_lfts.sh | 2 +-
> infiniband-diags/scripts/dump_mfts.sh | 2 +-
> infiniband-diags/src/grouping.c | 786 --------------------------
> infiniband-diags/src/ibnetdiscover.c | 984 +++++++++++---------------------
> 7 files changed, 315 insertions(+), 1586 deletions(-)
> delete mode 100644 infiniband-diags/include/grouping.h
> delete mode 100644 infiniband-diags/src/grouping.c
>
> diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am
> index d127a4d..2ccf082 100644
> --- a/infiniband-diags/Makefile.am
> +++ b/infiniband-diags/Makefile.am
> @@ -37,9 +37,9 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \
> src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c
> src_ibaddr_CFLAGS = -Wall $(DBGFLAGS)
>
> -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c
> +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/ibdiag_common.c
> src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS)
> -src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir)
> +src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) -L$(srcdir)/libibnetdisc -libnetdisc

Here we probably need to use $(top_builddir) instead of $(srcdir) in order to support configuring and building in a separate tree.

>
> src_iblinkinfo_pl_SOURCES = src/iblinkinfo.c
> src_iblinkinfo_pl_CFLAGS = -Wall $(DBGFLAGS)

[snip...]

> diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c
> index 296cb07..0c4aa13 100644
> --- a/infiniband-diags/src/ibnetdiscover.c
> +++ b/infiniband-diags/src/ibnetdiscover.c
> @@ -1,6 +1,7 @@
> /*
> * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved.
> * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved.
> + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -50,479 +51,104 @@ > #include > #include > #include > +#include > > -#include "ibnetdiscover.h" > -#include "grouping.h" > #include "ibdiag_common.h" > > -static char *node_type_str[] = { > - "???", > - "ca", > - "switch", > - "router", > - "iwarp rnic" > -}; > - > -static char *linkwidth_str[] = { > - "??", > - "1x", > - "4x", > - "??", > - "8x", > - "??", > - "??", > - "??", > - "12x" > -}; > - > -static char *linkspeed_str[] = { > - "???", > - "SDR", > - "DDR", > - "???", > - "QDR" > -}; > - > -static int timeout = 2000; /* ms */ > -static int dumplevel = 0; > +static int debug; > static int verbose; > -static FILE *f; > +#define LIST_CA_NODE (1 << IBND_CA_NODE) > +#define LIST_SWITCH_NODE (1 << IBND_SWITCH_NODE) > +#define LIST_ROUTER_NODE (1 << IBND_ROUTER_NODE) > > char *argv0 = "ibnetdiscover"; > +static FILE *f; > > static char *node_name_map_file = NULL; > static nn_map_t *node_name_map = NULL; > > -Node *nodesdist[MAXHOPS+1]; /* last is Ca list */ > -Node *mynode; > -int maxhops_discovered = 0; > - > -struct ChassisList *chassis = NULL; > - > -static char * > -get_linkwidth_str(int linkwidth) > -{ > - if (linkwidth > 8) > - return linkwidth_str[0]; > - else > - return linkwidth_str[linkwidth]; > -} > - > -static char * > -get_linkspeed_str(int linkspeed) > -{ > - if (linkspeed > 4) > - return linkspeed_str[0]; > - else > - return linkspeed_str[linkspeed]; > -} > - > -static inline const char* > -node_type_str2(Node *node) > -{ > - switch(node->type) { > - case SWITCH_NODE: return "SW"; > - case CA_NODE: return "CA"; > - case ROUTER_NODE: return "RT"; > - } > - return "??"; > -} > - > -void > -decode_port_info(void *pi, Port *port) > -{ > - mad_decode_field(pi, IB_PORT_LID_F, &port->lid); > - mad_decode_field(pi, IB_PORT_LMC_F, 
&port->lmc); > - mad_decode_field(pi, IB_PORT_STATE_F, &port->state); > - mad_decode_field(pi, IB_PORT_PHYS_STATE_F, &port->physstate); > - mad_decode_field(pi, IB_PORT_LINK_WIDTH_ACTIVE_F, &port->linkwidth); > - mad_decode_field(pi, IB_PORT_LINK_SPEED_ACTIVE_F, &port->linkspeed); > -} > - > - > -int > -get_port(Port *port, int portnum, ib_portid_t *portid) > -{ > - char portinfo[64]; > - void *pi = portinfo; > - > - port->portnum = portnum; > - > - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) > - return -1; > - decode_port_info(pi, port); > - > - DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s", > - portid2str(portid), portnum, port->lid, port->state, port->physstate, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed)); > - return 1; > -} > -/* > - * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. > - */ > -int > -get_node(Node *node, Port *port, ib_portid_t *portid) > -{ > - char portinfo[64]; > - char switchinfo[64]; > - void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; > - void *si = switchinfo; > - > - if (!smp_query(ni, portid, IB_ATTR_NODE_INFO, 0, timeout)) > - return -1; > - > - mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); > - mad_decode_field(ni, IB_NODE_TYPE_F, &node->type); > - mad_decode_field(ni, IB_NODE_NPORTS_F, &node->numports); > - mad_decode_field(ni, IB_NODE_DEVID_F, &node->devid); > - mad_decode_field(ni, IB_NODE_VENDORID_F, &node->vendid); > - mad_decode_field(ni, IB_NODE_SYSTEM_GUID_F, &node->sysimgguid); > - mad_decode_field(ni, IB_NODE_PORT_GUID_F, &node->portguid); > - mad_decode_field(ni, IB_NODE_LOCAL_PORT_F, &node->localport); > - port->portnum = node->localport; > - port->portguid = node->portguid; > - > - if (!smp_query(nd, portid, IB_ATTR_NODE_DESC, 0, timeout)) > - return -1; > - > - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, 0, timeout)) > - return -1; > - decode_port_info(pi, port); > - > - if (node->type != SWITCH_NODE) > 
- return 0; > - > - node->smalid = port->lid; > - node->smalmc = port->lmc; > - > - /* after we have the sma information find out the real PortInfo for this port */ > - if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, node->localport, timeout)) > - return -1; > - decode_port_info(pi, port); > - > - if (!smp_query(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout)) > - node->smaenhsp0 = 0; /* assume base SP0 */ > - else > - mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); > - > - DEBUG("portid %s: got switch node %" PRIx64 " '%s'", > - portid2str(portid), node->nodeguid, node->nodedesc); > - return 1; > -} > - > -static int > -extend_dpath(ib_dr_path_t *path, int nextport) > -{ > - if (path->cnt+2 >= sizeof(path->p)) > - return -1; > - ++path->cnt; > - if (path->cnt > maxhops_discovered) > - maxhops_discovered = path->cnt; > - path->p[path->cnt] = nextport; > - return path->cnt; > -} > - > -static void > -dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port) > -{ > - if (!dumplevel) > - return; > - > - fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n", > - portid2str(path), prompt, > - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), > - node->nodeguid, node->type == SWITCH_NODE ? 
0 : port->portnum, > - port->lid, port->lid + (1 << port->lmc) - 1, > - clean_nodedesc(node->nodedesc)); > -} > - > -#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) > -#define HTSZ 137 > - > -static Node *nodestbl[HTSZ]; > - > -static Node * > -find_node(Node *new) > -{ > - int hash = HASHGUID(new->nodeguid) % HTSZ; > - Node *node; > - > - for (node = nodestbl[hash]; node; node = node->htnext) > - if (node->nodeguid == new->nodeguid) > - return node; > - > - return NULL; > -} > - > -static Node * > -create_node(Node *temp, ib_portid_t *path, int dist) > -{ > - Node *node; > - int hash = HASHGUID(temp->nodeguid) % HTSZ; > - > - node = malloc(sizeof(*node)); > - if (!node) > - return NULL; > - > - memcpy(node, temp, sizeof(*node)); > - node->dist = dist; > - node->path = *path; > - > - node->htnext = nodestbl[hash]; > - nodestbl[hash] = node; > - > - if (node->type != SWITCH_NODE) > - dist = MAXHOPS; /* special Ca list */ > - > - node->dnext = nodesdist[dist]; > - nodesdist[dist] = node; > - > - return node; > -} > - > -static Port * > -find_port(Node *node, Port *port) > -{ > - Port *old; > - > - for (old = node->ports; old; old = old->next) > - if (old->portnum == port->portnum) > - return old; > - > - return NULL; > -} > - > -static Port * > -create_port(Node *node, Port *temp) > -{ > - Port *port; > - > - port = malloc(sizeof(*port)); > - if (!port) > - return NULL; > - > - memcpy(port, temp, sizeof(*port)); > - port->node = node; > - port->next = node->ports; > - node->ports = port; > - > - return port; > -} > - > -static void > -link_ports(Node *node, Port *port, Node *remotenode, Port *remoteport) > -{ > - DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u", > - node->nodeguid, node, port, port->portnum, > - remotenode->nodeguid, remotenode, remoteport, remoteport->portnum); > - if (port->remoteport) > - port->remoteport->remoteport = NULL; > - if (remoteport->remoteport) > - 
remoteport->remoteport->remoteport = NULL; > - port->remoteport = remoteport; > - remoteport->remoteport = port; > -} > - > -static int > -handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist) > -{ > - Node node_buf; > - Port port_buf; > - Node *remotenode, *oldnode; > - Port *remoteport, *oldport; > - > - memset(&node_buf, 0, sizeof(node_buf)); > - memset(&port_buf, 0, sizeof(port_buf)); > - > - DEBUG("handle node %p port %p:%d dist %d", node, port, portnum, dist); > - if (port->physstate != 5) /* LinkUp */ > - return -1; > - > - if (extend_dpath(&path->drpath, portnum) < 0) > - return -1; > - > - if (get_node(&node_buf, &port_buf, path) < 0) { > - IBWARN("NodeInfo on %s failed, skipping port", > - portid2str(path)); > - path->drpath.cnt--; /* restore path */ > - return -1; > - } > - > - oldnode = find_node(&node_buf); > - if (oldnode) > - remotenode = oldnode; > - else if (!(remotenode = create_node(&node_buf, path, dist + 1))) > - IBERROR("no memory"); > - > - oldport = find_port(remotenode, &port_buf); > - if (oldport) { > - remoteport = oldport; > - if (node != remotenode || port != remoteport) > - IBWARN("port moving..."); > - } else if (!(remoteport = create_port(remotenode, &port_buf))) > - IBERROR("no memory"); > - > - dump_endnode(path, oldnode ? "known remote" : "new remote", > - remotenode, remoteport); > - > - link_ports(node, port, remotenode, remoteport); > - > - path->drpath.cnt--; /* restore path */ > - return 0; > -} > - > -/* > - * Return 1 if found, 0 if not, -1 on errors. 
> - */ > -static int > -discover(ib_portid_t *from) > -{ > - Node node_buf; > - Port port_buf; > - Node *node; > - Port *port; > - int i; > - int dist = 0; > - ib_portid_t *path; > - > - DEBUG("from %s", portid2str(from)); > - > - memset(&node_buf, 0, sizeof(node_buf)); > - memset(&port_buf, 0, sizeof(port_buf)); > - > - if (get_node(&node_buf, &port_buf, from) < 0) { > - IBWARN("can't reach node %s", portid2str(from)); > - return -1; > - } > - > - node = create_node(&node_buf, from, 0); > - if (!node) > - IBERROR("out of memory"); > +static int timeout_ms = 2000; > > - mynode = node; > - > - port = create_port(node, &port_buf); > - if (!port) > - IBERROR("out of memory"); > - > - if (node->type != SWITCH_NODE && > - handle_port(node, port, from, node->localport, 0) < 0) > - return 0; > - > - for (dist = 0; dist < MAXHOPS; dist++) { > - > - for (node = nodesdist[dist]; node; node = node->dnext) { > - > - path = &node->path; > - > - DEBUG("dist %d node %p", dist, node); > - dump_endnode(path, "processing", node, port); > - > - for (i = 1; i <= node->numports; i++) { > - if (i == node->localport) > - continue; > - > - if (get_port(&port_buf, i, path) < 0) { > - IBWARN("can't reach node %s port %d", portid2str(path), i); > - continue; > - } > - > - port = find_port(node, &port_buf); > - if (port) > - continue; > - > - port = create_port(node, &port_buf); > - if (!port) > - IBERROR("out of memory"); > - > - /* If switch, set port GUID to node GUID */ > - if (node->type == SWITCH_NODE) > - port->portguid = node->portguid; > - > - handle_port(node, port, path, i, dist); > - } > - } > - } > - > - return 0; > -} > > char * > -node_name(Node *node) > +node_name(ibnd_node_t *node) > { > static char buf[256]; > > - switch(node->type) { > - case SWITCH_NODE: > - sprintf(buf, "\"%s", "S"); > - break; > - case CA_NODE: > + switch(node->info.type) { > + case IBND_CA_NODE: > sprintf(buf, "\"%s", "H"); > break; > - case ROUTER_NODE: > + case IBND_SWITCH_NODE: > + sprintf(buf, 
"\"%s", "S"); > + break; > + case IBND_ROUTER_NODE: > sprintf(buf, "\"%s", "R"); > break; > default: > sprintf(buf, "\"%s", "?"); > break; > } > - sprintf(buf+2, "-%016" PRIx64 "\"", node->nodeguid); > + sprintf(buf+2, "-%016" PRIx64 "\"", node->info.nodeguid); > > return buf; > } > > void > -list_node(Node *node) > +list_node(ibnd_node_t *node, void *user_data) > { > - char *node_type; > - char *nodename = remap_node_name(node_name_map, node->nodeguid, > + char *nodename = remap_node_name(node_name_map, node->info.nodeguid, > node->nodedesc); > > - switch(node->type) { > - case SWITCH_NODE: > - node_type = "Switch"; > - break; > - case CA_NODE: > - node_type = "Ca"; > - break; > - case ROUTER_NODE: > - node_type = "Router"; > - break; > - default: > - node_type = "???"; > - break; > - } > fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", > - node_type, > - node->nodeguid, node->numports, node->devid, node->vendid, > + ibnd_node_type_str(node), > + node->info.nodeguid, node->info.numports, node->info.devid, > + node->info.vendid, > nodename); > > free(nodename); > } > > void > -out_ids(Node *node, int group, char *chname) > +list_nodes(ibnd_fabric_t *fabric, int list) > +{ > + if (list & LIST_CA_NODE) { > + ibnd_iter_nodes_type(fabric, list_node, IBND_CA_NODE, NULL); > + } > + if (list & LIST_SWITCH_NODE) { > + ibnd_iter_nodes_type(fabric, list_node, IBND_SWITCH_NODE, NULL); > + } > + if (list & LIST_ROUTER_NODE) { > + ibnd_iter_nodes_type(fabric, list_node, IBND_ROUTER_NODE, NULL); > + } > +} > + > +void > +out_ids(ibnd_node_t *node, int group, char *chname) > { > - fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); > - if (node->sysimgguid) > - fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); > - if (group > - && node->chrecord && node->chrecord->chassisnum) { > - fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); > + fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->info.vendid, node->info.devid); > + 
if (node->info.sysimgguid) > + fprintf(f, "sysimgguid=0x%" PRIx64, node->info.sysimgguid); > + if (group && node->chassis && node->chassis->chassisnum) { > + fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum); > if (chname) > - fprintf(f, " (%s)", chname); > - if (is_xsigo_tca(node->nodeguid) && node->ports->remoteport) > - fprintf(f, " slot %d", node->ports->remoteport->portnum); > + fprintf(f, " (%s)", clean_nodedesc(chname)); > + if (ibnd_is_xsigo_tca(node->info.nodeguid) > + && node->ports[1] > + && node->ports[1]->remoteport) > + fprintf(f, " slot %d", node->ports[1]->remoteport->portnum); > } > fprintf(f, "\n"); > } > > + > uint64_t > -out_chassis(int chassisnum) > +out_chassis(ibnd_fabric_t *fabric, int chassisnum) > { > uint64_t guid; > > fprintf(f, "\nChassis %d", chassisnum); > - guid = get_chassis_guid(chassisnum); > + guid = ibnd_get_chassis_guid(fabric, chassisnum); > if (guid) > fprintf(f, " (guid 0x%" PRIx64 ")", guid); > fprintf(f, "\n"); > @@ -530,54 +156,49 @@ out_chassis(int chassisnum) > } > > void > -out_switch(Node *node, int group, char *chname) > +out_switch(ibnd_node_t *node, int group, char *chname) > { > char *str; > + char str2[256]; > char *nodename = NULL; > > out_ids(node, group, chname); > - fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); > - fprintf(f, "(%" PRIx64 ")", node->portguid); > - /* Currently, only if Voltaire chassis */ > - if (group > - && node->chrecord && node->chrecord->chassisnum > - && node->vendid == VTR_VENDOR_ID) { > - str = get_chassis_type(node->chrecord->chassistype); > + fprintf(f, "switchguid=0x%" PRIx64, node->info.nodeguid); > + fprintf(f, "(%" PRIx64 ")", node->info.nodeportguid); > + if (group) { > + str = ibnd_get_chassis_type(node); > if (str) > fprintf(f, "%s ", str); > - str = get_chassis_slot(node->chrecord->chassisslot); > + str = ibnd_get_chassis_slot_str(node, str2, 256); > if (str) > - fprintf(f, "%s ", str); > - fprintf(f, "%d Chip %d", node->chrecord->slotnum, 
node->chrecord->anafanum); > + fprintf(f, "%s", str); > } > > - nodename = remap_node_name(node_name_map, node->nodeguid, > + nodename = remap_node_name(node_name_map, node->info.nodeguid, > node->nodedesc); > > fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n", > - node->numports, node_name(node), > + node->info.numports, node_name(node), > nodename, > - node->smaenhsp0 ? "enhanced" : "base", > + node->sw_info.smaenhsp0 ? "enhanced" : "base", > node->smalid, node->smalmc); > > free(nodename); > } > > void > -out_ca(Node *node, int group, char *chname) > +out_ca(ibnd_node_t *node, int group, char *chname) > { > char *node_type; > char *node_type2; > - char *nodename = remap_node_name(node_name_map, node->nodeguid, > - node->nodedesc); > > out_ids(node, group, chname); > - switch(node->type) { > - case CA_NODE: > + switch(node->info.type) { > + case IBND_CA_NODE: > node_type = "ca"; > node_type2 = "Ca"; > break; > - case ROUTER_NODE: > + case IBND_ROUTER_NODE: > node_type = "rt"; > node_type2 = "Rt"; > break; > @@ -587,37 +208,37 @@ out_ca(Node *node, int group, char *chname) > break; > } > > - fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); > + fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->info.nodeguid); > fprintf(f, "%s\t%d %s\t\t# \"%s\"", > - node_type2, node->numports, node_name(node), > - nodename); > - if (group && is_xsigo_hca(node->nodeguid)) > + node_type2, node->info.numports, node_name(node), > + clean_nodedesc(node->nodedesc)); > + if (group && ibnd_is_xsigo_hca(node->info.nodeguid)) > fprintf(f, " (scp)"); > fprintf(f, "\n"); > - > - free(nodename); > } > > +#define OUT_BUFFER_SIZE 16 > static char * > -out_ext_port(Port *port, int group) > +out_ext_port(ibnd_port_t *port, int group) > { > - char *str = NULL; > + static char mapping[OUT_BUFFER_SIZE]; Maybe instead of using static here it would be better to pass print buffer to this function and return number of printed characters? 
> > - /* Currently, only if Voltaire chassis */ > - if (group > - && port->node->chrecord && port->node->vendid == VTR_VENDOR_ID) > - str = portmapstring(port); > + if (group && port->ext_portnum != 0) { > + snprintf(mapping, OUT_BUFFER_SIZE, > + "[ext %d]", port->ext_portnum); > + return (mapping); > + } > > - return (str); > + return (NULL); > } > > void > -out_switch_port(Port *port, int group) > +out_switch_port(ibnd_port_t *port, int group) > { > char *ext_port_str = NULL; > char *rem_nodename = NULL; > > - DEBUG("port %p:%d remoteport %p", port, port->portnum, port->remoteport); > + DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport); > fprintf(f, "[%d]", port->portnum); > > ext_port_str = out_ext_port(port, group); > @@ -625,7 +246,7 @@ out_switch_port(Port *port, int group) > fprintf(f, "%s", ext_port_str); > > rem_nodename = remap_node_name(node_name_map, > - port->remoteport->node->nodeguid, > + port->remoteport->node->info.nodeguid, > port->remoteport->node->nodedesc); > > ext_port_str = out_ext_port(port->remoteport, group); > @@ -633,17 +254,19 @@ out_switch_port(Port *port, int group) > node_name(port->remoteport->node), > port->remoteport->portnum, > ext_port_str ? ext_port_str : ""); > - if (port->remoteport->node->type != SWITCH_NODE) > - fprintf(f, "(%" PRIx64 ") ", port->remoteport->portguid); > + if (port->remoteport->node->info.type != IBND_SWITCH_NODE) > + fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid); > fprintf(f, "\t\t# \"%s\" lid %d %s%s", > rem_nodename, > - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, > - get_linkwidth_str(port->linkwidth), > - get_linkspeed_str(port->linkspeed)); > + port->remoteport->node->info.type == IBND_SWITCH_NODE ? 
> + port->remoteport->node->smalid : > + port->remoteport->info.base_lid, > + ibnd_linkwidth_str(port->info.link_width_active), > + ibnd_linkspeed_str(port->info.link_speed_active, 0)); > > - if (is_xsigo_tca(port->remoteport->portguid)) > + if (ibnd_is_xsigo_tca(port->remoteport->guid)) > fprintf(f, " slot %d", port->portnum); > - else if (is_xsigo_hca(port->remoteport->portguid)) > + else if (ibnd_is_xsigo_hca(port->remoteport->guid)) > fprintf(f, " (scp)"); > fprintf(f, "\n"); > > @@ -651,278 +274,294 @@ out_switch_port(Port *port, int group) > } > > void > -out_ca_port(Port *port, int group) > +out_ca_port(ibnd_port_t *port, int group) > { > char *str = NULL; > char *rem_nodename = NULL; > > fprintf(f, "[%d]", port->portnum); > - if (port->node->type != SWITCH_NODE) > - fprintf(f, "(%" PRIx64 ") ", port->portguid); > + if (port->node->info.type != IBND_SWITCH_NODE) > + fprintf(f, "(%" PRIx64 ") ", port->guid); > fprintf(f, "\t%s[%d]", > node_name(port->remoteport->node), > port->remoteport->portnum); > str = out_ext_port(port->remoteport, group); > if (str) > fprintf(f, "%s", str); > - if (port->remoteport->node->type != SWITCH_NODE) > - fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid); > + if (port->remoteport->node->info.type != IBND_SWITCH_NODE) > + fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid); > > rem_nodename = remap_node_name(node_name_map, > - port->remoteport->node->nodeguid, > + port->remoteport->node->info.nodeguid, > port->remoteport->node->nodedesc); > > fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n", > - port->lid, port->lmc, rem_nodename, > - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, > - get_linkwidth_str(port->linkwidth), > - get_linkspeed_str(port->linkspeed)); > + port->info.base_lid, port->info.lmc, rem_nodename, > + port->remoteport->node->info.type == IBND_SWITCH_NODE ? 
> + port->remoteport->node->smalid : > + port->remoteport->info.base_lid, > + ibnd_linkwidth_str(port->info.link_width_active), > + ibnd_linkspeed_str(port->info.link_speed_active, 0)); > > free(rem_nodename); > } > > +struct iter_user_data { > + int group; > + int skip_chassis_nodes; > +}; > + > +static void > +switch_iter_func(ibnd_node_t *node, void *iter_user_data) > +{ > + ibnd_port_t *port; > + int p = 0; > + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; Casting from void * is not necessary. > + > + DEBUG("SWITCH: node %p\n", node); > + > + /* skip chassis based switches if flagged */ > + if (data->skip_chassis_nodes && node->chassis && node->chassis->chassisnum) > + return; > + > + out_switch(node, data->group, NULL); > + for (p = 1; p <= node->info.numports; p++) { > + port = node->ports[p]; > + if (port && port->remoteport) > + out_switch_port(port, data->group); > + } > +} > + > +static void > +ca_iter_func(ibnd_node_t *node, void *iter_user_data) > +{ > + ibnd_port_t *port; > + int p = 0; > + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; Ditto. > + > + DEBUG("CA: node %p\n", node); > + /* Now, skip chassis based CAs */ > + if (data->group && node->chassis && node->chassis->chassisnum) > + return; > + out_ca(node, data->group, NULL); > + > + for (p = 1; p <= node->info.numports; p++) { > + port = node->ports[p]; > + if (port && port->remoteport) > + out_ca_port(port, data->group); > + } > +} > + > +static void > +router_iter_func(ibnd_node_t *node, void *iter_user_data) > +{ > + ibnd_port_t *port; > + int p = 0; > + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; Ditto. 
> + > + DEBUG("RT: node %p\n", node); > + /* Now, skip chassis based RTs */ > + if (data->group && node->chassis && > + node->chassis->chassisnum) > + return; > + out_ca(node, data->group, NULL); > + for (p = 1; p <= node->info.numports; p++) { > + port = node->ports[p]; > + if (port && port->remoteport) > + out_ca_port(port, data->group); > + } > +} > + > int > -dump_topology(int listtype, int group) > +dump_topology(int group, ibnd_fabric_t *fabric) > { > - Node *node; > - Port *port; > - int i = 0, dist = 0; > + ibnd_node_t *node; > + ibnd_port_t *port; > + int i = 0, p = 0; > time_t t = time(0); > uint64_t chguid; > char *chname = NULL; > + struct iter_user_data iter_user_data; > > - if (!listtype) { > - fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > - fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered); > - fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid); > - } > + fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); > + fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); > + fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", > + fabric->from_node->info.nodeguid, fabric->from_node->info.nodeportguid); I don't really know why this banner was printed for !listtype case only, but probably the idea was that when not full subnet is shown then it is not "topology file". 
> > /* Make pass on switches */ > - if (group && !listtype) { > - ChassisList *ch = NULL; > + if (group) { > + ibnd_chassis_t *ch = NULL; > > /* Chassis based switches first */ > - for (ch = chassis; ch; ch = ch->next) { > + for (ch = fabric->chassis; ch; ch = ch->next) { > int n = 0; > > if (!ch->chassisnum) > continue; > - chguid = out_chassis(ch->chassisnum); > - if (chname) > - free(chname); > + chguid = out_chassis(fabric, ch->chassisnum); > + > chname = NULL; > - if (is_xsigo_guid(chguid)) { > - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > - if (!node->chrecord || > - !node->chrecord->chassisnum) > +/** > + * Will this work for Xsigo? > + */ > + if (ibnd_is_xsigo_guid(chguid)) { > + for (node = ch->nodes; node; > + node = node->next_chassis_node) { > + if (ibnd_is_xsigo_hca(node->info.nodeguid)) { > + chname = node->nodedesc; > + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); > + } > + } > + > +#if 0 > +/** > + * vs. this? > + * I don't want to expose the nodesdist array to the end user. > + */ > + for (node = fabric->nodesdist[MAXHOPS]; node; node = node->dnext) { > + if (!node->chassis || > + !node->chassis->chassisnum) > continue; > > - if (node->chrecord->chassisnum != ch->chassisnum) > + if (node->chassis->chassisnum != ch->chassisnum) > continue; > > - if (is_xsigo_hca(node->nodeguid)) { > - chname = remap_node_name(node_name_map, > - node->nodeguid, > - node->nodedesc); > - fprintf(f, "Hostname: %s\n", chname); > + if (ibnd_is_xsigo_hca(node->nodeguid)) { > + chname = node->nodedesc; > + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); > } > } > +#endif Should we keep #if 0 sections? 
> } > > fprintf(f, "\n# Spine Nodes"); > - for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { > + for (n = 1; n <= SPINES_MAX_NUM; n++) { > if (ch->spinenode[n]) { > out_switch(ch->spinenode[n], group, chname); > - for (port = ch->spinenode[n]->ports; port; port = port->next, i++) > - if (port->remoteport) > + for (p = 1; p <= ch->spinenode[n]->info.numports; p++) { > + port = ch->spinenode[n]->ports[p]; > + if (port && port->remoteport) > out_switch_port(port, group); > + } > } > } > fprintf(f, "\n# Line Nodes"); > - for (n = 1; n <= (LINES_MAX_NUM+1); n++) { > + for (n = 1; n <= LINES_MAX_NUM; n++) { > if (ch->linenode[n]) { > out_switch(ch->linenode[n], group, chname); > - for (port = ch->linenode[n]->ports; port; port = port->next, i++) > - if (port->remoteport) > + for (p = 1; p <= ch->linenode[n]->info.numports; p++) { > + port = ch->linenode[n]->ports[p]; > + if (port && port->remoteport) > out_switch_port(port, group); > + } > } > } > > fprintf(f, "\n# Chassis Switches"); > - for (dist = 0; dist <= maxhops_discovered; dist++) { > - > - for (node = nodesdist[dist]; node; node = node->dnext) { > - > - /* Non Voltaire chassis */ > - if (node->vendid == VTR_VENDOR_ID) > - continue; > - if (!node->chrecord || > - !node->chrecord->chassisnum) > - continue; > - > - if (node->chrecord->chassisnum != ch->chassisnum) > - continue; > - > + for (node = ch->nodes; node; > + node = node->next_chassis_node) { > + if (node->info.type == IBND_SWITCH_NODE) { > out_switch(node, group, chname); > - for (port = node->ports; port; port = port->next, i++) > - if (port->remoteport) > + for (p = 1; p <= node->info.numports; p++) { > + port = node->ports[p]; > + if (port && port->remoteport) > out_switch_port(port, group); > - > + } > } > - > } > > fprintf(f, "\n# Chassis CAs"); > - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { > - if (!node->chrecord || > - !node->chrecord->chassisnum) > - continue; > - > - if (node->chrecord->chassisnum != ch->chassisnum) > - continue; > - 
> - out_ca(node, group, chname); > - for (port = node->ports; port; port = port->next, i++) > - if (port->remoteport) > - out_ca_port(port, group); > - > + for (node = ch->nodes; node; > + node = node->next_chassis_node) { > + if (node->info.type == IBND_CA_NODE) { > + out_ca(node, group, chname); > + for (p = 1; p <= node->info.numports; p++) { > + port = node->ports[p]; > + if (port && port->remoteport) > + out_ca_port(port, group); > + } > + } > } > > } > > - } else { > - for (dist = 0; dist <= maxhops_discovered; dist++) { > + } else { /* !group */ > + iter_user_data.group = group; > + iter_user_data.skip_chassis_nodes = 0; > > - for (node = nodesdist[dist]; node; node = node->dnext) { > - > - DEBUG("SWITCH: dist %d node %p", dist, node); > - if (!listtype) > - out_switch(node, group, chname); > - else { > - if (listtype & LIST_SWITCH_NODE) > - list_node(node); > - continue; > - } > - > - for (port = node->ports; port; port = port->next, i++) > - if (port->remoteport) > - out_switch_port(port, group); > - } > - } > + ibnd_iter_nodes_type(fabric, switch_iter_func, > + IBND_SWITCH_NODE, &iter_user_data); > } > > - if (chname) > - free(chname); > chname = NULL; > - if (group && !listtype) { > + if (group) { > + iter_user_data.group = group; > + iter_user_data.skip_chassis_nodes = 1; > > fprintf(f, "\nNon-Chassis Nodes\n"); > - > - for (dist = 0; dist <= maxhops_discovered; dist++) { > - > - for (node = nodesdist[dist]; node; node = node->dnext) { > - > - DEBUG("SWITCH: dist %d node %p", dist, node); > - /* Now, skip chassis based switches */ > - if (node->chrecord && > - node->chrecord->chassisnum) > - continue; > - out_switch(node, group, chname); > - > - for (port = node->ports; port; port = port->next, i++) > - if (port->remoteport) > - out_switch_port(port, group); > - } > - > - } > + ibnd_iter_nodes_type(fabric, switch_iter_func, > + IBND_SWITCH_NODE, &iter_user_data); > > } > > - /* Make pass on CAs */ > - for (node = nodesdist[MAXHOPS]; node; node = 
node->dnext) { > + iter_user_data.group = group; > + iter_user_data.skip_chassis_nodes = 0; > > - DEBUG("CA: dist %d node %p", dist, node); > - if (!listtype) { > - /* Now, skip chassis based CAs */ > - if (group && node->chrecord && > - node->chrecord->chassisnum) > - continue; > - out_ca(node, group, chname); > - } else { > - if (((listtype & LIST_CA_NODE) && (node->type == CA_NODE)) || > - ((listtype & LIST_ROUTER_NODE) && (node->type == ROUTER_NODE))) > - list_node(node); > - continue; > - } > - > - for (port = node->ports; port; port = port->next, i++) > - if (port->remoteport) > - out_ca_port(port, group); > - } > + /* Make pass on CAs */ > + ibnd_iter_nodes_type(fabric, ca_iter_func, IBND_CA_NODE, > + &iter_user_data); > > - if (chname) > - free(chname); > + /* make pass on routers */ > + ibnd_iter_nodes_type(fabric, router_iter_func, IBND_ROUTER_NODE, > + &iter_user_data); > > return i; > } > > -void dump_ports_report () > + > +void dump_ports_report (ibnd_node_t *node, void *user_data) > { > - int b, n = 0, p; > - Node *node; > - Port *port; > - > - // If switch and LID == 0, search of other switch ports with > - // valid LID and assign it to all ports of that switch > - for (b = 0; b <= MAXHOPS; b++) > - for (node = nodesdist[b]; node; node = node->dnext) > - if (node->type == SWITCH_NODE) { > - int swlid = 0; > - for (p = 0, port = node->ports; > - p < node->numports && port && !swlid; > - port = port->next) > - if (port->lid != 0) > - swlid = port->lid; > - for (p = 0, port = node->ports; > - p < node->numports && port; > - port = port->next) > - port->lid = swlid; > - } > + int p = 0; > + ibnd_port_t *port = NULL; > + > + /* for each port */ > + for (p = node->info.numports, port = node->ports[p]; > + p > 0; > + port = node->ports[--p]) { > + if (port == NULL) > + continue; > > - for (b = 0; b <= MAXHOPS; b++) > - for (node = nodesdist[b]; node; node = node->dnext) { > - for (p = 0, port = node->ports; > - p < node->numports && port; > - p++, port = 
port->next) { > - fprintf(stdout, > - "%2s %5d %2d 0x%016" PRIx64 " %s %s", > - node_type_str2(port->node), port->lid, > - port->portnum, > - port->portguid, > - get_linkwidth_str(port->linkwidth), > - get_linkspeed_str(port->linkspeed)); > - if (port->remoteport) > - fprintf(stdout, > - " - %2s %5d %2d 0x%016" PRIx64 > - " ( '%s' - '%s' )\n", > - node_type_str2(port->remoteport->node), > - port->remoteport->lid, > - port->remoteport->portnum, > - port->remoteport->portguid, > - port->node->nodedesc, > - port->remoteport->node->nodedesc); > - else > - fprintf(stdout, "%36s'%s'\n", "", > - port->node->nodedesc); > - } > - n++; > - } > + fprintf(stdout, > + "%2s %5d %2d 0x%016" PRIx64 " %s %s", > + ibnd_node_type_str_short(node), > + node->info.type == IBND_SWITCH_NODE ? > + node->smalid : port->info.base_lid, > + port->portnum, > + port->guid, > + ibnd_linkwidth_str(port->info.link_width_active), > + ibnd_linkspeed_str(port->info.link_speed_active, 0)); > + if (port->remoteport) > + fprintf(stdout, > + " - %2s %5d %2d 0x%016" PRIx64 > + " ( '%s' - '%s' )\n", > + ibnd_node_type_str_short(port->remoteport->node), > + port->remoteport->node->info.type == IBND_SWITCH_NODE ? 
> + port->remoteport->node->smalid : > + port->remoteport->info.base_lid, > + port->remoteport->portnum, > + port->remoteport->guid, > + port->node->nodedesc, > + port->remoteport->node->nodedesc); > + else > + fprintf(stdout, "%36s'%s'\n", "", > + port->node->nodedesc); > + } > } > > void > usage(void) > { > - fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port " > + fprintf(stderr, "Usage: %s [-d(ebug)] -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port " > "-t(imeout) timeout_ms --node-name-map node-name-map] -p(orts) []\n", > argv0); > fprintf(stderr, " --node-name-map specify a node name map file\n"); > @@ -932,20 +571,18 @@ usage(void) > int > main(int argc, char **argv) > { > - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; > - ib_portid_t my_portid = {0}; > - int udebug = 0, list = 0; > + int list = 0; > char *ca = 0; > int ca_port = 0; > int group = 0; > int ports_report = 0; > + ibnd_fabric_t *fabric = NULL; > > static char const str_opts[] = "C:P:t:devslgHSRpVhu"; > static const struct option long_opts[] = { > { "C", 1, 0, 'C'}, > { "P", 1, 0, 'P'}, > { "debug", 0, 0, 'd'}, > - { "err_show", 0, 0, 'e'}, > { "verbose", 0, 0, 'v'}, > { "show", 0, 0, 's'}, > { "list", 0, 0, 'l'}, > @@ -981,23 +618,17 @@ main(int argc, char **argv) > ca_port = strtoul(optarg, 0, 0); > break; > case 'd': > - ibdebug++; > - madrpc_show_errors(1); > - umad_debug(udebug); > - udebug++; > + debug = 1; > + ibnd_debug(1); > break; > case 't': > - timeout = strtoul(optarg, 0, 0); > + timeout_ms = strtoul(optarg, 0, 0); > break; > case 'v': > verbose++; > - dumplevel++; > break; > case 's': > - dumplevel = 1; > - break; > - case 'e': > - madrpc_show_errors(1); > + ibnd_show_progress(1); It is not the same madrpc_show_errors() and ibnd_show_progress(), right? 
> break; > case 'l': > list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; > @@ -1006,13 +637,13 @@ main(int argc, char **argv) > group = 1; > break; > case 'S': > - list = LIST_SWITCH_NODE; > + list |= LIST_SWITCH_NODE; > break; > case 'H': > - list = LIST_CA_NODE; > + list |= LIST_CA_NODE; > break; > case 'R': > - list = LIST_ROUTER_NODE; > + list |= LIST_ROUTER_NODE; > break; > case 'V': > fprintf(stderr, "%s %s\n", argv0, get_build_version() ); > @@ -1029,22 +660,25 @@ main(int argc, char **argv) > argv += optind; > > if (argc && !(f = fopen(argv[0], "w"))) > - IBERROR("can't open file %s for writing", argv[0]); > + fprintf(stderr, "can't open file %s for writing", argv[0]); > > - madrpc_init(ca, ca_port, mgmt_classes, 2); > node_name_map = open_node_name_map(node_name_map_file); > > - if (discover(&my_portid) < 0) > - IBERROR("discover"); > - > - if (group) > - chassis = group_nodes(); > + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { > + fprintf(stderr, "discover failed\n"); > + exit(1); > + } > > if (ports_report) > - dump_ports_report(); > + ibnd_iter_nodes(fabric, > + dump_ports_report, > + NULL); > + else if (list) > + list_nodes(fabric, list); > else > - dump_topology(list, group); > + dump_topology(group, fabric); > > + ibnd_destroy_fabric(fabric); > close_node_name_map(node_name_map); > exit(0); > } > -- > 1.5.4.5 > Sasha From sashak at voltaire.com Wed Jan 21 10:36:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jan 2009 20:36:42 +0200 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity. 
In-Reply-To: References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com> <20090120174313.GC28955@sashak.voltaire.com> <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com> <20090120190111.GE28955@sashak.voltaire.com> <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com> <20090120192413.GA2037@sashak.voltaire.com> Message-ID: <20090121183642.GJ3479@sashak.voltaire.com> On 07:51 Wed 21 Jan , Sean Hefty wrote: > > I think all of this is overkill. Is it really that difficult to ensure that new > fields are placed into the array correctly, especially compared to the cost, > say, of ensuring that the field offsets are correct or maintaining an additional > pre-processing script? The maintenance troubles start right away: I will need to ensure that all fields in the array under discussion are in the proper order even before the patch. There was never such a requirement before, and there are more than 200 fields. :( But my concern is even more general: if avoiding C99 initializers for arrays and structures becomes a general requirement, I will not be happy. What is the point of having good development tools with all those nice and useful things if we cannot use them? And finally, I'm not talking about exotic extensions; it is standard stuff... BTW, did somebody try to raise this issue in WinOF? What is the reason for being tied to VC? > We're reducing the broader maintenance costs overall by > having a single code base. > > Is this the only open issue in integrating the ib-mgmt WinOF support changes? Yes, as far as I can see after a brief review. I will likely apply patches 1 and 3 later today. Sasha From sashak at voltaire.com Wed Jan 21 10:52:59 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 21 Jan 2009 20:52:59 +0200 Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
Message-ID: <20090121185249.GK3479@sashak.voltaire.com>

On 11:50 Sat 17 Jan , Arlin Davis wrote:
>
> [PATCH 1/3] libibmad: add os dependent definitions.
>
> infiniband/mad_osd.h added to provide support for os specific defintions
> for portability. With these changes, WinOF can pull directly from OFED
> git tree and share a common code base with minimal changes to mad.h and
> source tree.
>
> mad.h modifications include MAD_EXPORT for export declarations where
> appropriate. Datatype llu changed to ULL for 64bit constants.
>
> makefile.am modified to include new linux version of mad_osd.h
>
> Signed-off-by: Arlin Davis

Applied (patch 1/3) after rebasing. Thanks.

Sasha

From sumeet.lahorani at oracle.com Wed Jan 21 10:57:05 2009
From: sumeet.lahorani at oracle.com (Sumeet Lahorani)
Date: Wed, 21 Jan 2009 10:57:05 -0800
Subject: [ofa-general] Does ib0 always map to port1?
Message-ID: <49777001.3050301@oracle.com>

Hi,

We are using dual ported Mellanox Technologies MT25418 [ConnectX IB DDR]
HCAs & OFED 1.3.1.

I see that ib0 always maps to port1 and ib1 always maps to port2 on the
HCA. I'm trying to find out if this will always be the case and if so
which script ensures this mapping?

- Sumeet

From sean.hefty at intel.com Wed Jan 21 11:02:21 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Wed, 21 Jan 2009 11:02:21 -0800
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To: <20090121183642.GJ3479@sashak.voltaire.com>
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
 <20090120174313.GC28955@sashak.voltaire.com>
 <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com>
 <20090120190111.GE28955@sashak.voltaire.com>
 <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com>
 <20090120192413.GA2037@sashak.voltaire.com>
 <20090121183642.GJ3479@sashak.voltaire.com>
Message-ID:

>For maintenance troubles start I will need to ensure that all fields in
>discussed array are placed in proper order even before the patch - there
>was never such requirement before and there are more than 200 fields. :(

Arlin did this work and it is included in his patch.

>But my concern is even more general - if not using c99 initializations
>for arrays and structures will become a general requirement I will not
>happy - what is the point to have a good development tools with all those
>nice and useful things if we cannot use them? And finally I'm not about
>exotic extensions, it is standard stuff...
>
>BTW, did somebody try to raise this issue in WinOF? What is the reason
>to be closed with VC?

WinOF was copied on these mailings. So far there has been zero support for
changing compilers. (In fact, I've had to push back on branching the code.)
Windows drivers must be built using the WDK build environment, so WinOF has
adopted that build environment for all of their libraries and applications as
well.
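
[Editor's note] The ordering concern in this thread can be illustrated with a small sketch. It is not code from libibmad: the enum, the struct layout and the helper table_in_order() are hypothetical, and only the two bit offsets (160 and 192) are taken from the fields.c patch posted later in this archive. With C99 designated initializers the index is written next to each entry, so source order is free. With positional initializers the entries must follow the enum order, which can at least be checked at run time if each entry carries its own id.

```c
#include <stddef.h>

/* Hypothetical field ids; the real list has more than 200 entries. */
enum field_id { F_MODIFIER, F_MKEY, F_LAST };

struct field {
	int id;               /* hypothetical: repeats the enum value */
	unsigned bitoffs;
	unsigned bitlen;
	const char *name;
};

/* C99: each entry names its slot, so the order in the source is free. */
static const struct field tbl_c99[] = {
	[F_MKEY]     = { F_MKEY,     192, 64, "MadMkey" },
	[F_MODIFIER] = { F_MODIFIER, 160, 32, "MadModifier" },
};

/* Positional: entries must appear in exactly the enum order. */
static const struct field tbl_c89[] = {
	{ F_MODIFIER, 160, 32, "MadModifier" },
	{ F_MKEY,     192, 64, "MadMkey" },
};

/* Returns 1 when the table has F_LAST entries and entry i has id i. */
static int table_in_order(const struct field *t, size_t n)
{
	size_t i;

	if (n != F_LAST)
		return 0;
	for (i = 0; i < n; i++)
		if (t[i].id != (int)i)
			return 0;
	return 1;
}
```

A check like table_in_order(tbl_c89, sizeof(tbl_c89) / sizeof(tbl_c89[0])) at start-up or in a unit test is one possible way to catch a misplaced entry when designated initializers are not available.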
- Sean

From rdreier at cisco.com Wed Jan 21 14:06:14 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 21 Jan 2009 14:06:14 -0800
Subject: [ofa-general] Re: [PATCH] mlx4_ib: Optimize hugetlab pages support
In-Reply-To: <20090120180406.GA9991@mtls03> (Eli Cohen's message of "Tue, 20 Jan 2009 20:04:06 +0200")
References: <20090120180406.GA9991@mtls03>
Message-ID:

> Since Linux does not merge adjacent pages into a single scatter entry through
> calls to dma_map_sg(), we do this for huge pages which are guaranteed to be
> comprised of adjacent natural pages of size PAGE_SIZE. This will result in a
> significantly lower number of MTT segments used for registering hugetlb memory
> regions.

Actually on some platforms, adjacent pages will be put into a single sg
entry (eg look at VMERGE under arch/powerpc). So your
mlx4_ib_umem_write_huge_mtt() function won't work in such cases. In any
case relying on such a fragile implementation detail with no checking is
not a good idea for future maintainability.
So this needs to be reimplemented to work no matter how the DMA mapping
layer presents the hugetlb region (and also fail gracefully if the bus
addresses returned from DMA mapping don't work with the huge page size
-- since it would be correct for the DMA mapping layer to map every
PAGE_SIZE chunk to an arbitrary bus address, eg if the IOMMU address
space becomes very fragmented)

> @@ -142,15 +176,34 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
>  
>  	n = ib_umem_page_count(mr->umem);
>  	shift = ilog2(mr->umem->page_size);
> -
> -	err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length,
> +	if (mr->umem->hugetlb) {
> +		nhuge = ALIGN(n << shift, HPAGE_SIZE) >> HPAGE_SHIFT;
> +		shift_huge = HPAGE_SHIFT;
> +		err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length,
> +				    convert_access(access_flags), nhuge, shift_huge, &mr->mmr);
> +		if (err)
> +			goto err_umem;
> +
> +		err = mlx4_ib_umem_write_huge_mtt(dev, &mr->mmr.mtt, mr->umem, nhuge);
> +		if (err) {
> +			if (err != -EAGAIN)
> +				goto err_mr;
> +			else {
> +				mlx4_mr_free(to_mdev(pd->device)->dev, &mr->mmr);
> +				goto regular_pages;
> +			}
> +		}
> +	} else {
> +regular_pages:
> +		err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length,
>  			    convert_access(access_flags), n, shift, &mr->mmr);
> -	if (err)
> -		goto err_umem;
> +		if (err)
> +			goto err_umem;
>  
> -	err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem);
> -	if (err)
> -		goto err_mr;
> +		err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem);
> +		if (err)
> +			goto err_mr;
> +	}
>  
>  	err = mlx4_mr_enable(dev->dev, &mr->mmr);
>  	if (err)

Also I think this could be made much clearer by putting the huge page
handling code in a separate function like mlx4_ib_reg_hugetlb_user_mr()
and then you could just do something like

	if (!mr->umem->hugetlb ||
	    mlx4_ib_reg_hugetlb_user_mr(...)) {
		// usual code path
	}

rather than having to use a goto into another block.

 - R.
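
[Editor's note] The control flow suggested above can be sketched in a self-contained form. Everything below is a stand-in and not mlx4 code: struct region, reg_hugetlb(), reg_regular() and reg_user_mr() are hypothetical names, and the sketch assumes, as the one-line suggestion does, that any nonzero return from the huge-page helper means "fall back to the usual path". A real helper would also have to release whatever it allocated before returning nonzero.

```c
/* Hypothetical stand-in for the memory region being registered. */
struct region {
	int hugetlb;        /* region is backed by huge pages */
	int huge_ok;        /* pretend outcome of the huge-page attempt */
	const char *path;   /* records which path registered the region */
};

/* Returns 0 on success, nonzero when the caller should fall back. */
static int reg_hugetlb(struct region *r)
{
	if (!r->huge_ok)
		return -1;  /* e.g. bus addresses do not fit the huge page size */
	r->path = "huge";
	return 0;
}

/* The usual PAGE_SIZE based registration. */
static int reg_regular(struct region *r)
{
	r->path = "regular";
	return 0;
}

static int reg_user_mr(struct region *r)
{
	/*
	 * Short-circuit evaluation: the helper runs only for hugetlb
	 * regions, and a nonzero return drops into the usual path
	 * without any goto into another block.
	 */
	if (!r->hugetlb || reg_hugetlb(r))
		return reg_regular(r);
	return 0;
}
```

With this shape a non-hugetlb region and a hugetlb region whose huge-page attempt fails both end up in reg_regular(), while a successful huge-page attempt skips it.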
From sashak at voltaire.com Wed Jan 21 16:40:49 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 Jan 2009 02:40:49 +0200
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To:
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
 <20090120174313.GC28955@sashak.voltaire.com>
 <0C39159226AA434C87AA8A7C73C7272E@amr.corp.intel.com>
 <20090120190111.GE28955@sashak.voltaire.com>
 <7989935C4AA44367A115F0776F67CF17@amr.corp.intel.com>
 <20090120192413.GA2037@sashak.voltaire.com>
 <20090121183642.GJ3479@sashak.voltaire.com>
Message-ID: <20090122004049.GL3479@sashak.voltaire.com>

On 11:02 Wed 21 Jan , Sean Hefty wrote:
> >For maintenance troubles start I will need to ensure that all fields in
> >discussed array are placed in proper order even before the patch - there
> >was never such requirement before and there are more than 200 fields. :(
>
> Arlin did this work and it is included in his patch.

Yes. I see already.

> WinOF was copied on these mailings. So far there has been zero support for
> changing compilers. (In fact, I've had to push back on branching the code.)
> Windows drivers must be built using the WDK build environment, so WinOF has
> adopted that build environment for all of their libraries and applications as
> well.

Drivers can be different story, I'm about user space stuff. I would expect
from WinOF guys to be more interested to have portable environment - I'm
not doing ports from win :) .

Sasha

From sashak at voltaire.com Wed Jan 21 16:54:16 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 Jan 2009 02:54:16 +0200
Subject: [ofa-general] [PATCH 1/3 v2] libibmad. 2nd revision for WinOF portablity.
In-Reply-To:
References: <000901c978dc$dfe53ad0$ce97070a@amr.corp.intel.com>
 <20090120174313.GC28955@sashak.voltaire.com>
Message-ID: <20090122005416.GM3479@sashak.voltaire.com>

On 11:02 Tue 20 Jan , Davis, Arlin R wrote:
>
> Like Sean said, until WinOF changes the build enviroment we have
> no choice. The majority of the changes went into field.c and given
> the structure includes a character field with the mad field name
> I would think maintainability is preserved. I added your recent
> PortXmitWait and CounterSelect2 changes with no problem.

Ok. This is really minor in this specific case (once ib_mad_f array is
properly ordered). I'm going to apply three patches.

But for future (bigger components - infiniband-diags, OpenSM, etc.),
please, please, please, evaluate in WinOF possibility to use portable
tools - IMO it should be a reasonable price for portability (needless to
say that it save a lot of time and work from WinOF guys at first place).

Sasha

From sashak at voltaire.com Wed Jan 21 16:54:51 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 Jan 2009 02:54:51 +0200
Subject: [ofa-general] [PATCH 2/3 v2] field.c, remove c99 definitions, better portability with WinOF.
In-Reply-To:
References:
Message-ID: <20090122005451.GN3479@sashak.voltaire.com>

On 11:51 Sat 17 Jan , Davis, Arlin R wrote:
>
> - Remove c99 definitions in the ib_mad_f structure.
> - Remove unnecessary include file
> - _mad_dump: remove c99 structure initialization.
>
> Signed-off-by: Arlin Davis

Applied. Thanks.

Sasha

From rdreier at cisco.com Wed Jan 21 16:51:33 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 21 Jan 2009 16:51:33 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/nes: Improved use of pbls
In-Reply-To: <20090121171108.GA672@ctung-MOBL> (Chien Tung's message of "Wed, 21 Jan 2009 11:11:08 -0600")
References: <20090121171108.GA672@ctung-MOBL>
Message-ID:

> Also, fixed two places where the software pbl counts were changed
> before the hardware was updated.
This bug allowed another thread
> to overallocate the hardware resources.

This patch is big enough that I think it needs to be deferred to 2.6.30
given where 2.6.29 is in the release cycle. However if this problem
causes problems in practice maybe we should split it out and get it into
2.6.29?

 - R.

From sashak at voltaire.com Wed Jan 21 16:57:28 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 Jan 2009 02:57:28 +0200
Subject: [ofa-general] [PATCH 3/3 v2] Minor changes to allow portability to WinOF
In-Reply-To:
References:
Message-ID: <20090122005721.GO3479@sashak.voltaire.com>

On 11:51 Sat 17 Jan , Davis, Arlin R wrote:
>
> - cleanup unnecessary include files
> - ULL declaration on 64 bit constants
> - cast or change data types to fix build warnings on windows
>
> Signed-off-by: Arlin Davis

Applied (with minor changes rebasing and minor modifications in header
files inclusion lists). Thanks.

Sasha

From sashak at voltaire.com Wed Jan 21 17:24:46 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 22 Jan 2009 03:24:46 +0200
Subject: [ofa-general] [PATCH] libibmad: use mad_set_field64() for mkey encoding
In-Reply-To:
References:
Message-ID: <20090122012446.GP3479@sashak.voltaire.com>

Use mad_set_field64() for 64-bit mkey encoding.
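
[Editor's note] Why a single 64-bit store can replace the two 32-bit stores may be easier to see in a stand-alone sketch. The helpers put_be32(), put_be64() and put_mkey_split() below are hypothetical and are not the libibmad functions; the sketch only assumes that the key is stored in network (big-endian) byte order, high word first.

```c
#include <stdint.h>

/* Store a 32-bit value in big-endian order. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Store a 64-bit value in big-endian order. */
static void put_be64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Two-step form: high word at byte 0, low word at byte 4. */
static void put_mkey_split(uint8_t *p, uint64_t mkey)
{
	put_be32(p, (uint32_t)(mkey >> 32));
	put_be32(p + 4, (uint32_t)(mkey & 0xffffffff));
}
```

Under that assumption both forms write the same eight bytes, so the one-call version is a simplification and not a change in wire format.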
Signed-off-by: Sasha Khapyorsky
---
 libibmad/src/mad.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c
index d059a70..3f04da0 100644
--- a/libibmad/src/mad.c
+++ b/libibmad/src/mad.c
@@ -96,9 +96,7 @@ void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data)
 	mad_set_field(buf, 0, IB_MAD_ATTRMOD_F, rpc->attr.mod);
 
 	/* words 7,8 */
-	mad_set_field(buf, 0, IB_MAD_MKEY_F, (uint32_t) (rpc->mkey >> 32));
-	mad_set_field(buf, 4, IB_MAD_MKEY_F,
-		      (uint32_t) (rpc->mkey & 0xffffffff));
+	mad_set_field64(buf, 0, IB_MAD_MKEY_F, rpc->mkey);
 
 	if (rpc->mgtclass == IB_SMI_DIRECT_CLASS) {
 		/* word 9 */
-- 
1.6.0.4.766.g6fc4a

From ogerlitz at Voltaire.com Wed Jan 21 23:45:13 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Thu, 22 Jan 2009 09:45:13 +0200
Subject: [ofa-general] Does ib0 always map to port1?
In-Reply-To: <49777001.3050301@oracle.com>
References: <49777001.3050301@oracle.com>
Message-ID: <49782409.1080208@Voltaire.com>

Sumeet Lahorani wrote:
> I see that ib0 always maps to port1 and ib1 always maps to port2 on
> the HCA. I'm trying to find out if this will always be the case and if
> so which script ensures this mapping?

Yes, on a dual ported HCA, ib0 maps to port1 and ib1 to port2, this is a
property of the ipoib driver regardless of which HW is used, see
drivers/infiniband/ulps/ipoib/ipoib_main :: ipoib_add_one()

Or.

From dotanba at gmail.com Thu Jan 22 00:18:03 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Thu, 22 Jan 2009 10:18:03 +0200
Subject: [ofa-general] Does ib0 always map to port1?
In-Reply-To: <49782409.1080208@Voltaire.com>
References: <49777001.3050301@oracle.com> <49782409.1080208@Voltaire.com>
Message-ID: <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com>

On Thu, Jan 22, 2009 at 9:45 AM, Or Gerlitz wrote:
> Sumeet Lahorani wrote:
>>
>> I see that ib0 always maps to port1 and ib1 always maps to port2 on the
>> HCA.
I'm trying to find out if this will always be the case and if so which >> script ensures this mapping? > > Yes, on a dual ported HCA, ib0 maps to port1 and ib1 to port2, this is a > property of the ipoib driver regardless of which HW is used, see > drivers/infiniband/ulps/ipoib/ipoib_main :: ipoib_add_one() > Part of the "MAC" address of the I/F is the port GUID, so you can use this value to determine which HCA.port is mapped to that I/Fs... (if there are several HCAs in the same host, this can be very useful ...) Dotan From sashak at voltaire.com Thu Jan 22 00:39:46 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 10:39:46 +0200 Subject: [ofa-general] [PATCH 1/5] opensm/osm_opensm.[ch] make setup and destroy routing engines functions global. In-Reply-To: <4975D964.4020901@gmail.com> References: <4975D824.6020607@gmail.com> <4975D964.4020901@gmail.com> Message-ID: <20090122083930.GQ3479@sashak.voltaire.com> Hi Eli, On 16:02 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: > make setup and destroy routing engines functions global. > change setup_routing_engines() and destroy_routing_engines() > declaration How is it related to configuration update? Is it? I cannot see where it is used, if so why to make it global? > > Signed-off-by: Eli Dorfman > --- > opensm/include/opensm/osm_opensm.h | 53 ++++++++++++++++++++++++++++++++++++ > opensm/opensm/osm_opensm.c | 5 ++- > 2 files changed, 56 insertions(+), 2 deletions(-) > > diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h > index c121be4..5b0a1dd 100644 > --- a/opensm/include/opensm/osm_opensm.h > +++ b/opensm/include/opensm/osm_opensm.h > @@ -458,6 +458,59 @@ osm_opensm_wait_for_subnet_up(IN osm_opensm_t * const p_osm, > * SEE ALSO > *********/ > > +/****f* OpenSM: OpenSM/setup_routing_engines > +* NAME > +* setup_routing_engines > +* > +* DESCRIPTION > +* This function constructs an routing engines. 
> +* > +* SYNOPSIS > +*/ > +void setup_routing_engines(osm_opensm_t *osm, const char *name); For public function names we are using 'osm_' prefix. > +/* > +* PARAMETERS > +* p_osm > +* [in] Pointer to a OpenSM object to construct. > +* > +* name > +* [in] Routing engine names. > +* > +* RETURN VALUE > +* This function does not return a value. > +* > +* NOTES > +* Setup of routing engines > +* > +* SEE ALSO > +* destroy_routing_engines > +*********/ > + > +/****f* OpenSM: OpenSM/destroy_routing_engines > +* NAME > +* destroy_routing_engines > +* > +* DESCRIPTION > +* This function constructs an routing engines. > +* > +* SYNOPSIS > +*/ > +void destroy_routing_engines(osm_opensm_t *osm); Ditto. Sasha > +/* > +* PARAMETERS > +* p_osm > +* [in] Pointer to a OpenSM object to construct. > +* > +* RETURN VALUE > +* This function does not return a value. > +* > +* NOTES > +* Setup of routing engines > +* > +* SEE ALSO > +* setup_routing_engines > +*********/ > + > /****f* OpenSM: OpenSM/osm_routing_engine_type_str > * NAME > * osm_routing_engine_type_str > diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c > index 7de2e5b..8ecb942 100644 > --- a/opensm/opensm/osm_opensm.c > +++ b/opensm/opensm/osm_opensm.c > @@ -186,7 +186,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) > "cannot find or setup routing engine \'%s\'", name); > } > > -static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) > +void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) > { > char *name, *str, *p; > > @@ -224,7 +224,7 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) > > /********************************************************************** > **********************************************************************/ > -static void destroy_routing_engines(osm_opensm_t *osm) > +void destroy_routing_engines(osm_opensm_t *osm) > { > struct osm_routing_engine *r, *next; > > @@ -236,6 +236,7 @@ static void 
destroy_routing_engines(osm_opensm_t *osm) > r->delete(r->context); > free(r); > } > + osm->routing_engine_list = NULL; > } > > /********************************************************************** > -- > 1.5.5 > From sashak at voltaire.com Thu Jan 22 00:44:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 10:44:33 +0200 Subject: [ofa-general] [PATCH 0/5] subnet configuration update In-Reply-To: <4975D824.6020607@gmail.com> References: <4975D824.6020607@gmail.com> Message-ID: <20090122084433.GR3479@sashak.voltaire.com> On 15:56 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: > The following patches are handling subnet configuration update. > Subnet configuration parameters are rescanned every heavy sweep and if possible are updated. Patches 3 and 4 doesn't compile. Patch 5 doesn't apply and after fixing doesn't compile too (don't resend yet, I want to look at this first). Sasha From sashak at voltaire.com Thu Jan 22 01:30:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 11:30:34 +0200 Subject: [ofa-general] [PATCH] libibmad/field.c: fix MAD MKey offset In-Reply-To: <20090122012446.GP3479@sashak.voltaire.com> References: <20090122012446.GP3479@sashak.voltaire.com> Message-ID: <20090122093034.GS3479@sashak.voltaire.com> MAD MKey should be located at offset of 24 bytes which is 192 bits (not 196). 
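
[Editor's note] The arithmetic behind the corrected offset can be written out as a small check. The table and helper names below are hypothetical and only mirror the {bit offset, bit length, name} shape of the entries in this patch; the numbers 160, 32, 192 and 64 are the ones that appear in it.

```c
/* Hypothetical mini field table: bit offset, bit length, name. */
struct field {
	unsigned bitoffs;
	unsigned bitlen;
	const char *name;
};

static const struct field tbl[] = {
	{ 160, 32, "MadModifier" },  /* bytes 20-23 */
	{ 192, 64, "MadMkey" },      /* bytes 24-31 */
};

/* Bit offset of a field that starts at the given byte. */
static unsigned byte_to_bit(unsigned byte)
{
	return byte * 8;
}

/* A field starts on a byte boundary when its bit offset is a multiple of 8. */
static int byte_aligned(unsigned bitoffs)
{
	return bitoffs % 8 == 0;
}

/* Two consecutive fields are contiguous when one ends where the next begins. */
static int contiguous(const struct field *a, const struct field *b)
{
	return a->bitoffs + a->bitlen == b->bitoffs;
}
```

Byte 24 gives 24 * 8 = 192, which is also where the 32-bit field at bit 160 ends; 196 leaves a remainder of 4 when divided by 8, so it cannot be the start of a byte-aligned 64-bit field.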
Signed-off-by: Sasha Khapyorsky --- libibmad/src/fields.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 1f2cec5..d5a1eb4 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -87,7 +87,7 @@ static const ib_field_t ib_mad_f[] = { {160, 32, "MadModifier", mad_dump_hex}, /* TODO: add dumper */ /* word 7,8 (24-31 bytes) */ - {196, 64, "MadMkey", mad_dump_hex}, + {192, 64, "MadMkey", mad_dump_hex}, /* word 9 (32-37 bytes) */ {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Thu Jan 22 02:23:43 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 12:23:43 +0200 Subject: [ofa-general] [PATCH 3/5] opensm/osm_subnet.h put qos options flat below subnet opt In-Reply-To: <4975D9E7.4070604@gmail.com> References: <4975D824.6020607@gmail.com> <4975D9E7.4070604@gmail.com> Message-ID: <20090122102343.GT3479@sashak.voltaire.com> Hi Eli, On 16:04 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: > put qos options flat below subnet opt > put all qos option parameters (default, ca, sw, router) flat below subnet opt > > Signed-off-by: Eli Dorfman > --- > opensm/include/opensm/osm_subnet.h | 40 +++++++++++++++++++++++++++--------- > 1 files changed, 30 insertions(+), 10 deletions(-) > > diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index 8863e47..692e449 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -99,11 +99,11 @@ struct osm_qos_policy; > * SYNOPSIS > */ > typedef struct osm_qos_options { > - unsigned max_vls; > - int high_limit; > - char *vlarb_high; > - char *vlarb_low; > - char *sl2vl; > + unsigned qos_max_vls; > + int qos_high_limit; > + char *qos_vlarb_high; > + char *qos_vlarb_low; > + char *qos_sl2vl; > } osm_qos_options_t; > /* > * FIELDS > @@ -199,11 +199,31 @@ typedef struct osm_subn_opt { > boolean_t daemon; > boolean_t sm_inactive; > 
boolean_t babbling_port_policy; > - osm_qos_options_t qos_options; > - osm_qos_options_t qos_ca_options; > - osm_qos_options_t qos_sw0_options; > - osm_qos_options_t qos_swe_options; > - osm_qos_options_t qos_rtr_options; > + unsigned qos_max_vls; > + int qos_high_limit; > + char *qos_vlarb_high; > + char *qos_vlarb_low; > + char *qos_sl2vl; > + unsigned qos_ca_max_vls; > + int qos_ca_high_limit; > + char *qos_ca_vlarb_high; > + char *qos_ca_vlarb_low; > + char *qos_ca_sl2vl; > + unsigned qos_sw0_max_vls; > + int qos_sw0_high_limit; > + char *qos_sw0_vlarb_high; > + char *qos_sw0_vlarb_low; > + char *qos_sw0_sl2vl; > + unsigned qos_swe_max_vls; > + int qos_swe_high_limit; > + char *qos_swe_vlarb_high; > + char *qos_swe_vlarb_low; > + char *qos_swe_sl2vl; > + unsigned qos_rtr_max_vls; > + int qos_rtr_high_limit; > + char *qos_rtr_vlarb_high; > + char *qos_rtr_vlarb_low; > + char *qos_rtr_sl2vl; Looking on patch 5 I think that I understand your motivation. However I'm not sure that it is a good idea - sooner or later we will need to support QoS port parameters setup configurable per port (and not just per port type as now), so it would be desirable to preserve QoS port parameter processing as whole block in general. Also I think you can use something like: { "qos_ca_max_vls", OPT_OFFSET(qos_ca_options.max_vls), ... }, in your array in patch 5 and preserve QoS configuration unchanged. 
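
[Editor's note] The suggestion of pointing OPT_OFFSET at a member of a nested structure can be tried in isolation. The structures below are cut-down, hypothetical versions of the ones in the patch (different type names, only a few members), and opt_field() is an invented lookup helper. The sketch relies on offsetof() accepting a nested member designator such as qos_ca_options.max_vls, which common compilers such as GCC and Clang accept.

```c
#include <stddef.h>
#include <string.h>

/* Cut-down, hypothetical stand-ins for the QoS and subnet option structs. */
typedef struct qos_options {
	unsigned max_vls;
	int high_limit;
} qos_options_t;

typedef struct subn_opt {
	unsigned sweep_interval;
	qos_options_t qos_options;
	qos_options_t qos_ca_options;
} subn_opt_t;

#define OPT_OFFSET(member) offsetof(subn_opt_t, member)

typedef struct opt_rec {
	const char *name;
	size_t field_offset;
} opt_rec_t;

/* The nested blocks stay intact; only the table knows the flat names. */
static const opt_rec_t opt_tbl[] = {
	{ "sweep_interval",    OPT_OFFSET(sweep_interval) },
	{ "qos_max_vls",       OPT_OFFSET(qos_options.max_vls) },
	{ "qos_ca_max_vls",    OPT_OFFSET(qos_ca_options.max_vls) },
	{ "qos_ca_high_limit", OPT_OFFSET(qos_ca_options.high_limit) },
	{ NULL, 0 }
};

/* Return a pointer to the storage of the named option, or NULL. */
static void *opt_field(subn_opt_t *opt, const char *name)
{
	const opt_rec_t *r;

	for (r = opt_tbl; r->name; r++)
		if (strcmp(r->name, name) == 0)
			return (char *)opt + r->field_offset;
	return NULL;
}
```

Writing through opt_field(&opt, "qos_ca_max_vls") then updates opt.qos_ca_options.max_vls, so the per-type QoS blocks could stay grouped while the option table still addresses each parameter by its flat name.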
Sasha > boolean_t enable_quirks; > boolean_t no_clients_rereg; > #ifdef ENABLE_OSM_PERF_MGR > -- > 1.5.5 > From sashak at voltaire.com Thu Jan 22 02:52:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 12:52:31 +0200 Subject: [ofa-general] [PATCH 5/5] opensm/osm_subnet.c support subnet configuration rescan and update In-Reply-To: <4975DA59.6070309@gmail.com> References: <4975D824.6020607@gmail.com> <4975DA59.6070309@gmail.com> Message-ID: <20090122105231.GU3479@sashak.voltaire.com> On 16:06 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: > support subnet configuration rescan and update > subnet configuration parameters are rescanned every heavy sweep. > every parameter is defined with an unpack function to parse its value > from opensm configuration file. > some params require special post update operation to apply them. > every parameter has also a flag that specifies whether it can be updated. Basically I like this idea to optimize configuration options processing. However this patch doesn't apply and doesn't compile. > > Signed-off-by: Eli Dorfman > --- > opensm/opensm/osm_subnet.c | 801 +++++++++++++++++++++----------------------- > 1 files changed, 381 insertions(+), 420 deletions(-) > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index a6db304..fe1bbda 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -71,6 +71,183 @@ > > static const char null_str[] = "(null)"; > > +#define OPT_OFFSET(member) offsetof(osm_subn_opt_t, member) > + > +//typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc); Seems like redundant code. > +typedef void (update_fn_t)(osm_subn_t *p_subn, void *p_val); > +typedef void (unpack_fn_t)(osm_subn_t *p_subn, char *p_key, char *p_val_str, void *p_val, update_fn_t *f); > + > +typedef struct opt_rec { > + char *name; const char *? > + int field_offset; unsigned long (for offsets)? 
> + unpack_fn_t *unpack_fn; > + update_fn_t *update_fn; Probably better names would be 'parse' and 'setup' (I see that it is executed on initial run too) - just nit. > + int can_update; > +} opt_rec_t; > + > +static unpack_fn_t opts_unpack_uint8, opts_unpack_uint16, opts_unpack_net16, opts_unpack_uint32, > + opts_unpack_int32, opts_unpack_net64, opts_unpack_charp, opts_unpack_boolean; > + > +static update_fn_t opts_update_log_max_size, opts_update_sminfo_polling_timeout, > + opts_update_routing_engine, opts_update_sm_priority; > + > +static const opt_rec_t opt_tbl[] = { > + { "guid", OPT_OFFSET(guid), opts_unpack_net64, NULL, 0 }, Trick - you can do something like: #define OPT_DEF(opt) #opt, offsetof(osm_subn_opt_t, opt) , and then: { OPT_DEF(guid), opts_unpack_net64, NULL, 0 }, > + { "m_key", OPT_OFFSET(m_key), opts_unpack_net64, NULL, 1 }, > + { "sm_key", OPT_OFFSET(sm_key), opts_unpack_net64, NULL, 1 }, > + { "sa_key", OPT_OFFSET(sa_key), opts_unpack_net64, NULL, 1 }, > + { "subnet_prefix", OPT_OFFSET(subnet_prefix), opts_unpack_net64, NULL, 1 }, > + { "m_key_lease_period", OPT_OFFSET(m_key_lease_period), opts_unpack_net16, NULL, 1 }, > + { "sweep_interval", OPT_OFFSET(sweep_interval), opts_unpack_uint32, NULL, 1 }, > + { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_unpack_uint32, NULL, 1 }, > + { "console", OPT_OFFSET(console), opts_unpack_charp, NULL, 0 }, > + { "console_port", OPT_OFFSET(console_port), opts_unpack_uint16, NULL, 0 }, > + { "transaction_timeout", OPT_OFFSET(transaction_timeout), opts_unpack_uint32, NULL, 1 }, > + { "max_msg_fifo_timeout", OPT_OFFSET(max_msg_fifo_timeout), opts_unpack_uint32, NULL, 1 }, > + { "sm_priority", OPT_OFFSET(sm_priority), opts_unpack_uint8, opts_update_sm_priority, 1 }, > + { "lmc", OPT_OFFSET(lmc), opts_unpack_uint8, NULL, 1 }, > + { "lmc_esp0", OPT_OFFSET(lmc_esp0), opts_unpack_boolean, NULL, 1 }, > + { "max_op_vls", OPT_OFFSET(max_op_vls), opts_unpack_uint8, NULL, 1 }, > + { "force_link_speed", 
OPT_OFFSET(force_link_speed), opts_unpack_uint8, NULL, 1 }, > + { "reassign_lids", OPT_OFFSET(reassign_lids), opts_unpack_boolean, NULL, 1 }, > + { "ignore_other_sm", OPT_OFFSET(ignore_other_sm), opts_unpack_boolean, NULL, 1 }, > + { "single_thread", OPT_OFFSET(single_thread), opts_unpack_boolean, NULL, 0 }, > + { "disable_multicast", OPT_OFFSET(disable_multicast), opts_unpack_boolean, NULL, 1 }, > + { "force_log_flush", OPT_OFFSET(force_log_flush), opts_unpack_boolean, NULL, 1 }, > + { "subnet_timeout", OPT_OFFSET(subnet_timeout), opts_unpack_uint8, NULL, 1 }, > + { "packet_life_time", OPT_OFFSET(packet_life_time), opts_unpack_uint8, NULL, 1 }, > + { "vl_stall_count", OPT_OFFSET(vl_stall_count), opts_unpack_uint8, NULL, 1 }, > + { "leaf_vl_stall_count", OPT_OFFSET(leaf_vl_stall_count), opts_unpack_uint8, NULL, 1 }, > + { "head_of_queue_lifetime", OPT_OFFSET(head_of_queue_lifetime), opts_unpack_uint8, NULL, 1 }, > + { "leaf_head_of_queue_lifetime", OPT_OFFSET(leaf_head_of_queue_lifetime), opts_unpack_uint8, NULL, 1 }, > + { "local_phy_errors_threshold", OPT_OFFSET(local_phy_errors_threshold), opts_unpack_uint8, NULL, 1 }, > + { "overrun_errors_threshold", OPT_OFFSET(overrun_errors_threshold), opts_unpack_uint8, NULL, 1 }, > + { "sminfo_polling_timeout", OPT_OFFSET(sminfo_polling_timeout), opts_unpack_uint32, opts_update_sminfo_polling_timeout, 1 }, > + { "polling_retry_number", OPT_OFFSET(polling_retry_number), opts_unpack_uint32, NULL, 1 }, > + { "force_heavy_sweep", OPT_OFFSET(force_heavy_sweep), opts_unpack_boolean, NULL, 1 }, > + { "log_flags", OPT_OFFSET(log_flags), opts_unpack_uint8, NULL, 1 }, > + { "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_unpack_charp, NULL, 1 }, > + { "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), opts_unpack_boolean, NULL, 1 }, > + { "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_unpack_boolean, NULL, 1 }, > + { "routing_engine", OPT_OFFSET(routing_engine_names), opts_unpack_charp, 
opts_update_routing_engine, 1 }, > + { "connect_roots", OPT_OFFSET(connect_roots), opts_unpack_boolean, NULL, 1 }, > + { "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_unpack_boolean, NULL, 1 }, > + { "log_file", OPT_OFFSET(log_file), opts_unpack_charp, NULL, 0 }, > + { "log_max_size", OPT_OFFSET(log_max_size), opts_unpack_uint32, opts_update_log_max_size }, > + { "partition_config_file", OPT_OFFSET(partition_config_file), opts_unpack_charp, NULL, 1 }, > + { "no_partition_enforcement", OPT_OFFSET(no_partition_enforcement), opts_unpack_boolean, NULL, 1 }, > + { "qos", OPT_OFFSET(qos), opts_unpack_boolean, NULL, 1 }, > + { "qos_policy_file", OPT_OFFSET(qos_policy_file), opts_unpack_charp, NULL, 1 }, > + { "accum_log_file", OPT_OFFSET(accum_log_file), opts_unpack_boolean, NULL, 1 }, > + { "dump_files_dir", OPT_OFFSET(dump_files_dir), opts_unpack_charp, NULL, 1 }, > + { "lid_matrix_dump_file", OPT_OFFSET(lid_matrix_dump_file), opts_unpack_charp, NULL, 1 }, > + { "lfts_file", OPT_OFFSET(lfts_file), opts_unpack_charp, NULL, 1 }, > + { "root_guid_file", OPT_OFFSET(root_guid_file), opts_unpack_charp, NULL, 1 }, > + { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_unpack_charp, NULL, 1 }, > + { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_unpack_charp, NULL, 1 }, > + { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_unpack_charp, NULL, 1 }, > + { "sa_db_file", OPT_OFFSET(sa_db_file), opts_unpack_charp, NULL, 1 }, > + { "do_mesh_analysis", OPT_OFFSET(do_mesh_analysis), opts_unpack_boolean, NULL, 1 }, > + { "exit_on_fatal", OPT_OFFSET(exit_on_fatal), opts_unpack_boolean, NULL, 1 }, > + { "honor_guid2lid_file", OPT_OFFSET(honor_guid2lid_file), opts_unpack_boolean, NULL, 1 }, > + { "daemon", OPT_OFFSET(daemon), opts_unpack_boolean, NULL, 0 }, > + { "sm_inactive", OPT_OFFSET(sm_inactive), opts_unpack_boolean, NULL, 1 }, > + { "babbling_port_policy", OPT_OFFSET(babbling_port_policy), opts_unpack_boolean, NULL, 1 }, > + > +#ifdef ENABLE_OSM_PERF_MGR 
> + { "perfmgr", OPT_OFFSET(perfmgr), opts_unpack_boolean, NULL, 0 }, > + { "perfmgr_redir", OPT_OFFSET(perfmgr_redir), opts_unpack_boolean NULL, 0 }, > + { "perfmgr_sweep_time_s", OPT_OFFSET(perfmgr_sweep_time_s), opts_unpack_uint16, NULL, 0 }, > + { "perfmgr_max_outstanding_queries", OPT_OFFSET(perfmgr_max_outstanding_queries), opts_unpack_uint32, NULL, 0 }, > + { "event_db_dump_file", OPT_OFFSET(event_db_dump_file), opts_unpack_charp, NULL, 0 }, > +#endif /* ENABLE_OSM_PERF_MGR */ > + > + { "event_plugin_name", OPT_OFFSET(event_plugin_name), opts_unpack_charp, NULL, 0 }, > + { "node_name_map_name", OPT_OFFSET(node_name_map_name), opts_unpack_charp, NULL, 0 }, > + > + { "qos_max_vls", OPT_OFFSET(qos_max_vls), opts_unpack_uint32, NULL, 1 }, > + { "qos_high_limit", OPT_OFFSET(qos_high_limit), opts_unpack_int32, NULL, 1 }, > + { "qos_vlarb_high", OPT_OFFSET(qos_vlarb_high), opts_unpack_charp, NULL, 1 }, > + { "qos_vlarb_low", OPT_OFFSET(qos_vlarb_low), opts_unpack_charp, NULL, 1 }, > + { "qos_sl2vl", OPT_OFFSET(qos_sl2vl), opts_unpack_charp, NULL, 1 }, > + > + { "qos_ca_max_vls", OPT_OFFSET(qos_ca_max_vls), opts_unpack_uint32, NULL, 1 }, > + { "qos_ca_high_limit", OPT_OFFSET(qos_ca_high_limit), opts_unpack_int32, NULL, 1 }, > + { "qos_ca_vlarb_high", OPT_OFFSET(qos_ca_vlarb_high), opts_unpack_charp, NULL, 1 }, > + { "qos_ca_vlarb_low", OPT_OFFSET(qos_ca_vlarb_low), opts_unpack_charp, NULL, 1 }, > + { "qos_ca_sl2vl", OPT_OFFSET(qos_ca_sl2vl), opts_unpack_charp, NULL, 1 }, > + > + { "qos_sw0_max_vls", OPT_OFFSET(qos_sw0_max_vls), opts_unpack_uint32, NULL, 1 }, > + { "qos_sw0_high_limit", OPT_OFFSET(qos_sw0_high_limit), opts_unpack_int32, NULL, 1 }, > + { "qos_sw0_vlarb_high", OPT_OFFSET(qos_sw0_vlarb_high), opts_unpack_charp, NULL, 1 }, > + { "qos_sw0_vlarb_low", OPT_OFFSET(qos_sw0_vlarb_low), opts_unpack_charp, NULL, 1 }, > + { "qos_sw0_sl2vl", OPT_OFFSET(qos_sw0_sl2vl), opts_unpack_charp, NULL, 1 }, > + > + { "qos_swe_max_vls", OPT_OFFSET(qos_swe_max_vls), 
opts_unpack_uint32, NULL, 1 }, > + { "qos_swe_high_limit", OPT_OFFSET(qos_swe_high_limit), opts_unpack_int32, NULL, 1 }, > + { "qos_swe_vlarb_high", OPT_OFFSET(qos_swe_vlarb_high), opts_unpack_charp, NULL, 1 }, > + { "qos_swe_vlarb_low", OPT_OFFSET(qos_swe_vlarb_low), opts_unpack_charp, NULL, 1 }, > + { "qos_swe_sl2vl", OPT_OFFSET(qos_swe_sl2vl), opts_unpack_charp, NULL, 1 }, > + > + { "qos_rtr_max_vls", OPT_OFFSET(qos_rtr_max_vls), opts_unpack_uint32, NULL, 1 }, > + { "qos_rtr_high_limit", OPT_OFFSET(qos_rtr_high_limit), opts_unpack_int32, NULL, 1 }, > + { "qos_rtr_vlarb_high", OPT_OFFSET(qos_rtr_vlarb_high), opts_unpack_charp, NULL, 1 }, > + { "qos_rtr_vlarb_low", OPT_OFFSET(qos_rtr_vlarb_low), opts_unpack_charp, NULL, 1 }, > + { "qos_rtr_sl2vl", OPT_OFFSET(qos_rtr_sl2vl), opts_unpack_charp, NULL, 1 }, > + > + { "enable_quirks", OPT_OFFSET(enable_quirks), opts_unpack_boolean, NULL, 1 }, > + { "no_clients_rereg", OPT_OFFSET(no_clients_rereg), opts_unpack_boolean, NULL, 1 }, > + { "prefix_routes_file", OPT_OFFSET(prefix_routes_file), opts_unpack_charp, NULL, 1 }, > + { "consolidate_ipv6_snm_req", OPT_OFFSET(consolidate_ipv6_snm_req), opts_unpack_boolean, NULL, 1 }, > + {0} > +}; > + > +static void opts_update_log_max_size(osm_subn_t *p_subn, void *p_val) > +{ > + uint32_t log_max_size = *((uint32_t *) p_val); > + > + if (!p_subn) > + return; > + > + p_subn->opt.log_max_size = log_max_size << 20; /* convert from MB to Bytes */ > +} > + > +static void opts_update_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val) > +{ > + osm_sm_t *p_sm; > + uint32_t sminfo_polling_timeout = *((uint32_t *) p_val); > + > + if (!p_subn) > + return; > + > + p_sm = &p_subn->p_osm->sm; > + cl_timer_stop(&p_sm->polling_timer); > + cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout); > +} > + > +static void opts_update_routing_engine(osm_subn_t *p_subn, void *p_val) > +{ > + char *routing_engine_names = (char *) p_val; > + > + if (!p_subn) > + return; > + > + 
destroy_routing_engines(p_subn->p_osm); > + setup_routing_engines(p_subn->p_osm, routing_engine_names); > +} > + > +static void opts_update_sm_priority(osm_subn_t *p_subn, void *p_val) > +{ > + osm_sm_t *p_sm; > + uint8_t sm_priority = *((uint8_t *) p_val); > + > + if (!p_subn) > + return; > + > + p_sm = &p_subn->p_osm->sm; > + osm_set_sm_priority(p_sm, sm_priority); > +} > + > /********************************************************************** > **********************************************************************/ > void osm_subn_construct(IN osm_subn_t * const p_subn) > @@ -315,32 +492,30 @@ osm_port_t *osm_get_port_by_guid(IN osm_subn_t const *p_subn, IN ib_net64_t guid > **********************************************************************/ > static void subn_set_default_qos_options(IN osm_qos_options_t * opt) > { > - opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS; > - opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT; > - opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH; > - opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW; > - opt->sl2vl = OSM_DEFAULT_QOS_SL2VL; > + opt->qos_max_vls = OSM_DEFAULT_QOS_MAX_VLS; > + opt->qos_high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT; > + opt->qos_vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH; > + opt->qos_vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW; > + opt->qos_sl2vl = OSM_DEFAULT_QOS_SL2VL; > } > > -static void subn_init_qos_options(IN osm_qos_options_t * opt) > -{ > - opt->max_vls = 0; > - opt->high_limit = -1; > - opt->vlarb_high = NULL; > - opt->vlarb_low = NULL; > - opt->sl2vl = NULL; > +#define subn_init_qos_options(opt) \ > +{ \ > + opt ## _max_vls = 0; \ > + opt ## _high_limit = -1; \ > + opt ## _vlarb_high = NULL; \ > + opt ## _vlarb_low = NULL; \ > + opt ## _sl2vl = NULL; \ > } > > -static void subn_free_qos_options(IN osm_qos_options_t * opt) > -{ > - if (opt->vlarb_high && opt->vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) > - free(opt->vlarb_high); > - > - if (opt->vlarb_low && opt->vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) > - free(opt->vlarb_low); 
> - > - if (opt->sl2vl && opt->sl2vl != OSM_DEFAULT_QOS_SL2VL) > - free(opt->sl2vl); > +#define subn_free_qos_options(opt) \ > +{ \ > + if (opt ## _vlarb_high && opt ## _vlarb_high != OSM_DEFAULT_QOS_VLARB_HIGH) \ > + free(opt ## _vlarb_high); \ > + if (opt ## _vlarb_low && opt ## _vlarb_low != OSM_DEFAULT_QOS_VLARB_LOW) \ > + free(opt ## _vlarb_low); \ > + if (opt ## _sl2vl && opt ## _sl2vl != OSM_DEFAULT_QOS_SL2VL) \ > + free(opt ## _sl2vl); \ > } > > /********************************************************************** > @@ -431,11 +606,11 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->no_clients_rereg = FALSE; > p_opt->prefix_routes_file = OSM_DEFAULT_PREFIX_ROUTES_FILE; > p_opt->consolidate_ipv6_snm_req = FALSE; > - subn_init_qos_options(&p_opt->qos_options); > - subn_init_qos_options(&p_opt->qos_ca_options); > - subn_init_qos_options(&p_opt->qos_sw0_options); > - subn_init_qos_options(&p_opt->qos_swe_options); > - subn_init_qos_options(&p_opt->qos_rtr_options); > + subn_init_qos_options(p_opt->qos); > + subn_init_qos_options(p_opt->qos_ca); > + subn_init_qos_options(p_opt->qos_sw0); > + subn_init_qos_options(p_opt->qos_swe); > + subn_init_qos_options(p_opt->qos_rtr); > } > > /********************************************************************** > @@ -470,137 +645,167 @@ static void log_config_value(char *name, const char *fmt, ...) 
> } > > static void > -opts_unpack_net64(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN uint64_t * p_val) > +opts_unpack_net64(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key)) { > - uint64_t val = strtoull(p_val_str, NULL, 0); > - if (cl_hton64(val) != *p_val) { > - log_config_value(p_key, "0x%016" PRIx64, val); > - *p_val = cl_ntoh64(val); > - } > + uint64_t *p_val = (uint64_t *) p_v; > + uint64_t val = strtoull(p_val_str, NULL, 0); > + > + if (cl_hton64(val) != *p_val) { > + log_config_value(p_key, "0x%016" PRIx64, val); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = cl_ntoh64(val); > } > } > > /********************************************************************** > **********************************************************************/ > static void > -opts_unpack_uint32(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN uint32_t * p_val) > +opts_unpack_uint32(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key)) { > - uint32_t val = strtoul(p_val_str, NULL, 0); > - if (val != *p_val) { > - log_config_value(p_key, "%u", val); > - *p_val = val; > - } > + uint32_t *p_val = (uint32_t *) p_v; > + uint32_t val = strtoul(p_val_str, NULL, 0); > + > + if (val != *p_val) { > + log_config_value(p_key, "%u", val); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = val; > } > } > > /********************************************************************** > **********************************************************************/ > static void > -opts_unpack_int32(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN int32_t * p_val) > +opts_unpack_int32(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key)) { > - int32_t val = strtol(p_val_str, NULL, 0); > - if (val != *p_val) { > - 
log_config_value(p_key, "%d", val); > - *p_val = val; > - } > + int32_t *p_val = (int32_t *) p_v; > + int32_t val = strtol(p_val_str, NULL, 0); > + > + if (val != *p_val) { > + log_config_value(p_key, "%d", val); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = val; > } > } > > /********************************************************************** > **********************************************************************/ > static void > -opts_unpack_uint16(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN uint16_t * p_val) > +opts_unpack_uint16(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key)) { > - uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); > - if (val != *p_val) { > - log_config_value(p_key, "%u", val); > - *p_val = val; > - } > + uint16_t *p_val = (uint16_t *) p_v; > + uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); > + > + if (val != *p_val) { > + log_config_value(p_key, "%u", val); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = val; > } > } > > /********************************************************************** > **********************************************************************/ > static void > -opts_unpack_net16(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN uint16_t * p_val) > +opts_unpack_net16(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key)) { > - uint32_t val; > - val = strtoul(p_val_str, NULL, 0); > - CL_ASSERT(val < 0x10000); > - if (cl_hton32(val) != *p_val) { > - log_config_value(p_key, "0x%04x", val); > - *p_val = cl_hton16((uint16_t) val); > - } > + uint16_t *p_val = (uint16_t *) p_v; > + uint32_t val = strtoul(p_val_str, NULL, 0); > + > + CL_ASSERT(val < 0x10000); > + if (cl_hton32(val) != *p_val) { > + log_config_value(p_key, "0x%04x", val); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = cl_hton16((uint16_t) 
val); > } > } > > /********************************************************************** > **********************************************************************/ > static void > -opts_unpack_uint8(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN uint8_t * p_val) > +opts_unpack_uint8(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key)) { > - uint32_t val; > - val = strtoul(p_val_str, NULL, 0); > - CL_ASSERT(val < 0x100); > - if (val != *p_val) { > - log_config_value(p_key, "%u", val); > - *p_val = (uint8_t) val; > - } > + uint8_t *p_val = (uint8_t *) p_v; > + uint32_t val = strtoul(p_val_str, NULL, 0); > + > + CL_ASSERT(val < 0x100); > + if (val != *p_val) { > + log_config_value(p_key, "%u", val); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = (uint8_t) val; > } > } > > /********************************************************************** > **********************************************************************/ > static void > -opts_unpack_boolean(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN boolean_t * p_val) > +opts_unpack_boolean(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key) && p_val_str) { > - boolean_t val; > - if (strcmp("TRUE", p_val_str)) > - val = FALSE; > - else > - val = TRUE; > - > - if (val != *p_val) { > - log_config_value(p_key, "%s", p_val_str); > - *p_val = val; > - } > + boolean_t *p_val = (boolean_t *) p_v; > + boolean_t val; > + > + if (!p_val_str) > + return; > + > + if (strcmp("TRUE", p_val_str)) > + val = FALSE; > + else > + val = TRUE; > + > + if (val != *p_val) { > + log_config_value(p_key, "%s", p_val_str); > + if (pfn) > + pfn(p_subn, &val); > + *p_val = val; > } > } > > /********************************************************************** > **********************************************************************/ > static 
void > -opts_unpack_charp(IN char *p_req_key, > - IN char *p_key, IN char *p_val_str, IN char **p_val) > +opts_unpack_charp(IN osm_subn_t *p_subn, > + IN char *p_key, IN char *p_val_str, > + IN void *p_v, IN update_fn_t pfn) > { > - if (!strcmp(p_req_key, p_key) && p_val_str) { > - const char *current_str = *p_val ? *p_val : null_str ; > - if (strcmp(p_val_str, current_str)) { > - log_config_value(p_key, "%s", p_val_str); > - /* special case the "(null)" string */ > - if (strcmp(null_str, p_val_str) == 0) { > - *p_val = NULL; > - } else { > - /* > - Ignore the possible memory leak here; > - the pointer may be to a static default. > - */ > - *p_val = strdup(p_val_str); > - } > + char **p_val = (char **) p_v; > + const char *current_str = *p_val ? *p_val : null_str ; > + > + if (!p_val_str) > + return; > + > + if (strcmp(p_val_str, current_str)) { > + log_config_value(p_key, "%s", p_val_str); > + /* special case the "(null)" string */ > + if (strcmp(null_str, p_val_str) == 0) { > + if (pfn) > + pfn(p_subn, NULL); > + *p_val = NULL; > + } else { > + if (pfn) > + pfn(p_subn, p_val_str); > + /* > + Ignore the possible memory leak here; > + the pointer may be to a static default. > + */ > + *p_val = strdup(p_val_str); There will be memory leak if you are going to rescan options - likely you need to free a value first if it is not NULL (and allocate on initial default setup). In order to not complicate things a lot it is probably a good idea to split this patch into two: 1) rework config option processing 2) rescan functionality Hmm, BTW what about adding option default values to opt_rec structure? 
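To make the "free first, allocate on initial default setup" idea concrete, here is a minimal sketch. It assumes every non-NULL string option is heap-allocated, so a rescan can always free the previous value. The names opt_set_charp and opt_init_charp are hypothetical and are not existing OpenSM functions.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): replace a string option.
 * Assumption: any non-NULL value already stored in *p_val came from
 * malloc(), including the defaults installed at startup.
 */
static int opt_set_charp(char **p_val, const char *new_str)
{
	char *copy = NULL;

	if (new_str) {
		size_t len = strlen(new_str) + 1;

		copy = malloc(len);
		if (!copy)
			return -1;	/* keep the old value on allocation failure */
		memcpy(copy, new_str, len);
	}

	free(*p_val);		/* free(NULL) is a no-op */
	*p_val = copy;
	return 0;
}

/*
 * Hypothetical initial setup: duplicate the static default so that the
 * field can be freed unconditionally on a later rescan.
 */
static int opt_init_charp(char **p_val, const char *dflt)
{
	*p_val = NULL;
	return opt_set_charp(p_val, dflt);
}
```

With something like this, the comparisons against the static OSM_DEFAULT_* pointers before free() would probably no longer be needed, at the cost of one small allocation per string option at startup.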
> } > } > } > @@ -631,41 +836,20 @@ static char *clean_val(char *val) > > /********************************************************************** > **********************************************************************/ > -static void > -subn_parse_qos_options(IN const char *prefix, > - IN char *p_key, > - IN char *p_val_str, IN osm_qos_options_t * opt) > -{ > - char name[256]; > - > - snprintf(name, sizeof(name), "%s_max_vls", prefix); > - opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls); > - snprintf(name, sizeof(name), "%s_high_limit", prefix); > - opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit); > - snprintf(name, sizeof(name), "%s_vlarb_high", prefix); > - opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high); > - snprintf(name, sizeof(name), "%s_vlarb_low", prefix); > - opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_low); > - snprintf(name, sizeof(name), "%s_sl2vl", prefix); > - opts_unpack_charp(name, p_key, p_val_str, &opt->sl2vl); > -} > - > -static int > -subn_dump_qos_options(FILE * file, > - const char *set_name, > - const char *prefix, osm_qos_options_t * opt) > -{ > - return fprintf(file, "# %s\n" > - "%s_max_vls %u\n" > - "%s_high_limit %d\n" > - "%s_vlarb_high %s\n" > - "%s_vlarb_low %s\n" > - "%s_sl2vl %s\n", > - set_name, > - prefix, opt->max_vls, > - prefix, opt->high_limit, > - prefix, opt->vlarb_high, > - prefix, opt->vlarb_low, prefix, opt->sl2vl); > +#define subn_dump_qos_options(file, name, prefix, opt) \ > +{ \ > + fprintf(file, "# %s\n" \ > + "%s_max_vls %u\n" \ > + "%s_high_limit %d\n" \ > + "%s_vlarb_high %s\n" \ > + "%s_vlarb_low %s\n" \ > + "%s_sl2vl %s\n", \ > + name, \ > + prefix, opt ## _max_vls, \ > + prefix, opt ## _high_limit, \ > + prefix, opt ## _vlarb_high, \ > + prefix, opt ## _vlarb_low, \ > + prefix, opt ## _sl2vl); \ > } > > /********************************************************************** > @@ -904,14 +1088,13 @@ static void subn_verify_sl2vl(char **sl2vl, const char *prefix, char 
*dflt) > free(str); > } > > -static void subn_verify_qos_set(osm_qos_options_t *set, const char *prefix, > - osm_qos_options_t *dflt) > -{ > - subn_verify_max_vls(&set->max_vls, prefix, dflt->max_vls); > - subn_verify_high_limit(&set->high_limit, prefix, dflt->high_limit); > - subn_verify_vlarb(&set->vlarb_low, prefix, "low", dflt->vlarb_low); > - subn_verify_vlarb(&set->vlarb_high, prefix, "high", dflt->vlarb_high); > - subn_verify_sl2vl(&set->sl2vl, prefix, dflt->sl2vl); > +#define subn_verify_qos_set(set, prefix, dflt) \ > +{ \ > + subn_verify_max_vls(&set ## _max_vls, prefix, dflt ## _max_vls); \ > + subn_verify_high_limit(&set ## _high_limit, prefix, dflt ## _high_limit); \ > + subn_verify_vlarb(&set ## _vlarb_low, prefix, "low", dflt ## _vlarb_low); \ > + subn_verify_vlarb(&set ## _vlarb_high, prefix, "high", dflt ## _vlarb_high); \ > + subn_verify_sl2vl(&set ## _sl2vl, prefix, dflt ## _sl2vl); \ > } > > int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) > @@ -960,15 +1143,11 @@ int osm_subn_verify_config(IN osm_subn_opt_t * const p_opts) > > subn_set_default_qos_options(&dflt); > > - subn_verify_qos_set(&p_opts->qos_options, "qos", &dflt); > - subn_verify_qos_set(&p_opts->qos_ca_options, "qos_ca", > - &p_opts->qos_options); > - subn_verify_qos_set(&p_opts->qos_sw0_options, "qos_sw0", > - &p_opts->qos_options); > - subn_verify_qos_set(&p_opts->qos_swe_options, "qos_swe", > - &p_opts->qos_options); > - subn_verify_qos_set(&p_opts->qos_rtr_options, "qos_rtr", > - &p_opts->qos_options); > + subn_verify_qos_set(p_opts->qos, "qos", dflt.qos); > + subn_verify_qos_set(p_opts->qos_ca, "qos_ca", p_opts->qos); > + subn_verify_qos_set(p_opts->qos_sw0, "qos_sw0", p_opts->qos); > + subn_verify_qos_set(p_opts->qos_swe, "qos_swe", p_opts->qos); > + subn_verify_qos_set(p_opts->qos_rtr, "qos_rtr", p_opts->qos); > } > > #ifdef ENABLE_OSM_PERF_MGR > @@ -1000,6 +1179,8 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) > char line[1024]; > 
FILE *opts_file; > char *p_key, *p_val; > + const opt_rec_t *r; > + char *p_field; void * (generic pointer)? > > opts_file = fopen(file_name, "r"); > if (!opts_file) { > @@ -1023,231 +1204,14 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) > > p_val = clean_val(p_val); > > - opts_unpack_net64("guid", p_key, p_val, &p_opts->guid); > - > - opts_unpack_net64("m_key", p_key, p_val, &p_opts->m_key); > - > - opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); > - > - opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); > - > - opts_unpack_net64("subnet_prefix", > - p_key, p_val, &p_opts->subnet_prefix); > - > - opts_unpack_net16("m_key_lease_period", > - p_key, p_val, &p_opts->m_key_lease_period); > - > - opts_unpack_uint32("sweep_interval", > - p_key, p_val, &p_opts->sweep_interval); > - > - opts_unpack_uint32("max_wire_smps", > - p_key, p_val, &p_opts->max_wire_smps); > - > - opts_unpack_charp("console", p_key, p_val, &p_opts->console); > - > - opts_unpack_uint16("console_port", > - p_key, p_val, &p_opts->console_port); > - > - opts_unpack_uint32("transaction_timeout", > - p_key, p_val, &p_opts->transaction_timeout); > - > - opts_unpack_uint32("max_msg_fifo_timeout", > - p_key, p_val, &p_opts->max_msg_fifo_timeout); > - > - opts_unpack_uint8("sm_priority", > - p_key, p_val, &p_opts->sm_priority); > - > - opts_unpack_uint8("lmc", p_key, p_val, &p_opts->lmc); > - > - opts_unpack_boolean("lmc_esp0", > - p_key, p_val, &p_opts->lmc_esp0); > - > - opts_unpack_uint8("max_op_vls", > - p_key, p_val, &p_opts->max_op_vls); > - > - opts_unpack_uint8("force_link_speed", > - p_key, p_val, &p_opts->force_link_speed); > - > - opts_unpack_boolean("reassign_lids", > - p_key, p_val, &p_opts->reassign_lids); > - > - opts_unpack_boolean("ignore_other_sm", > - p_key, p_val, &p_opts->ignore_other_sm); > - > - opts_unpack_boolean("single_thread", > - p_key, p_val, &p_opts->single_thread); > - > - opts_unpack_boolean("disable_multicast", > - 
p_key, p_val, &p_opts->disable_multicast); > - > - opts_unpack_boolean("force_log_flush", > - p_key, p_val, &p_opts->force_log_flush); > - > - opts_unpack_uint8("subnet_timeout", > - p_key, p_val, &p_opts->subnet_timeout); > - > - opts_unpack_uint8("packet_life_time", > - p_key, p_val, &p_opts->packet_life_time); > - > - opts_unpack_uint8("vl_stall_count", > - p_key, p_val, &p_opts->vl_stall_count); > - > - opts_unpack_uint8("leaf_vl_stall_count", > - p_key, p_val, &p_opts->leaf_vl_stall_count); > - > - opts_unpack_uint8("head_of_queue_lifetime", p_key, p_val, > - &p_opts->head_of_queue_lifetime); > - > - opts_unpack_uint8("leaf_head_of_queue_lifetime", p_key, p_val, > - &p_opts->leaf_head_of_queue_lifetime); > - > - opts_unpack_uint8("local_phy_errors_threshold", p_key, p_val, > - &p_opts->local_phy_errors_threshold); > - > - opts_unpack_uint8("overrun_errors_threshold", p_key, p_val, > - &p_opts->overrun_errors_threshold); > - > - opts_unpack_uint32("sminfo_polling_timeout", p_key, p_val, > - &p_opts->sminfo_polling_timeout); > - > - opts_unpack_uint32("polling_retry_number", > - p_key, p_val, &p_opts->polling_retry_number); > - > - opts_unpack_boolean("force_heavy_sweep", > - p_key, p_val, &p_opts->force_heavy_sweep); > - > - opts_unpack_uint8("log_flags", > - p_key, p_val, &p_opts->log_flags); > - > - opts_unpack_charp("port_prof_ignore_file", p_key, p_val, > - &p_opts->port_prof_ignore_file); > - > - opts_unpack_boolean("port_profile_switch_nodes", p_key, p_val, > - &p_opts->port_profile_switch_nodes); > - > - opts_unpack_boolean("sweep_on_trap", > - p_key, p_val, &p_opts->sweep_on_trap); > - > - opts_unpack_charp("routing_engine", > - p_key, p_val, &p_opts->routing_engine_names); > - > - opts_unpack_boolean("connect_roots", > - p_key, p_val, &p_opts->connect_roots); > - > - opts_unpack_boolean("use_ucast_cache", > - p_key, p_val, &p_opts->use_ucast_cache); > - > - opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); > - > - 
opts_unpack_uint32("log_max_size", p_key, p_val, > - (void *) & p_opts->log_max_size); > - p_opts->log_max_size *= 1024 * 1024; /* convert to MB */ > - > - opts_unpack_charp("partition_config_file", > - p_key, p_val, &p_opts->partition_config_file); > - > - opts_unpack_boolean("no_partition_enforcement", p_key, p_val, > - &p_opts->no_partition_enforcement); > - > - opts_unpack_boolean("qos", p_key, p_val, &p_opts->qos); > - > - opts_unpack_charp("qos_policy_file", > - p_key, p_val, &p_opts->qos_policy_file); > - > - opts_unpack_boolean("accum_log_file", > - p_key, p_val, &p_opts->accum_log_file); > - > - opts_unpack_charp("dump_files_dir", > - p_key, p_val, &p_opts->dump_files_dir); > - > - opts_unpack_charp("lid_matrix_dump_file", > - p_key, p_val, &p_opts->lid_matrix_dump_file); > - > - opts_unpack_charp("lfts_file", > - p_key, p_val, &p_opts->lfts_file); > - > - opts_unpack_charp("root_guid_file", > - p_key, p_val, &p_opts->root_guid_file); > - > - opts_unpack_charp("cn_guid_file", > - p_key, p_val, &p_opts->cn_guid_file); > - > - opts_unpack_charp("ids_guid_file", > - p_key, p_val, &p_opts->ids_guid_file); > - > - opts_unpack_charp("guid_routing_order_file", p_key, p_val, > - &p_opts->guid_routing_order_file); > - > - opts_unpack_charp("sa_db_file", > - p_key, p_val, &p_opts->sa_db_file); > - > - opts_unpack_boolean("do_mesh_analysis", > - p_key, p_val, &p_opts->do_mesh_analysis); > - > - opts_unpack_boolean("exit_on_fatal", > - p_key, p_val, &p_opts->exit_on_fatal); > - > - opts_unpack_boolean("honor_guid2lid_file", > - p_key, p_val, &p_opts->honor_guid2lid_file); > - > - opts_unpack_boolean("daemon", p_key, p_val, &p_opts->daemon); > - > - opts_unpack_boolean("sm_inactive", > - p_key, p_val, &p_opts->sm_inactive); > - > - opts_unpack_boolean("babbling_port_policy", > - p_key, p_val, > - &p_opts->babbling_port_policy); > - > -#ifdef ENABLE_OSM_PERF_MGR > - opts_unpack_boolean("perfmgr", p_key, p_val, &p_opts->perfmgr); > - > - 
opts_unpack_boolean("perfmgr_redir", > - p_key, p_val, &p_opts->perfmgr_redir); > - > - opts_unpack_uint16("perfmgr_sweep_time_s", > - p_key, p_val, &p_opts->perfmgr_sweep_time_s); > - > - opts_unpack_uint32("perfmgr_max_outstanding_queries", > - p_key, p_val, > - &p_opts->perfmgr_max_outstanding_queries); > - > - opts_unpack_charp("event_db_dump_file", > - p_key, p_val, &p_opts->event_db_dump_file); > -#endif /* ENABLE_OSM_PERF_MGR */ > - > - opts_unpack_charp("event_plugin_name", > - p_key, p_val, &p_opts->event_plugin_name); > - > - opts_unpack_charp("node_name_map_name", > - p_key, p_val, &p_opts->node_name_map_name); > - > - subn_parse_qos_options("qos", > - p_key, p_val, &p_opts->qos_options); > - > - subn_parse_qos_options("qos_ca", > - p_key, p_val, &p_opts->qos_ca_options); > - > - subn_parse_qos_options("qos_sw0", > - p_key, p_val, &p_opts->qos_sw0_options); > - > - subn_parse_qos_options("qos_swe", > - p_key, p_val, &p_opts->qos_swe_options); > - > - subn_parse_qos_options("qos_rtr", > - p_key, p_val, &p_opts->qos_rtr_options); > - > - opts_unpack_boolean("enable_quirks", > - p_key, p_val, &p_opts->enable_quirks); > - > - opts_unpack_boolean("no_clients_rereg", > - p_key, p_val, &p_opts->no_clients_rereg); > - > - opts_unpack_charp("prefix_routes_file", > - p_key, p_val, &p_opts->prefix_routes_file); > + for (r = opt_tbl; r->name; r++) { > + if (strcmp(r->name, p_key)) > + continue; > > - opts_unpack_boolean("consolidate_ipv6_snm_req", p_key, p_val, > - &p_opts->consolidate_ipv6_snm_req); > + p_field = (char *)p_opts + r->field_offset; > + r->unpack_fn(NULL, p_key, > + p_val, p_field, r->update_fn); > + } > } > fclose(opts_file); > > @@ -1258,61 +1222,58 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) > > int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) > { > + osm_subn_opt_t *p_opts = &p_subn->opt; > + const opt_rec_t *r; > FILE *opts_file; > char line[1024]; > - char *p_key, *p_val, *p_last; > + char *p_key, 
*p_val; > + char *p_field; > > - if (!p_subn->opt.config_file) > + if (!p_opts->config_file) > return 0; > > - opts_file = fopen(p_subn->opt.config_file, "r"); > + opts_file = fopen(p_opts->config_file, "r"); > if (!opts_file) { > if (errno == ENOENT) > return 1; > OSM_LOG(&p_subn->p_osm->log, OSM_LOG_ERROR, > "cannot open file \'%s\': %s\n", > - p_subn->opt.config_file, strerror(errno)); > + p_opts->config_file, strerror(errno)); > return -1; > } > > - subn_free_qos_options(&p_subn->opt.qos_options); > - subn_free_qos_options(&p_subn->opt.qos_ca_options); > - subn_free_qos_options(&p_subn->opt.qos_sw0_options); > - subn_free_qos_options(&p_subn->opt.qos_swe_options); > - subn_free_qos_options(&p_subn->opt.qos_rtr_options); > + subn_free_qos_options(p_opts->qos); > + subn_free_qos_options(p_opts->qos_ca); > + subn_free_qos_options(p_opts->qos_sw0); > + subn_free_qos_options(p_opts->qos_swe); > + subn_free_qos_options(p_opts->qos_rtr); > > - subn_init_qos_options(&p_subn->opt.qos_options); > - subn_init_qos_options(&p_subn->opt.qos_ca_options); > - subn_init_qos_options(&p_subn->opt.qos_sw0_options); > - subn_init_qos_options(&p_subn->opt.qos_swe_options); > - subn_init_qos_options(&p_subn->opt.qos_rtr_options); > + subn_init_qos_options(p_opts->qos); > + subn_init_qos_options(p_opts->qos_ca); > + subn_init_qos_options(p_opts->qos_sw0); > + subn_init_qos_options(p_opts->qos_swe); > + subn_init_qos_options(p_opts->qos_rtr); > > while (fgets(line, 1023, opts_file) != NULL) { > /* get the first token */ > - p_key = strtok_r(line, " \t\n", &p_last); > - if (p_key) { > - p_val = strtok_r(NULL, " \t\n", &p_last); > - > - subn_parse_qos_options("qos", p_key, p_val, > - &p_subn->opt.qos_options); > - > - subn_parse_qos_options("qos_ca", p_key, p_val, > - &p_subn->opt.qos_ca_options); > - > - subn_parse_qos_options("qos_sw0", p_key, p_val, > - &p_subn->opt.qos_sw0_options); > + p_key = strtok_r(line, " \t\n", &p_val); > + if (!p_key) > + continue; > > - 
subn_parse_qos_options("qos_swe", p_key, p_val, > - &p_subn->opt.qos_swe_options); > + p_val = clean_val(p_val); > > - subn_parse_qos_options("qos_rtr", p_key, p_val, > - &p_subn->opt.qos_rtr_options); > + for (r = opt_tbl; r->name; r++) { > + if (!r->can_update || strcmp(r->name, p_key)) > + continue; > > + p_field = (char *)p_opts + r->field_offset; > + r->unpack_fn(p_subn, p_key, > + p_val, p_field, r->update_fn); > } > } > fclose(opts_file); > > - osm_subn_verify_config(&p_subn->opt); > + osm_subn_verify_config(p_opts); > > osm_parse_prefix_routes_file(p_subn); > > @@ -1640,23 +1601,23 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t *const p_opts) > > subn_dump_qos_options(out, > "QoS default options", "qos", > - &p_opts->qos_options); > + p_opts->qos); > fprintf(out, "\n"); > subn_dump_qos_options(out, > "QoS CA options", "qos_ca", > - &p_opts->qos_ca_options); > + p_opts->qos_ca); > fprintf(out, "\n"); > subn_dump_qos_options(out, > "QoS Switch Port 0 options", "qos_sw0", > - &p_opts->qos_sw0_options); > + p_opts->qos_sw0); > fprintf(out, "\n"); > subn_dump_qos_options(out, > "QoS Switch external ports options", "qos_swe", > - &p_opts->qos_swe_options); > + p_opts->qos_swe); > fprintf(out, "\n"); > subn_dump_qos_options(out, > "QoS Router ports options", "qos_rtr", > - &p_opts->qos_rtr_options); > + p_opts->qos_rtr); > fprintf(out, "\n"); > > fprintf(out, > -- > 1.5.5 > From vlad at lists.openfabrics.org Thu Jan 22 03:20:14 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 22 Jan 2009 03:20:14 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090122-0200 daily build status Message-ID: <20090122112014.F06A0E601C7@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 
with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dave at thedillows.org Thu Jan 22 04:59:20 2009 From: dave at thedillows.org (David Dillow) Date: Thu, 22 Jan 2009 07:59:20 -0500 Subject: [ofa-general] Does ib0 always map to port1? 
In-Reply-To: <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com> References: <49777001.3050301@oracle.com> <49782409.1080208@Voltaire.com> <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com> Message-ID: <1232629160.9832.2.camel@lap75545.ornl.gov> On Thu, 2009-01-22 at 10:18 +0200, Dotan Barak wrote: > On Thu, Jan 22, 2009 at 9:45 AM, Or Gerlitz wrote: > > Sumeet Lahorani wrote: > >> > >> I see that ib0 always maps to port1 and ib1 always maps to port2 on the > >> HCA. I'm trying to find out if this will always be the case and if so which > >> script ensures this mapping? > > > > Yes, on a dual ported HCA, ib0 maps to port1 and ib1 to port2, this is a > > property of the ipoib driver regardless of which HW is used, see > > drivers/infiniband/ulps/ipoib/ipoib_main :: ipoib_add_one() > > > > Part of the "MAC" address of the I/F is the port GUID, so you can use > this value to determine which HCA.port is mapped to that I/Fs... > (if there are several HCAs in the same host, this can be very useful ...) This is the better way to go for general use, as it is entirely possible for a user to change the device names such that ib0 swaps with ib1: ip link set ib0 name ib0_rename ip link set ib1 name ib0 ip link set ib0_rename name ib1 Now, it is probably not very likely that a user will do this, but Dotan's right that it can get fun when there are multiple HCAs, so it is a good idea in any event. From sashak at voltaire.com Thu Jan 22 05:09:52 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 15:09:52 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/Makefile.am: kill -rpath Message-ID: <20090122130952.GW3479@sashak.voltaire.com> Kill -rpath usage in LDFLAGS.
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/Makefile.am | 5 ----- 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index c22ba5e..91a7e9a 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -37,7 +37,6 @@ src_ibaddr_CFLAGS = -Wall $(DBGFLAGS) src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS) -src_ibnetdiscover_LDFLAGS = -Wl,--rpath -Wl,$(libdir) src_ibping_SOURCES = src/ibping.c src/ibdiag_common.c src_ibping_CFLAGS = -Wall $(DBGFLAGS) @@ -56,7 +55,6 @@ src_ibsysstat_CFLAGS = -Wall $(DBGFLAGS) src_ibtracert_SOURCES = src/ibtracert.c src/ibdiag_common.c src_ibtracert_CFLAGS = -Wall $(DBGFLAGS) -src_ibtracert_LDFLAGS = -Wl,--rpath -Wl,$(libdir) src_perfquery_SOURCES = src/perfquery.c src/ibdiag_common.c src_perfquery_CFLAGS = -Wall $(DBGFLAGS) @@ -69,15 +67,12 @@ src_smpdump_CFLAGS = -Wall $(DBGFLAGS) src_smpquery_SOURCES = src/smpquery.c src/ibdiag_common.c src_smpquery_CFLAGS = -Wall $(DBGFLAGS) -src_smpquery_LDFLAGS = -Wl,--rpath -Wl,$(libdir) src_saquery_SOURCES = src/saquery.c src/ibdiag_common.c src_saquery_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -src_saquery_LDFLAGS = -Wl,--rpath -Wl,$(libdir) src_ibsendtrap_SOURCES = src/ibsendtrap.c src/ibdiag_common.c src_ibsendtrap_CFLAGS = -Wall $(DBGFLAGS) -src_ibsendtrap_LDFLAGS = -Wl,--rpath -Wl,$(libdir) src_vendstat_SOURCES = src/vendstat.c src/ibdiag_common.c src_vendstat_CFLAGS = -Wall $(DBGFLAGS) -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Thu Jan 22 05:10:28 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 15:10:28 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/Makefile.am: merge CFLAGS In-Reply-To: <20090122130952.GW3479@sashak.voltaire.com> References: <20090122130952.GW3479@sashak.voltaire.com> Message-ID: 
<20090122131028.GX3479@sashak.voltaire.com> Merge per tool CFLAGS - use common AM_CFLAGS definition. saquery doesn't need custom CFLAGS anymore since proper OpenSM vendor configuration is included via osm_vendor_api.h now. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/Makefile.am | 33 ++------------------------------- 1 files changed, 2 insertions(+), 31 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index 91a7e9a..fe83beb 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -32,53 +32,24 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \ scripts/ibfindnodesusing.pl scripts/ibidsverify.pl \ scripts/check_lft_balance.pl -src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c -src_ibaddr_CFLAGS = -Wall $(DBGFLAGS) +AM_CFLAGS = -Wall $(DBGFLAGS) +src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c -src_ibnetdiscover_CFLAGS = -Wall $(DBGFLAGS) - src_ibping_SOURCES = src/ibping.c src/ibdiag_common.c -src_ibping_CFLAGS = -Wall $(DBGFLAGS) - src_ibportstate_SOURCES = src/ibportstate.c src/ibdiag_common.c -src_ibportstate_CFLAGS = -Wall $(DBGFLAGS) - src_ibroute_SOURCES = src/ibroute.c src/ibdiag_common.c -src_ibroute_CFLAGS = -Wall $(DBGFLAGS) - src_ibstat_SOURCES = src/ibstat.c -src_ibstat_CFLAGS = -Wall $(DBGFLAGS) - src_ibsysstat_SOURCES = src/ibsysstat.c src/ibdiag_common.c -src_ibsysstat_CFLAGS = -Wall $(DBGFLAGS) - src_ibtracert_SOURCES = src/ibtracert.c src/ibdiag_common.c -src_ibtracert_CFLAGS = -Wall $(DBGFLAGS) - src_perfquery_SOURCES = src/perfquery.c src/ibdiag_common.c -src_perfquery_CFLAGS = -Wall $(DBGFLAGS) - src_sminfo_SOURCES = src/sminfo.c src/ibdiag_common.c -src_sminfo_CFLAGS = -Wall $(DBGFLAGS) - src_smpdump_SOURCES = src/smpdump.c -src_smpdump_CFLAGS = -Wall $(DBGFLAGS) - src_smpquery_SOURCES = src/smpquery.c src/ibdiag_common.c -src_smpquery_CFLAGS = -Wall 
$(DBGFLAGS) - src_saquery_SOURCES = src/saquery.c src/ibdiag_common.c -src_saquery_CFLAGS = -Wall -DOSM_VENDOR_INTF_OPENIB -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) - src_ibsendtrap_SOURCES = src/ibsendtrap.c src/ibdiag_common.c -src_ibsendtrap_CFLAGS = -Wall $(DBGFLAGS) - src_vendstat_SOURCES = src/vendstat.c src/ibdiag_common.c -src_vendstat_CFLAGS = -Wall $(DBGFLAGS) - src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c -src_mcm_rereg_test_CFLAGS = -Wall $(DBGFLAGS) man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Thu Jan 22 05:11:15 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 15:11:15 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/Makefile.am: use common library In-Reply-To: <20090122131028.GX3479@sashak.voltaire.com> References: <20090122130952.GW3479@sashak.voltaire.com> <20090122131028.GX3479@sashak.voltaire.com> Message-ID: <20090122131115.GY3479@sashak.voltaire.com> Use common static library libcommon.a for all tools. 
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/Makefile.am | 30 +++++++++++++++++------------- 1 files changed, 17 insertions(+), 13 deletions(-) diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index fe83beb..f9cc5bd 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -32,23 +32,27 @@ sbin_SCRIPTS = scripts/ibcheckerrs scripts/ibchecknet scripts/ibchecknode \ scripts/ibfindnodesusing.pl scripts/ibidsverify.pl \ scripts/check_lft_balance.pl +noinst_LIBRARIES = libcommon.a + AM_CFLAGS = -Wall $(DBGFLAGS) +LDADD = libcommon.a -src_ibaddr_SOURCES = src/ibaddr.c src/ibdiag_common.c -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c src/ibdiag_common.c -src_ibping_SOURCES = src/ibping.c src/ibdiag_common.c -src_ibportstate_SOURCES = src/ibportstate.c src/ibdiag_common.c -src_ibroute_SOURCES = src/ibroute.c src/ibdiag_common.c +libcommon_a_SOURCES = src/ibdiag_common.c +src_ibaddr_SOURCES = src/ibaddr.c +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c +src_ibping_SOURCES = src/ibping.c +src_ibportstate_SOURCES = src/ibportstate.c +src_ibroute_SOURCES = src/ibroute.c src_ibstat_SOURCES = src/ibstat.c -src_ibsysstat_SOURCES = src/ibsysstat.c src/ibdiag_common.c -src_ibtracert_SOURCES = src/ibtracert.c src/ibdiag_common.c -src_perfquery_SOURCES = src/perfquery.c src/ibdiag_common.c -src_sminfo_SOURCES = src/sminfo.c src/ibdiag_common.c +src_ibsysstat_SOURCES = src/ibsysstat.c +src_ibtracert_SOURCES = src/ibtracert.c +src_perfquery_SOURCES = src/perfquery.c +src_sminfo_SOURCES = src/sminfo.c src_smpdump_SOURCES = src/smpdump.c -src_smpquery_SOURCES = src/smpquery.c src/ibdiag_common.c -src_saquery_SOURCES = src/saquery.c src/ibdiag_common.c -src_ibsendtrap_SOURCES = src/ibsendtrap.c src/ibdiag_common.c -src_vendstat_SOURCES = src/vendstat.c src/ibdiag_common.c +src_smpquery_SOURCES = src/smpquery.c +src_saquery_SOURCES = src/saquery.c +src_ibsendtrap_SOURCES = src/ibsendtrap.c 
+src_vendstat_SOURCES = src/vendstat.c src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Thu Jan 22 05:18:12 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 15:18:12 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/ibdiag_common: move get_build_version() In-Reply-To: <20090122131115.GY3479@sashak.voltaire.com> References: <20090122130952.GW3479@sashak.voltaire.com> <20090122131028.GX3479@sashak.voltaire.com> <20090122131115.GY3479@sashak.voltaire.com> Message-ID: <20090122131812.GZ3479@sashak.voltaire.com> Move get_build_version() function to source file - so for version change only relink will be necessary for all tools rather than full rebuild. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/ibdiag_common.h | 8 +------- infiniband-diags/src/ibdiag_common.c | 6 ++++++ 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 2c81d76..4cc9748 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -48,12 +48,6 @@ extern int ibdebug; #define IBERROR(fmt, args...) iberror(__FUNCTION__, fmt, ## args) extern void iberror(const char *fn, char *msg, ...); - -#include - -static inline const char *get_build_version(void) -{ - return "BUILD VERSION: " IBDIAG_VERSION " Build date: " __DATE__ " " __TIME__; -} +extern const char *get_build_version(void); #endif /* _IBDIAG_COMMON_H_ */ diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index 37b60cf..ec22204 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -48,6 +48,7 @@ #include #include +#include int ibdebug; @@ -73,3 +74,8 @@ void iberror(const char *fn, char *msg, ...) 
exit(-1); } + +const char *get_build_version(void) +{ + return "BUILD VERSION: " IBDIAG_VERSION " Build date: " __DATE__ " " __TIME__; +} -- 1.6.0.4.766.g6fc4a From michael.heinz at qlogic.com Thu Jan 22 06:25:59 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Thu, 22 Jan 2009 08:25:59 -0600 Subject: [ofa-general] FW: [PATCH] mstvpd (resend) Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74E5CFFA@MNEXMB1.qlogic.org> I sent this a week ago, and never got any kind of response - this is a patch (included both inline and as an attachment) for mstvpd. Should it be directed someplace else? -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania From: Mike Heinz Sent: Wednesday, January 14, 2009 1:42 PM To: 'general at lists.openfabrics.org' Subject: [PATCH] mstvpd (resend) We've repeatedly run into a problem where mstvpd can hang on certain HCA models, and on HCAs that have failed. This is an issue for us, because mstvpd is one of the tools we use to automatically capture information about a system that's experiencing problems. I previously opened PR 1440 on the problem, but it doesn't appear to have been investigated. For this reason, I'm proposing the attached patch. Basically, it adds a configurable time out and it terminates the attempt to read the VPD area if it fails to retrieve data before the time out expires. The default is 30 seconds. It uses a stupid busy-loop to check for time out because that's what the existing code does. Other changes were also made to support this change - I changed how command line options are processed and extended the usage() function. --- vpd.c.orig 2009-01-08 16:56:12.000000000 -0500 +++ vpd.c 2009-01-08 17:44:01.000000000 -0500 @@ -44,6 +44,13 @@ #include #include #include +#include + +/* pread is non-blocking, so we loop until we find data. Unfortunately, + * we can loop forever if the HCA is crashed or if the wrong device is + * specified as an argument. So, we set time outs. 
+ */ +static clock_t ticks_per_sec, start_t, curr_t, timeout_t = 30; struct vpd_cap { unsigned char id; @@ -168,7 +175,13 @@ if (ret != sizeof addr_flag) return ret; + start_t = times(NULL); while((addr_flag[1] & VPD_FLAG) != VPD_FLAG_READ_READY) { + curr_t = times(NULL); + if ((curr_t - start_t) / ticks_per_sec > timeout_t) { + return -EIO; + } + ret = pread(device, addr_flag, sizeof addr_flag, vpd_cap_offset + VPD_ADDR_OFFSET); if (ret != sizeof addr_flag) @@ -437,24 +450,34 @@ rc = 1; goto usage; } - if (argc == 3) { - if (!strcmp("-m", argv[1])) { - argv++; - argc--; - m = 1; - } else if (!strcmp("-n", argv[1])) { - argv++; - argc--; - n = 1; - } else { - rc = 2; - goto usage; + + ticks_per_sec = sysconf(_SC_CLK_TCK); + + do + { + i=getopt(argc, argv, "mnt:"); + if (i<0) { + break; } - } - name = argv[1]; - argv++; - argc--; + switch (i) { + case 'm': + m=1; + break; + case 'n': + n=1; + break; + case 't': + timeout_t = strtol(optarg, NULL, 0); + break; + default: + goto usage; + } + } while (1 == 1); + + name = argv[optind]; + argc -= optind; + argv += optind; if (!strcmp("-", name)) { if (fread(d, VPD_MAX_SIZE, 1, stdin) != 1) @@ -486,6 +509,14 @@ return 0; usage: - fprintf(stderr, "Usage: %s [-m|-n] [-- keyword ...]\n", argv[0]); + fprintf(stderr, "Usage: %s [-m|-n] [-t ##] [-- keyword ...]\n", argv[0]); + fprintf(stderr, "-m\tDump raw VPD data to stdout.\n"); + fprintf(stderr, "-n\tDo not validate check sum.\n"); + fprintf(stderr, "-t ##\tTime out after ## seconds. (Default is 30.)\n\n"); + fprintf(stderr, "file\tThe PCI id number of the HCA (for example, \"2:00.0\"),\n"); + fprintf(stderr, "\tthe device name (such as \"mlx4_0\")\n"); + fprintf(stderr, "\tthe absolute path to the device (\"/sys/class/infiniband/mlx4_0/device\")\n"); + fprintf(stderr, "\tor '-' to read VPD data from the standard input.\n\n"); + fprintf(stderr, "keyword(s): Only display the requested information. 
(ID, PN, EC, SN, etc...)\n"); return rc; } -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mstvpd.patch Type: application/octet-stream Size: 2501 bytes Desc: mstvpd.patch URL: From jackm at dev.mellanox.co.il Thu Jan 22 06:30:57 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 22 Jan 2009 16:30:57 +0200 Subject: [ofa-general] Kernel panic in IPoIB (RHEL5.1) Message-ID: <200901221630.57419.jackm@dev.mellanox.co.il> We saw the following kernel panic when testing ipoib stability intensively by simultaneously (i.e., in separate processes, with random wait intervals) doing: - ifconfig up/down - opensm up/down - ipoib ping - arp delete - driver up/down Does anyone have ideas as to what might have happened? (the actual crash: is at: static inline struct sk_buff *__skb_dequeue(struct sk_buff_head *list) { struct sk_buff *next, *prev, *result; prev = (struct sk_buff *) list; next = prev->next; result = NULL; if (next != prev) { result = next; next = next->next; list->qlen--; ====> here ==> next->prev = prev; prev->next = next; result->next = result->prev = NULL; } return result; } This is called by: ipoib_neigh_free: void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh) { struct sk_buff *skb; *to_ipoib_neigh(neigh->neighbour) = NULL; ===> while ((skb = __skb_dequeue(&neigh->queue))) { Which is called by path_free: static void path_free(struct net_device *dev, struct ipoib_path *path) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tn; struct sk_buff *skb; unsigned long flags; while ((skb = __skb_dequeue(&path->queue))) dev_kfree_skb_irq(skb); spin_lock_irqsave(&priv->lock, flags); list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) { /* * It's safe to call ipoib_put_ah() inside priv->lock * here, 
because we know that path->ah will always * hold one more reference, so ipoib_put_ah() will * never do more than decrement the ref count. */ if (neigh->ah) ipoib_put_ah(neigh->ah); ===> HERE ipoib_neigh_free(dev, neigh); } CONSOLE DUMP ============== ib0: ib_sa_path_rec_get failed: -11 Unable to handle kernel NULL pointer dereference at 0000000000000009 RIP: [] :ib_ipoib:ipoib_neigh_free+0x2f/0x6e PGD 0 Oops: 0002 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:01.0/irq CPU 2 Modules linked in: netconsole nfs fscache nfsd exportfs lockd nfs_acl autofs4 hidp rfcomm l2cap bluetooth sunrpc rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg ide_cd pcspkr cdrom mlx4_core(U) bnx2 k8_edac k8temp hwmon serio_raw edac_mc shpchp sata_svw libata megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 2195, comm: ipoib Not tainted 2.6.18-53.el5 #1 RIP: 0010:[] [] :ib_ipoib:ipoib_neigh_free+0x2f/0x6e RSP: 0018:ffff8102273c7d30 EFLAGS: 00010012 RAX: 0000000000000001 RBX: ffff81012b95f0c0 RCX: ffff81010acbdc20 RDX: ffff81012b95f0e0 RSI: ffff81012b95f0c0 RDI: ffffffff8840e7a0 RBP: ffff810122b54500 R08: ffff8102273c6000 R09: ffff810227eacd30 R10: ffff810227eacd18 R11: ffffffff883c00cf R12: ffff81010acbdbc0 R13: 0000000000000246 R14: ffff810122b54500 R15: ffff810122b54000 FS: 00002aaaab22bb00(0000) GS:ffff8101041593c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000009 CR3: 0000000000201000 CR4: 00000000000006e0 Process ipoib (pid: 2195, threadinfo ffff8102273c6000, task ffff810126b067a0) Stack: ffff810122b54500 ffff81010acbdbf0 ffff81010acbdbf0 ffffffff883f201d ffffffff800884ac ffff81010acbdbc0 ffff810122b54000 ffff8102273c7cc0 0000000000000246 ffff810122b54500 
ffff810122b54280 ffffffff883f3559 Call Trace: [] :ib_ipoib:path_free+0xc7/0x116 [] default_wake_function+0x0/0xe [] :ib_ipoib:ipoib_flush_paths+0x117/0x186 [] :ib_ipoib:ipoib_ib_dev_flush_normal+0x0/0x11 [] :ib_ipoib:ipoib_ib_dev_down+0xac/0xb3 [] :ib_ipoib:__ipoib_ib_dev_flush+0x1a9/0x1b6 [] run_workqueue+0x94/0xe5 Code: 48 89 50 08 48 89 43 20 48 c7 47 08 00 00 00 00 48 c7 07 00 RIP [] :ib_ipoib:ipoib_neigh_free+0x2f/0x6e RSP CR2: 0000000000000009 <0>Kernel panic - not syncing: Fatal exception - Jack From sashak at voltaire.com Thu Jan 22 06:43:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 22 Jan 2009 16:43:33 +0200 Subject: [ofa-general] Question about perfmgr on OpenSM In-Reply-To: <49763AAE.1010605@morey-chaisemartin.com> References: <4970818E.8070602@ext.bull.net> <20090120105947.25b66481.weiny2@llnl.gov> <49763AAE.1010605@morey-chaisemartin.com> Message-ID: <20090122144333.GA3479@sashak.voltaire.com> On 21:57 Tue 20 Jan , Nicolas Morey-Chaisemartin wrote: > > Yes. For HA reasons, it is necessary for us to have a couple of OpenSM > running on the subnet. > > The plugins are managed via the OpenSM process. Each OpenSM which is started > > will load the plugins specified in the opensm.conf file. If those plugins are > > the same and would conflict, due to a common database for example, you will > > have to synchronize them yourself. Each SM will report events to all > > of it's plugins. > > > That's what I expected from reading the code. So even in STANDBY mode, > openSM will report all the counters (data, errors,...) ? If you will run PerfMgr on standby SM then yes - it will query fabric ports and will report it to plugin. > Or only traps? Most will go to master SM. 
Sasha

From chien.tin.tung at intel.com Thu Jan 22 06:50:55 2009
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Thu, 22 Jan 2009 07:50:55 -0700
Subject: [ofa-general] RE: [PATCH] RDMA/nes: Improved use of pbls
In-Reply-To:
References: <20090121171108.GA672@ctung-MOBL>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830320258BAE@azsmsx501.amr.corp.intel.com>

> > Also, fixed two places where the software pbl counts were changed
> > before the hardware was updated. This bug allowed another thread
> > to overallocate the hardware resources.
>
>This patch is big enough that I think it needs to be deferred to 2.6.30
>given where 2.6.29 is in the release cycle. However if this problem
>causes problems in practice maybe we should split it out and
>get it into
>2.6.29?

I can split the patch into two to get the two small fixes in 2.6.29.
But do consider the PBL improvements for 2.6.29 as we've been running
with this code for the last two months on our cluster without issues.

Two patches to come later today. Thanks for the feedback.

Chien

From tziporet at dev.mellanox.co.il Thu Jan 22 06:59:13 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 22 Jan 2009 16:59:13 +0200
Subject: [ofa-general] FW: [PATCH] mstvpd (resend)
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB3E74E5CFFA@MNEXMB1.qlogic.org>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74E5CFFA@MNEXMB1.qlogic.org>
Message-ID: <497889C1.6080201@mellanox.co.il>

Mike Heinz wrote:
>
> I sent this a week ago, and never got any kind of response - this is a
> patch (included both inline and as an attachment) for mstvpd. Should
> it be directed someplace else?
>
Oren (mstvpd) is on vacation
Will be back next week

Tziporet

From michael.heinz at qlogic.com Thu Jan 22 06:58:53 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Thu, 22 Jan 2009 08:58:53 -0600
Subject: [ofa-general] FW: [PATCH] mstvpd (resend)
In-Reply-To: <497889C1.6080201@mellanox.co.il>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74E5CFFA@MNEXMB1.qlogic.org> <497889C1.6080201@mellanox.co.il>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74E5D001@MNEXMB1.qlogic.org>

Understood. Thanks.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Thursday, January 22, 2009 9:59 AM
To: Mike Heinz
Cc: general at lists.openfabrics.org; Oren Kladnitsky
Subject: Re: [ofa-general] FW: [PATCH] mstvpd (resend)

Mike Heinz wrote:
>
> I sent this a week ago, and never got any kind of response - this is a
> patch (included both inline and as an attachment) for mstvpd. Should
> it be directed someplace else?
>
Oren (mstvpd) is on vacation
Will be back next week

Tziporet

From eli at mellanox.co.il Thu Jan 22 08:31:05 2009
From: eli at mellanox.co.il (Eli Cohen)
Date: Thu, 22 Jan 2009 18:31:05 +0200
Subject: [ofa-general] [PATCH v1] mlx4_ib: Optimize hugetlab pages support
Message-ID: <20090122163105.GA23496@mtls03>

Since Linux may not merge adjacent pages into a single scatter entry through
calls to dma_map_sg(), we check the special case of hugetlb pages which are
likely to be mapped to contiguous dma addresses and if they are, take advantage
of this. This will result in a significantly lower number of MTT segments used
for registering hugetlb memory regions.

Signed-off-by: Eli Cohen
---
In this version I also took care of the case where the kernel is compiled
without hugetlb support.
drivers/infiniband/hw/mlx4/mr.c | 86 ++++++++++++++++++++++++++++++++++----- 1 files changed, 75 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c index 8e4d26d..4c7a5bf 100644 --- a/drivers/infiniband/hw/mlx4/mr.c +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -119,6 +119,68 @@ out: return err; } +static int handle_hugetlb_user_mr(struct ib_pd *pd, struct mlx4_ib_mr *mr, + u64 virt_addr, int access_flags) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct mlx4_ib_dev *dev = to_mdev(pd->device); + struct ib_umem_chunk *chunk; + unsigned dsize; + dma_addr_t daddr; + unsigned uninitialized_var(cur_size); + dma_addr_t uninitialized_var(cur_addr); + int restart; + int n; + struct ib_umem *umem = mr->umem; + u64 *arr; + int err = 0; + int i; + int j = 0; + + n = PAGE_ALIGN(umem->length + umem->offset) >> HPAGE_SHIFT; + arr = kmalloc(n * sizeof *arr, GFP_KERNEL); + if (!arr) + return -ENOMEM; + + restart = 1; + list_for_each_entry(chunk, &umem->chunk_list, list) + for (i = 0; i < chunk->nmap; ++i) { + daddr = sg_dma_address(&chunk->page_list[i]); + dsize = sg_dma_len(&chunk->page_list[i]); + if (restart) { + cur_addr = daddr; + cur_size = dsize; + restart = 0; + } else if (cur_addr + cur_size != daddr) { + err = -EINVAL; + goto out; + } else + cur_size += dsize; + + if (cur_size > HPAGE_SIZE) { + err = -EINVAL; + goto out; + } else if (cur_size == HPAGE_SIZE) { + restart = 1; + arr[j++] = cur_addr; + } + } + + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, umem->length, + convert_access(access_flags), n, HPAGE_SHIFT, &mr->mmr); + if (err) + goto out; + + err = mlx4_write_mtt(dev->dev, &mr->mmr.mtt, 0, n, arr); + +out: + kfree(arr); + return err; +#else + return -ENOSYS; +#endif +} + struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata) @@ -140,17 +202,19 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, goto 
err_free; } - n = ib_umem_page_count(mr->umem); - shift = ilog2(mr->umem->page_size); - - err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, - convert_access(access_flags), n, shift, &mr->mmr); - if (err) - goto err_umem; - - err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); - if (err) - goto err_mr; + if (!mr->umem->hugetlb || handle_hugetlb_user_mr(pd, mr, virt_addr, access_flags)) { + n = ib_umem_page_count(mr->umem); + shift = ilog2(mr->umem->page_size); + + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, + convert_access(access_flags), n, shift, &mr->mmr); + if (err) + goto err_umem; + + err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); + if (err) + goto err_mr; + } err = mlx4_mr_enable(dev->dev, &mr->mmr); if (err) -- 1.6.0.5 From Sumeet.Lahorani at oracle.com Thu Jan 22 08:56:41 2009 From: Sumeet.Lahorani at oracle.com (Sumeet Lahorani) Date: Thu, 22 Jan 2009 08:56:41 -0800 Subject: [ofa-general] Does ib0 always map to port1? In-Reply-To: <1232629160.9832.2.camel@lap75545.ornl.gov> References: <49777001.3050301@oracle.com> <49782409.1080208@Voltaire.com> <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com> <1232629160.9832.2.camel@lap75545.ornl.gov> Message-ID: <4978A549.8030606@oracle.com> Thanks all. When I look at the ifconfig ib0 output, it looks as though only the first 6 bytes of the HWaddr are printed correctly and the rest are 0's. 
ib0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet6 addr: fe80::21e:bff:ff4c:8a2d/64 Scope:Link UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:7359 errors:0 dropped:0 overruns:0 frame:0 TX packets:7054 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:974584 (951.7 KiB) TX bytes:562068 (548.8 KiB) The IPv6 address seems to more closely match the GUID of the port # ibstatus Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:001e:0bff:ff4c:8a2d base lid: 0x1f sm lid: 0x4 state: 4: ACTIVE phys state: 5: LinkUp rate: 20 Gb/sec (4X DDR) But this is not an exact match either (0x21e vs 0x1e). Is there another command (apart from ifconfig) I can use to get a better match? - Sumeet David Dillow wrote: > On Thu, 2009-01-22 at 10:18 +0200, Dotan Barak wrote: > >> On Thu, Jan 22, 2009 at 9:45 AM, Or Gerlitz wrote: >> >>> Sumeet Lahorani wrote: >>> >>>> I see that ib0 always maps to port1 and ib1 always maps to port2 on the >>>> HCA. I'm trying to find out if this will always be the case and if so which >>>> script ensures this mapping? >>>> >>> Yes, on a dual ported HCA, ib0 maps to port1 and ib1 to port2, this is a >>> property of the ipoib driver regardless of which HW is used, see >>> drivers/infiniband/ulps/ipoib/ipoib_main :: ipoib_add_one() >>> >>> >> Part of the "MAC" address of the I/F is the port GUID, so you can use >> this value to determine which HCA.port is mapped to that I/Fs... >> (if there are several HCAs in the same host, this can be very useful ...) 
>>
>
> This is the better way to go for general use, as it is entirely possible
> for a user to change the device names such that ib0 swaps with ib1:
>
> ip link set ib0 name ib0_rename
> ip link set ib1 name ib0
> ip link set ib0_rename name ib1
>
> Now, it is probably not very likely that a user will do this, but
> Dotan's right that it can get fun when there are multiple HCA's, so it
> is a good idea in any event.
>

From or.gerlitz at gmail.com Thu Jan 22 09:28:24 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 22 Jan 2009 19:28:24 +0200
Subject: Re: [ofa-general] Does ib0 always map to port1?
In-Reply-To: <4978A549.8030606@oracle.com>
References: <49777001.3050301@oracle.com> <49782409.1080208@Voltaire.com> <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com> <1232629160.9832.2.camel@lap75545.ornl.gov> <4978A549.8030606@oracle.com>
Message-ID: <15ddcffd0901220928v61986fa4q8396e704b9d0a38d@mail.gmail.com>

On Thu, Jan 22, 2009 at 6:56 PM, Sumeet Lahorani wrote:
>
> Thanks all. When I look at the ifconfig ib0 output, it looks as though only
> the first 6 bytes of the HWaddr are printed correctly and the rest are 0's.

please use /sbin/ip

# ip addr show ib0
69: ib0: mtu 65520 qdisc pfifo_fast qlen 256
    link/infiniband 80:6a:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e3 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.10.0.90/16 brd 10.10.255.255 scope global ib0
    inet6 fe80::202:c903:2:6be3/64 scope link
       valid_lft forever preferred_lft forever

# ip neigh show dev ib0
10.10.5.157 lladdr 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e7 REACHABLE

Or

From dotanba at gmail.com Thu Jan 22 09:28:57 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Thu, 22 Jan 2009 19:28:57 +0200
Subject: [ofa-general] Does ib0 always map to port1?
In-Reply-To: <4978A549.8030606@oracle.com> References: <49777001.3050301@oracle.com> <49782409.1080208@Voltaire.com> <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com> <1232629160.9832.2.camel@lap75545.ornl.gov> <4978A549.8030606@oracle.com> Message-ID: <4978ACD9.4040206@gmail.com> Try to check the value of the MAC address using "ip addr" (i don't remember the full command) and not using ifconfig. This will be more accurate. Dotan Sumeet Lahorani wrote: > > Thanks all. When I look at the ifconfig ib0 output, it looks as though > only the first 6 bytes of the HWaddr are printed correctly and the > rest are 0's. > > ib0 Link encap:InfiniBand HWaddr > 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet6 addr: fe80::21e:bff:ff4c:8a2d/64 Scope:Link > UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 > RX packets:7359 errors:0 dropped:0 overruns:0 frame:0 > TX packets:7054 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:974584 (951.7 KiB) TX bytes:562068 (548.8 KiB) > > The IPv6 address seems to more closely match the GUID of the port > > # ibstatus > Infiniband device 'mlx4_0' port 1 status: > default gid: fe80:0000:0000:0000:001e:0bff:ff4c:8a2d > base lid: 0x1f > sm lid: 0x4 > state: 4: ACTIVE > phys state: 5: LinkUp > rate: 20 Gb/sec (4X DDR) > > But this is not an exact match either (0x21e vs 0x1e). Is there > another command (apart from ifconfig) I can use to get a better match? > > - Sumeet > > David Dillow wrote: >> On Thu, 2009-01-22 at 10:18 +0200, Dotan Barak wrote: >> >>> On Thu, Jan 22, 2009 at 9:45 AM, Or Gerlitz >>> wrote: >>> >>>> Sumeet Lahorani wrote: >>>> >>>>> I see that ib0 always maps to port1 and ib1 always maps to port2 >>>>> on the >>>>> HCA. I'm trying to find out if this will always be the case and if >>>>> so which >>>>> script ensures this mapping? 
>>>>>
>>>> Yes, on a dual ported HCA, ib0 maps to port1 and ib1 to port2, this
>>>> is a
>>>> property of the ipoib driver regardless of which HW is used, see
>>>> drivers/infiniband/ulps/ipoib/ipoib_main :: ipoib_add_one()
>>>>
>>>>
>>> Part of the "MAC" address of the I/F is the port GUID, so you can use
>>> this value to determine which HCA.port is mapped to that I/Fs...
>>> (if there are several HCAs in the same host, this can be very useful
>>> ...)
>>>
>>
>> This is the better way to go for general use, as it is entirely possible
>> for a user to change the device names such that ib0 swaps with ib1:
>>
>> ip link set ib0 name ib0_rename
>> ip link set ib1 name ib0
>> ip link set ib0_rename name ib1
>>
>> Now, it is probably not very likely that a user will do this, but
>> Dotan's right that it can get fun when there are multiple HCA's, so it
>> is a good idea in any event.
>>
>

From weiny2 at llnl.gov Thu Jan 22 11:23:46 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 22 Jan 2009 11:23:46 -0800
Subject: Re: [ofa-general] [PATCH V2 1/3] Create a new library libibnetdisc
In-Reply-To: <20090121103359.GA3479@sashak.voltaire.com>
References: <20081211162031.0c591f54.weiny2@llnl.gov> <1230056943.23747.21.camel@auk31.llnl.gov> <20081223184331.GL31213@obsidianresearch.com> <20081223164141.241dd3f0.weiny2@llnl.gov> <20090121103359.GA3479@sashak.voltaire.com>
Message-ID: <20090122112346.27cf5801.weiny2@llnl.gov>

On Wed, 21 Jan 2009 12:33:59 +0200 Sasha Khapyorsky wrote:
> On 16:41 Tue 23 Dec , Ira Weiny wrote:
> >
> > +#define IBND_ERROR(...) \
> > +	{ \
> > +		fprintf(stderr, "%s:%d; ", __FILE__, __LINE__); \
> > +		fprintf(stderr, __VA_ARGS__); \
> > +	}
>
> As far as I know, a macro like this (using '##' for var args and without
> breaking this in two) will work fine with both gcc and VC:
>
> #define IBND_ERROR(fmt, ...) \
> 	fprintf(stderr, "%s:%d: " fmt, __FILE__, __LINE__, ## __VA_ARGS__)
>

It was my understanding we were going for C99 compliance.
If this is not the case I will change it, I prefer the ##.

Ira

From john.russo at qlogic.com Thu Jan 22 12:37:20 2009
From: john.russo at qlogic.com (John Russo)
Date: Thu, 22 Jan 2009 14:37:20 -0600
Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x
Message-ID:

Does the release of RHEL 5.3 create any additional justification for a
maintenance release of OFED (1.4.1) to be generated? I am already hearing
requests for an OFED release that will support it.

John Russo
QLogic

From robert.j.woodruff at intel.com Thu Jan 22 12:43:47 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 22 Jan 2009 12:43:47 -0800
Subject: [ofa-general] RE: RHEL 5.3 and OFED 1.4.x
In-Reply-To:
References:
Message-ID: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com>

In the last EWG meeting, we discussed waiting a month or so and seeing
what kind of bugs were reported against 1.4 to determine if a 1.4.1
release was needed.

________________________________
From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo
Sent: Thursday, January 22, 2009 12:37 PM
To: general at lists.openfabrics.org
Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x

Does the release of RHEL 5.3 create any additional justification for a
maintenance release of OFED (1.4.1) to be generated? I am already hearing
requests for an OFED release that will support it.

John Russo
QLogic

From john.russo at qlogic.com Thu Jan 22 12:44:30 2009
From: john.russo at qlogic.com (John Russo)
Date: Thu, 22 Jan 2009 14:44:30 -0600
Subject: [ofa-general] RE: RHEL 5.3 and OFED 1.4.x
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com>
References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com>
Message-ID:

I understand but I think that this is another consideration that should be factored in.
Even if there are no "critical" PRs to fix, the introduction of RHEL 5.3 (along with less critical PRs) may be enough justification. I simply want to plant the seed in everyone's mind before our next meeting. Thanks -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Thursday, January 22, 2009 3:44 PM To: John Russo; general at lists.openfabrics.org Cc: ewg at lists.openfabrics.org Subject: RE: RHEL 5.3 and OFED 1.4.x In the last EWG meeting, we discussed waiting a month or so and seeing what kind of bugs were reported against 1.4 to determine if a 1.4.1 release was needed. ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo Sent: Thursday, January 22, 2009 12:37 PM To: general at lists.openfabrics.org Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x Does the release of RHEL 5.3 create any additional justification for a maintenance release of OFED (1.4.1) to be generated? I am already hearing requests for an OFED release that will support it. John Russo QLogic From robert.j.woodruff at intel.com Thu Jan 22 12:46:53 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 22 Jan 2009 12:46:53 -0800 Subject: [ofa-general] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F41C6FB9F9@orsmsx507.amr.corp.intel.com> Good point. Thanks for bringing it up and we can discuss it at the next EWG meeting. woody -----Original Message----- From: John Russo [mailto:john.russo at qlogic.com] Sent: Thursday, January 22, 2009 12:45 PM To: Woodruff, Robert J; general at lists.openfabrics.org Cc: ewg at lists.openfabrics.org Subject: RE: RHEL 5.3 and OFED 1.4.x I understand but I think that this is another consideration that should be factored in. 
Even if there are no "critical" PRs to fix, the introduction of RHEL 5.3 (along with less critical PRs) may be enough justification. I simply want to plant the seed in everyone's mind before our next meeting. Thanks -----Original Message----- From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] Sent: Thursday, January 22, 2009 3:44 PM To: John Russo; general at lists.openfabrics.org Cc: ewg at lists.openfabrics.org Subject: RE: RHEL 5.3 and OFED 1.4.x In the last EWG meeting, we discussed waiting a month or so and seeing what kind of bugs were reported against 1.4 to determine if a 1.4.1 release was needed. ________________________________ From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo Sent: Thursday, January 22, 2009 12:37 PM To: general at lists.openfabrics.org Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x Does the release of RHEL 5.3 create any additional justification for a maintenance release of OFED (1.4.1) to be generated? I am already hearing requests for an OFED release that will support it. John Russo QLogic From swise at opengridcomputing.com Thu Jan 22 13:45:31 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 Jan 2009 15:45:31 -0600 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> Message-ID: <4978E8FB.5040909@opengridcomputing.com> I think releasing OMPI-1.3 with iWARP support is also good justification. And there are RDS issues with ofed-1.4 even over IB that I think will add to justification. John Russo wrote: > I understand but I think that this is another consideration that should be factored in. Even if there are no "critical" PRs to fix, the introduction of RHEL 5.3 (along with less critical PRs) may be enough justification. > > I simply want to plant the seed in everyone's mind before our next meeting. 
> > Thanks > > -----Original Message----- > From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] > Sent: Thursday, January 22, 2009 3:44 PM > To: John Russo; general at lists.openfabrics.org > Cc: ewg at lists.openfabrics.org > Subject: RE: RHEL 5.3 and OFED 1.4.x > > In the last EWG meeting, we discussed waiting a month or so and seeing what kind of bugs > were reported against 1.4 to determine if a 1.4.1 release was needed. > > > ________________________________ > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo > Sent: Thursday, January 22, 2009 12:37 PM > To: general at lists.openfabrics.org > Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x > > > > Does the release of RHEL 5.3 create any additional justification for a maintenance release of OFED (1.4.1) to be generated? I am already hearing requests for an OFED release that will support it. > > > > John Russo > > QLogic > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From robert.j.woodruff at intel.com Thu Jan 22 14:01:45 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 22 Jan 2009 14:01:45 -0800 Subject: [ofa-general] RE: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <4978E8FB.5040909@opengridcomputing.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> I think that we need to discuss this in the EWG meeting. In the past I think that we have agreed to only do bug fixes in point release and not add major new features. If we do want to include the new MPI, then perhaps we should call it 1.5 and pull in the schedule for 1.5. Just a thought. 
woody -----Original Message----- From: Steve Wise [mailto:swise at opengridcomputing.com] Sent: Thursday, January 22, 2009 1:46 PM To: John Russo Cc: Woodruff, Robert J; general at lists.openfabrics.org; ewg at lists.openfabrics.org Subject: Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x I think releasing OMPI-1.3 with iWARP support is also good justification. And there are RDS issues with ofed-1.4 even over IB that I think will add to justification. John Russo wrote: > I understand but I think that this is another consideration that should be factored in. Even if there are no "critical" PRs to fix, the introduction of RHEL 5.3 (along with less critical PRs) may be enough justification. > > I simply want to plant the seed in everyone's mind before our next meeting. > > Thanks > > -----Original Message----- > From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] > Sent: Thursday, January 22, 2009 3:44 PM > To: John Russo; general at lists.openfabrics.org > Cc: ewg at lists.openfabrics.org > Subject: RE: RHEL 5.3 and OFED 1.4.x > > In the last EWG meeting, we discussed waiting a month or so and seeing what kind of bugs > were reported against 1.4 to determine if a 1.4.1 release was needed. > > > ________________________________ > > From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo > Sent: Thursday, January 22, 2009 12:37 PM > To: general at lists.openfabrics.org > Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x > > > > Does the release of RHEL 5.3 create any additional justification for a maintenance release of OFED (1.4.1) to be generated? I am already hearing requests for an OFED release that will support it. 
> > > > John Russo > > QLogic > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From swise at opengridcomputing.com Thu Jan 22 14:07:10 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 22 Jan 2009 16:07:10 -0600 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> Message-ID: <4978EE0E.5050209@opengridcomputing.com> I understand the desire to not release new features in a point release, but at the same time, these features are ready or near ready now. And prior features have definitely been released in point releases. (connectX for example). Another key point is that these features do not need the kernel rebase that will happen with ofed-1.5, which will take months... Just more thoughts. :) Steve. Woodruff, Robert J wrote: > I think that we need to discuss this in the EWG meeting. > In the past I think that we have agreed to only do bug fixes > in point release and not add major new features. > If we do want to include the new MPI, then perhaps we should call > it 1.5 and pull in the schedule for 1.5. Just a thought. > > woody > > > -----Original Message----- > From: Steve Wise [mailto:swise at opengridcomputing.com] > Sent: Thursday, January 22, 2009 1:46 PM > To: John Russo > Cc: Woodruff, Robert J; general at lists.openfabrics.org; ewg at lists.openfabrics.org > Subject: Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x > > I think releasing OMPI-1.3 with iWARP support is also good justification. > > And there are RDS issues with ofed-1.4 even over IB that I think will > add to justification. 
> > > John Russo wrote: > >> I understand but I think that this is another consideration that should be factored in. Even if there are no "critical" PRs to fix, the introduction of RHEL 5.3 (along with less critical PRs) may be enough justification. >> >> I simply want to plant the seed in everyone's mind before our next meeting. >> >> Thanks >> >> -----Original Message----- >> From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] >> Sent: Thursday, January 22, 2009 3:44 PM >> To: John Russo; general at lists.openfabrics.org >> Cc: ewg at lists.openfabrics.org >> Subject: RE: RHEL 5.3 and OFED 1.4.x >> >> In the last EWG meeting, we discussed waiting a month or so and seeing what kind of bugs >> were reported against 1.4 to determine if a 1.4.1 release was needed. >> >> >> ________________________________ >> >> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo >> Sent: Thursday, January 22, 2009 12:37 PM >> To: general at lists.openfabrics.org >> Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x >> >> >> >> Does the release of RHEL 5.3 create any additional justification for a maintenance release of OFED (1.4.1) to be generated? I am already hearing requests for an OFED release that will support it. 
>> >> >> >> John Russo >> >> QLogic >> >> _______________________________________________ >> ewg mailing list >> ewg at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg >> >> From jsquyres at cisco.com Thu Jan 22 14:16:47 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 22 Jan 2009 17:16:47 -0500 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <4978EE0E.5050209@opengridcomputing.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> Message-ID: Also, FWIW, it has been discussed (and agreed, I thought) to include OMPI v1.3 in a 1.4.x release. On Jan 22, 2009, at 5:07 PM, Steve Wise wrote: > > I understand the desire to not release new features in a point > release, but at the same time, these features are ready or near > ready now. And prior features have definitely been released in > point releases. (connectX for example). Another key point is that > these features do not need the kernel rebase that will happen with > ofed-1.5, which will take months... > > Just more thoughts. :) > > Steve. > > > Woodruff, Robert J wrote: >> I think that we need to discuss this in the EWG meeting. >> In the past I think that we have agreed to only do bug fixes >> in point release and not add major new features. >> If we do want to include the new MPI, then perhaps we should call >> it 1.5 and pull in the schedule for 1.5. Just a thought. >> >> woody >> >> >> -----Original Message----- >> From: Steve Wise [mailto:swise at opengridcomputing.com] >> Sent: Thursday, January 22, 2009 1:46 PM >> To: John Russo >> Cc: Woodruff, Robert J; general at lists.openfabrics.org; ewg at lists.openfabrics.org >> Subject: Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x >> >> I think releasing OMPI-1.3 with iWARP support is also good >> justification. 
>> >> And there are RDS issues with ofed-1.4 even over IB that I think will >> add to justification. >> >> >> John Russo wrote: >> >>> I understand but I think that this is another consideration that >>> should be factored in. Even if there are no "critical" PRs to >>> fix, the introduction of RHEL 5.3 (along with less critical PRs) >>> may be enough justification. >>> >>> I simply want to plant the seed in everyone's mind before our next >>> meeting. >>> >>> Thanks >>> >>> -----Original Message----- >>> From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] >>> Sent: Thursday, January 22, 2009 3:44 PM >>> To: John Russo; general at lists.openfabrics.org >>> Cc: ewg at lists.openfabrics.org >>> Subject: RE: RHEL 5.3 and OFED 1.4.x >>> >>> In the last EWG meeting, we discussed waiting a month or so and >>> seeing what kind of bugs >>> were reported against 1.4 to determine if a 1.4.1 release was >>> needed. >>> >>> >>> ________________________________ >>> >>> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org >>> ] On Behalf Of John Russo >>> Sent: Thursday, January 22, 2009 12:37 PM >>> To: general at lists.openfabrics.org >>> Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x >>> >>> >>> >>> Does the release of RHEL 5.3 create any additional justification >>> for a maintenance release of OFED (1.4.1) to be generated? I am >>> already hearing requests for an OFED release that will support it. 
>>> >>> >>> >>> John Russo >>> >>> QLogic >>> >>> _______________________________________________ >>> ewg mailing list >>> ewg at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg >>> >>> > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Jeff Squyres Cisco Systems From robert.j.woodruff at intel.com Thu Jan 22 14:41:46 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 22 Jan 2009 14:41:46 -0800 Subject: [ofa-general] RE: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F41C6FBC34@orsmsx507.amr.corp.intel.com> Personally I do not have a problem with including it, since MPI is an isolated component and does not effect the core stack, but I thought that we had discussed in Sonoma last year not including major new features in point releases to reduce the QA that is needed. And, in general I think that is the way that kernel.org works, point releases are just for bug fixes. In any case, lets discuss it again in the EWG on Monday. woody -----Original Message----- From: Jeff Squyres [mailto:jsquyres at cisco.com] Sent: Thursday, January 22, 2009 2:17 PM To: Steve Wise Cc: Woodruff, Robert J; general at lists.openfabrics.org; ewg at lists.openfabrics.org Subject: Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x Also, FWIW, it has been discussed (and agreed, I thought) to include OMPI v1.3 in a 1.4.x release. On Jan 22, 2009, at 5:07 PM, Steve Wise wrote: > > I understand the desire to not release new features in a point > release, but at the same time, these features are ready or near > ready now. 
And prior features have definitely been released in > point releases. (connectX for example). Another key point is that > these features do not need the kernel rebase that will happen with > ofed-1.5, which will take months... > > Just more thoughts. :) > > Steve. > > > Woodruff, Robert J wrote: >> I think that we need to discuss this in the EWG meeting. >> In the past I think that we have agreed to only do bug fixes >> in point release and not add major new features. >> If we do want to include the new MPI, then perhaps we should call >> it 1.5 and pull in the schedule for 1.5. Just a thought. >> >> woody >> >> >> -----Original Message----- >> From: Steve Wise [mailto:swise at opengridcomputing.com] >> Sent: Thursday, January 22, 2009 1:46 PM >> To: John Russo >> Cc: Woodruff, Robert J; general at lists.openfabrics.org; ewg at lists.openfabrics.org >> Subject: Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x >> >> I think releasing OMPI-1.3 with iWARP support is also good >> justification. >> >> And there are RDS issues with ofed-1.4 even over IB that I think will >> add to justification. >> >> >> John Russo wrote: >> >>> I understand but I think that this is another consideration that >>> should be factored in. Even if there are no "critical" PRs to >>> fix, the introduction of RHEL 5.3 (along with less critical PRs) >>> may be enough justification. >>> >>> I simply want to plant the seed in everyone's mind before our next >>> meeting. >>> >>> Thanks >>> >>> -----Original Message----- >>> From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] >>> Sent: Thursday, January 22, 2009 3:44 PM >>> To: John Russo; general at lists.openfabrics.org >>> Cc: ewg at lists.openfabrics.org >>> Subject: RE: RHEL 5.3 and OFED 1.4.x >>> >>> In the last EWG meeting, we discussed waiting a month or so and >>> seeing what kind of bugs >>> were reported against 1.4 to determine if a 1.4.1 release was >>> needed. 
>>> >>> >>> ________________________________ >>> >>> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org >>> ] On Behalf Of John Russo >>> Sent: Thursday, January 22, 2009 12:37 PM >>> To: general at lists.openfabrics.org >>> Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x >>> >>> >>> >>> Does the release of RHEL 5.3 create any additional justification >>> for a maintenance release of OFED (1.4.1) to be generated? I am >>> already hearing requests for an OFED release that will support it. >>> >>> >>> >>> John Russo >>> >>> QLogic >>> >>> _______________________________________________ >>> ewg mailing list >>> ewg at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg >>> >>> > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Jeff Squyres Cisco Systems From jgunthorpe at obsidianresearch.com Thu Jan 22 16:09:32 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 22 Jan 2009 17:09:32 -0700 Subject: [ofa-general] Does ib0 always map to port1? In-Reply-To: <4978ACD9.4040206@gmail.com> References: <49777001.3050301@oracle.com> <49782409.1080208@Voltaire.com> <2f3bf9a60901220018t91a1615tf73b4efdecb4f3c7@mail.gmail.com> <1232629160.9832.2.camel@lap75545.ornl.gov> <4978A549.8030606@oracle.com> <4978ACD9.4040206@gmail.com> Message-ID: <20090123000932.GS7618@obsidianresearch.com> On Thu, Jan 22, 2009 at 07:28:57PM +0200, Dotan Barak wrote: > Try to check the value of the MAC address using "ip addr" (i don't remember > the full command) and not using ifconfig. > This will be more accurate. ip link show ib0 You can also just look in /sys/class/net/ib0/address and /sys/class/net/ib0/device will at least tell you which HCA has the port. I wonder if it would be hard to add a sysfs link to something like /sys/class/infiniband/mthca0/ports/1/ ?? 
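[Editorial sketch] The sysfs lookup described above can be combined with the earlier remark that part of the IPoIB "MAC" address is the port GUID. The block below is a minimal sketch, assuming the usual 20-byte IPoIB hardware address layout (4 bytes of flags and QPN followed by a 16-byte GID whose low 8 bytes are the port GUID). The function names are hypothetical and are not part of any OFED library.

```c
#include <stdint.h>
#include <stdio.h>

#define IPOIB_HW_ADDR_LEN 20	/* bytes in an IPoIB link-layer address */

/*
 * Parse "xx:xx:...:xx" (20 hex bytes) and return the last 8 bytes as a
 * 64-bit port GUID. Returns 0 on success, -1 on malformed input.
 */
static int ipoib_addr_to_port_guid(const char *addr, uint64_t *guid)
{
	unsigned int byte;
	uint64_t g = 0;
	int i, consumed;

	for (i = 0; i < IPOIB_HW_ADDR_LEN; i++) {
		if (sscanf(addr, "%2x%n", &byte, &consumed) != 1 || consumed != 2)
			return -1;
		addr += consumed;
		if (i >= IPOIB_HW_ADDR_LEN - 8)
			g = (g << 8) | byte;	/* accumulate the GUID bytes */
		if (i < IPOIB_HW_ADDR_LEN - 1) {
			if (*addr != ':')
				return -1;
			addr++;
		}
	}
	*guid = g;
	return 0;
}

/* Read /sys/class/net/<ifname>/address and extract the port GUID. */
static int ipoib_read_port_guid(const char *ifname, uint64_t *guid)
{
	char path[128], line[128];
	FILE *f;
	int rc;

	snprintf(path, sizeof(path), "/sys/class/net/%s/address", ifname);
	f = fopen(path, "r");
	if (!f)
		return -1;
	rc = fgets(line, sizeof(line), f) ?
		ipoib_addr_to_port_guid(line, guid) : -1;
	fclose(f);
	return rc;
}
```

The resulting value could then be compared against the port GUIDs reported for each HCA port, which keeps working even if the ibN interface names have been swapped.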
Jason

From rdreier at cisco.com Thu Jan 22 21:07:41 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 22 Jan 2009 21:07:41 -0800
Subject: [ofa-general] Re: [PATCH v1] mlx4_ib: Optimize hugetlab pages support
In-Reply-To: <20090122163105.GA23496@mtls03> (Eli Cohen's message of "Thu, 22 Jan 2009 18:31:05 +0200")
References: <20090122163105.GA23496@mtls03>
Message-ID:

OK, looks better. However the patch had a bunch of whitespace problems
(run checkpatch.pl to see them). Also:

> +static int handle_hugetlb_user_mr(struct ib_pd *pd, struct mlx4_ib_mr *mr,
> +				  u64 virt_addr, int access_flags)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct mlx4_ib_dev *dev = to_mdev(pd->device);
> +	struct ib_umem_chunk *chunk;
> +	unsigned dsize;
> +	dma_addr_t daddr;
> +	unsigned uninitialized_var(cur_size);
> +	dma_addr_t uninitialized_var(cur_addr);
> +	int restart;
> +	int n;
> +	struct ib_umem *umem = mr->umem;
> +	u64 *arr;
> +	int err = 0;
> +	int i;
> +	int j = 0;
> +
> +	n = PAGE_ALIGN(umem->length + umem->offset) >> HPAGE_SHIFT;

seems this might underestimate by 1 if the region doesn't start/end on
a huge-page aligned boundary (but we would still want to use big pages
to register it).
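[Editorial sketch] The off-by-one concern above can be checked with plain arithmetic. The block below is a userspace illustration only, not kernel code: the DEMO_* constants assume 4 KiB base pages and 2 MiB huge pages, and it assumes that umem->offset is the offset of the region start within its first base page. Both function names are hypothetical.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT		12	/* assumed 4 KiB base pages */
#define DEMO_HPAGE_SHIFT	21	/* assumed 2 MiB huge pages */
#define DEMO_PAGE_SIZE		(UINT64_C(1) << DEMO_PAGE_SHIFT)

/* Mirrors the quoted expression: PAGE_ALIGN(length + offset) >> HPAGE_SHIFT. */
static uint64_t naive_hpage_count(uint64_t start, uint64_t length)
{
	uint64_t offset = start & (DEMO_PAGE_SIZE - 1);
	uint64_t aligned = (length + offset + DEMO_PAGE_SIZE - 1) &
			   ~(DEMO_PAGE_SIZE - 1);

	return aligned >> DEMO_HPAGE_SHIFT;
}

/* Number of huge pages actually touched by [start, start + length). */
static uint64_t spanned_hpage_count(uint64_t start, uint64_t length)
{
	uint64_t first, last;

	if (length == 0)
		return 0;
	first = start >> DEMO_HPAGE_SHIFT;
	last = (start + length - 1) >> DEMO_HPAGE_SHIFT;
	return last - first + 1;
}
```

For a 2 MiB region that starts 1 MiB into a huge page, the first function yields 1 while the region touches 2 huge pages, which is the kind of underestimate the review comment describes. For a region that starts and ends on huge-page boundaries the two counts agree.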
> +	arr = kmalloc(n * sizeof *arr, GFP_KERNEL);
> +	if (!arr)
> +		return -ENOMEM;
> +
> +	restart = 1;
> +	list_for_each_entry(chunk, &umem->chunk_list, list)
> +		for (i = 0; i < chunk->nmap; ++i) {
> +			daddr = sg_dma_address(&chunk->page_list[i]);
> +			dsize = sg_dma_len(&chunk->page_list[i]);
> +			if (restart) {
> +				cur_addr = daddr;
> +				cur_size = dsize;
> +				restart = 0;
> +			} else if (cur_addr + cur_size != daddr) {
> +				err = -EINVAL;
> +				goto out;
> +			} else
> +				cur_size += dsize;
> +
> +			if (cur_size > HPAGE_SIZE) {
> +				err = -EINVAL;
> +				goto out;
> +			} else if (cur_size == HPAGE_SIZE) {
> +				restart = 1;
> +				arr[j++] = cur_addr;
> +			}
> +		}

I think we could avoid the uninitialized_var() stuff and having
restart at all by just doing cur_size = 0 at the start of the loop,
and then instead of if (restart) just test if cur_size is 0.

 - R.

From rdreier at cisco.com Thu Jan 22 21:10:07 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 22 Jan 2009 21:10:07 -0800
Subject: [ofa-general] Re: [PATCH] RDMA/nes: Improved use of pbls
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA3830320258BAE@azsmsx501.amr.corp.intel.com> (Chien Tin Tung's message of "Thu, 22 Jan 2009 07:50:55 -0700")
References: <20090121171108.GA672@ctung-MOBL> <60BEFF3FBD4C6047B0F13F205CAFA3830320258BAE@azsmsx501.amr.corp.intel.com>
Message-ID:

> I can split the patch into two to get the two small fixes in 2.6.29.
> But do consider the PBL improvements for 2.6.29 as we've been running
> with this code for the last two months on our cluster without issues.

If you had the code two months ago, why not send it then, or even a
month ago, before the 2.6.29 merge window opened? Right now patches
going into 2.6.29 need to be "fixes" not "improvements".

 - R.
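[Editorial sketch] Returning to the mlx4_ib hugetlb review above: the suggestion to drop the restart flag and test cur_size against 0 can be shown on a plain array of DMA segments. This is a userspace illustration under stated assumptions, not the revised patch. struct demo_seg and collect_huge_pages() are hypothetical names, the scatterlist walk is replaced by a simple array, and a trailing partial run is ignored because the quoted hunk does not show how the patch handles it.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one DMA-mapped scatterlist entry (address and length). */
struct demo_seg {
	uint64_t addr;
	uint64_t len;
};

/*
 * Merge contiguous segments into huge pages of hpage_size bytes and store
 * each huge page's start address in out[]. Returns the number of huge pages
 * found, or -1 if the segments are not contiguous, overshoot a huge page
 * boundary, or do not fit in out[].
 */
static int collect_huge_pages(const struct demo_seg *segs, size_t nsegs,
			      uint64_t hpage_size, uint64_t *out,
			      size_t max_out)
{
	uint64_t cur_addr = 0;
	uint64_t cur_size = 0;	/* 0 means "no run in progress" */
	size_t i, n = 0;

	for (i = 0; i < nsegs; i++) {
		if (cur_size == 0)
			cur_addr = segs[i].addr;	/* start a new run */
		else if (cur_addr + cur_size != segs[i].addr)
			return -1;			/* gap inside a run */
		cur_size += segs[i].len;

		if (cur_size > hpage_size)
			return -1;
		if (cur_size == hpage_size) {
			if (n == max_out)
				return -1;
			out[n++] = cur_addr;
			cur_size = 0;			/* run complete */
		}
	}
	return (int)n;
}
```

Because cur_size doubles as the "restart" state and both variables start initialized, neither the flag nor the uninitialized_var() annotations are needed in this form.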
From jeff at splitrockpr.com Thu Jan 22 21:25:14 2009 From: jeff at splitrockpr.com (Jeffrey Scott) Date: Thu, 22 Jan 2009 21:25:14 -0800 Subject: [ofa-general] invitation to OFA Sonoma Workshop Message-ID: The OFA is hosting the 5th Annual International Sonoma Workshop from March 22-25 at The Lodge at Sonoma. To kickoff the event, OFA is hosting a special presentation on the evening of March 22. Registration for the Sonoma Workshop costs $595. An Early Bird discount is available through February 23. The discounted rate is $495. You may register for the event and book a hotel room here . An invitation to the event is attached. Please forward this to co-workers, professional acquaintances, customers, partners and anyone else who might be interested in attending the event. ----------------------------------- Jeffrey Scott Split Rock Communications 408-884-4017 408-348-3651 Mobile 408-884-3900 Fax www.SplitRockPR.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Sonoma 2009 Invite.pdf Type: application/pdf Size: 185685 bytes Desc: not available URL: From vlad at lists.openfabrics.org Fri Jan 23 03:12:24 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 23 Jan 2009 03:12:24 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090123-0200 daily build status Message-ID: <20090123111224.78073E60FC5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 
Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From devesh28 at gmail.com Fri Jan 23 06:16:19 2009 From: devesh28 at gmail.com (Devesh Sharma) Date: Fri, 23 Jan 2009 19:46:19 +0530 Subject: ***SPAM*** Re: [ofa-general] invitation to OFA Sonoma Workshop In-Reply-To: References: Message-ID: <309a667c0901230616r79e9aa3cnbf74095931d5b296@mail.gmail.com> Hello list, I have a query regarding paper publication, Is it possible to publish some work related to IB in this workshop? On Fri, Jan 23, 2009 at 10:55 AM, Jeffrey Scott wrote: > The OFA is hosting the 5th Annual International Sonoma Workshop from *March > 22-25* at The Lodge at Sonoma. To kickoff the event, OFA is hosting a special > presentation on the evening of March 22. Registration for the Sonoma > Workshop costs $595. *An Early Bird discount is available through > February 2**3.* *The discounted rate is $495**.* You may register for > the event and book a hotel room here > . > > > > An invitation to the event is attached. Please forward this to co-workers, > professional acquaintances, customers, partners and anyone else who might > be interested in attending the event. > > > > > > > > ----------------------------------- > > Jeffrey Scott > > Split Rock Communications > > > > 408-884-4017 > > 408-348-3651 Mobile > > 408-884-3900 Fax > > www.SplitRockPR.com > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From brian at sun.com Fri Jan 23 10:42:32 2009
From: brian at sun.com (Brian J. Murrell)
Date: Fri, 23 Jan 2009 13:42:32 -0500
Subject: [ofa-general] OFED 1.4's autoconf.h conflicting with kernel
Message-ID: <1232736152.7440.127.camel@pc.interlinx.bc.ca>

So, I noticed that OFED 1.4 is bringing a whole lot more into
its /usr/src/ofa_kernel/include/linux/autoconf.h than 1.3.1 did. I
find these additional defines with 1.4:

#undef CONFIG_SUNRPC_XPRT_RDMA
#undef CONFIG_SUNRPC
#undef CONFIG_SUNRPC_GSS
#undef CONFIG_RPCSEC_GSS_KRB5
#undef CONFIG_RPCSEC_GSS_SPKM3
#undef CONFIG_NFS_FS
#undef CONFIG_NFS_V3
#undef CONFIG_NFS_V3_ACL
#undef CONFIG_NFS_V4
#undef CONFIG_NFS_ACL_SUPPORT
#undef CONFIG_NFS_DIRECTIO
#undef CONFIG_SYSCTL
#undef CONFIG_EXPORTFS
#undef CONFIG_LOCKD
#undef CONFIG_LOCKD_V4
#undef CONFIG_NFSD
#undef CONFIG_NFSD_V2_ACL
#undef CONFIG_NFSD_V3
#undef CONFIG_NFSD_V3_ACL
#undef CONFIG_NFSD_V4
#undef CONFIG_NFSD_RDMA

amongst a few others that I am far less concerned with.

The problem is that these are in direct conflict with what I have
chosen for my kernel build.

Some research has led me to a message
(http://www.mail-archive.com/general at lists.openfabrics.org/msg18161.html)
from Jeff Becker back on Thu, 10 Jul 2008 15:58:53 -0700 in which he
submitted a patch to integrate NFSRDMA into OFED 1.4, which is what
appears to have brought these changes into OFED 1.4.

So, I guess I am wondering how I can build OFED 1.4, leaving out the
NFSRDMA stuff and yet not override my kernel's config settings for all
of the above variables? IOW, I don't want any of those #undefs in the
OFED autoconf.h overriding the ones in my kernel's autoconf.h.

It seems with OFED the only two options for NFS are NFSRDMA or no NFS at
all. There is no "use the kernel's NFS, unmodified" option.

Beyond all of the NFSRDMA stuff, why is CONFIG_SYSCTL being drawn into
all of this?

Any insights would be appreciated.

b.
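[Editorial sketch] The effect described in this message can be reproduced with the preprocessor alone. The block below is a self-contained illustration, not the real headers: the two marked sections merely stand in for the kernel's autoconf.h and the OFED autoconf.h, and it assumes the OFED lines are processed after the kernel's definitions. The function names are hypothetical.

```c
/* --- stand-in for the kernel's include/linux/autoconf.h --- */
#define CONFIG_SYSCTL 1
#define CONFIG_NFS_FS 1
#define CONFIG_MODULES 1

/* --- stand-in for the OFED autoconf.h, seen later by the compiler --- */
#undef CONFIG_SYSCTL
#undef CONFIG_NFS_FS

/* --- third-party module code built against both --- */
static int sysctl_enabled(void)
{
#ifdef CONFIG_SYSCTL
	return 1;
#else
	return 0;	/* taken: the later #undef wins */
#endif
}

static int modules_enabled(void)
{
#ifdef CONFIG_MODULES
	return 1;	/* untouched symbols keep the kernel's setting */
#else
	return 0;
#endif
}
```

Any code compiled after the second section sees CONFIG_SYSCTL as undefined even though the kernel was configured with it, which matches the breakage reported for modules built against both trees.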
From Jeffrey.C.Becker at nasa.gov Fri Jan 23 11:01:49 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Fri, 23 Jan 2009 11:01:49 -0800 Subject: [ofa-general] OFED 1.4's autoconf.h conflicting with kernel In-Reply-To: <1232736152.7440.127.camel@pc.interlinx.bc.ca> References: <1232736152.7440.127.camel@pc.interlinx.bc.ca> Message-ID: <497A141D.90008@nasa.gov> Hi Brian Brian J. Murrell wrote: > So, I noticed that OFED 1.4 is bringing a whole lot more into > it's /usr/src/ofa_kernel/include/linux/autoconf.h than 1.3.1 did. I > find these additional defines with 1.4: > > #undef CONFIG_SUNRPC_XPRT_RDMA > #undef CONFIG_SUNRPC > #undef CONFIG_SUNRPC_GSS > #undef CONFIG_RPCSEC_GSS_KRB5 > #undef CONFIG_RPCSEC_GSS_SPKM3 > #undef CONFIG_NFS_FS > #undef CONFIG_NFS_V3 > #undef CONFIG_NFS_V3_ACL > #undef CONFIG_NFS_V4 > #undef CONFIG_NFS_ACL_SUPPORT > #undef CONFIG_NFS_DIRECTIO > #undef CONFIG_SYSCTL > #undef CONFIG_EXPORTFS > #undef CONFIG_LOCKD > #undef CONFIG_LOCKD_V4 > #undef CONFIG_NFSD > #undef CONFIG_NFSD_V2_ACL > #undef CONFIG_NFSD_V3 > #undef CONFIG_NFSD_V3_ACL > #undef CONFIG_NFSD_V4 > #undef CONFIG_NFSD_RDMA > > amongst a few others that I am far less concerned with. > > The problem is, that these are in direct conflict with what I have > chosen for my kernel build. > > Some research has led me to a message > (http://www.mail-archive.com/general at lists.openfabrics.org/msg18161.html) from Jeff Becker back on Thu, 10 Jul 2008 15:58:53 -0700 in which he submitted a patch to integrate NFSRDMA into OFED 1.4 which is what appears to have brought these changes into OFED 1.4. > > Yup - that was me. > So, I guess I am wondering how can I build OFED 1.4, leaving out the > NFSRDMA stuff and yet not override my kernel's config settings for all > of the above variables? IOW, I don't want any of those #undefs in the > OFED autoconf.h overriding the ones in my kernel's autoconf.h > > It seems with OFED the only two options for NFS are NFSRDMA or no NFS at > all. 
There is no "use the kernel's NFS, unmodified" option. > I usually build my kernel first (usually with NFS). Then I build OFED. If I turn off NFS(RDMA), it doesn't affect my kernel. I don't do this with the install.pl script, but rather by hand (running configure && make && make install in my OFED directory). > Beyond all of the NFSRDMA stuff, why is CONFIG_SYSCTL being drawn into > all of this? > I thought I needed it for proper configuration of NFSRDMA. > Any insights would be appreciated. > > b. > Please let me know if there are any other questions or issues. Thanks. -jeff > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From brian at sun.com Fri Jan 23 11:12:30 2009 From: brian at sun.com (Brian J. Murrell) Date: Fri, 23 Jan 2009 14:12:30 -0500 Subject: [ofa-general] OFED 1.4's autoconf.h conflicting with kernel In-Reply-To: <497A141D.90008@nasa.gov> References: <1232736152.7440.127.camel@pc.interlinx.bc.ca> <497A141D.90008@nasa.gov> Message-ID: <1232737950.7440.136.camel@pc.interlinx.bc.ca> On Fri, 2009-01-23 at 11:01 -0800, Jeff Becker wrote: > Hi Brian Hi Jeff, > Yup - that was me. /me waves jeff. :-) > I usually build my kernel first (usually with NFS). Then I build OFED. Same here. > If I turn off NFS(RDMA), it doesn't affect my kernel. No, but it will affect the next thing (i.e. kernel module(s)) you build with your kernel source and kernel-ib-devel. Perhaps I am doing this wrongly, but I have the kernel-ib-devel paths first (i.e. before the kernel ones) so the autoconf.h is being picked up there and the item of fallout I noticed first was that the bits in my module that wanted CONFIG_SYSCTL defined, broke. My kernel's autoconf.h defines it, but the OFED one (without nfsrdma) undefines it. 
> I don't do this > with the install.pl script, but rather by hand (running configure && > make && make install in my OFED directory). Yeah. I rpmbuild --rebuild the SRPM. Kernel and kernel-ib[-devel] go fine. It just this third module I want to build against them both that goes blooey. > I thought I needed it for proper configuration of NFSRDMA. It could be. But the problem is that it's used for all sorts of other stuff and #undef'ing it in the OFED autoconf.h contradicts what the kernel definition wants. Thots? b. From Jeffrey.C.Becker at nasa.gov Fri Jan 23 11:31:41 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Fri, 23 Jan 2009 11:31:41 -0800 Subject: [ofa-general] OFED 1.4's autoconf.h conflicting with kernel In-Reply-To: <1232737950.7440.136.camel@pc.interlinx.bc.ca> References: <1232736152.7440.127.camel@pc.interlinx.bc.ca> <497A141D.90008@nasa.gov> <1232737950.7440.136.camel@pc.interlinx.bc.ca> Message-ID: <497A1B1D.4050808@nasa.gov> Brian J. Murrell wrote: > On Fri, 2009-01-23 at 11:01 -0800, Jeff Becker wrote: > >> Hi Brian >> > > Hi Jeff, > > >> Yup - that was me. >> > > /me waves jeff. :-) > > >> I usually build my kernel first (usually with NFS). Then I build OFED. >> > > Same here. > > >> If I turn off NFS(RDMA), it doesn't affect my kernel. >> > > No, but it will affect the next thing (i.e. kernel module(s)) you build > with your kernel source and kernel-ib-devel. Perhaps I am doing this > wrongly, but I have the kernel-ib-devel paths first (i.e. before the > kernel ones) so the autoconf.h is being picked up there and the item of > fallout I noticed first was that the bits in my module that wanted > CONFIG_SYSCTL defined, broke. My kernel's autoconf.h defines it, but > the OFED one (without nfsrdma) undefines it. > > >> I don't do this >> with the install.pl script, but rather by hand (running configure && >> make && make install in my OFED directory). >> > > Yeah. I rpmbuild --rebuild the SRPM. Kernel and kernel-ib[-devel] go > fine. 
It's just this third module I want to build against them both that > goes blooey. > > >> I thought I needed it for proper configuration of NFSRDMA. >> > > It could be. But the problem is that it's used for all sorts of other > stuff and #undef'ing it in the OFED autoconf.h contradicts what the > kernel definition wants. > > Thots? > I put it in, because I thought it would fix some trouble configuring sysctls and rpc debugging properly. I can check again to see if it's really needed. -jeff > b. > From brian at sun.com Fri Jan 23 12:01:42 2009 From: brian at sun.com (Brian J. Murrell) Date: Fri, 23 Jan 2009 15:01:42 -0500 Subject: [ofa-general] OFED 1.4's autoconf.h conflicting with kernel In-Reply-To: <497A1B1D.4050808@nasa.gov> References: <1232736152.7440.127.camel@pc.interlinx.bc.ca> <497A141D.90008@nasa.gov> <1232737950.7440.136.camel@pc.interlinx.bc.ca> <497A1B1D.4050808@nasa.gov> Message-ID: <1232740902.7440.153.camel@pc.interlinx.bc.ca> On Fri, 2009-01-23 at 11:31 -0800, Jeff Becker wrote: > I put it in, because I thought it would fix some trouble configuring > sysctls and rpc debugging properly. Certainly, no argument there. > I can check again to see if it's > really needed. I don't have a problem with enabling nfsrdma also enabling CONFIG_SYSCTL. My issue is with disabling of config items. In fact I've been scratching my brain trying to figure out how this linux/autoconf.h stuff is supposed to work with third-party modules. Here's my use case... I've built a kernel, now I build kernel-ib[-devel] against it with some chosen configuration items. Let's say for argument's sake I have chosen to disable nfsrdma in the build. Now I want to build another module which wants to use the OFED stack.
It needs to pick up the OFED headers, of course, so I put: -I/usr/src/ofa_kernel/include/ at the beginning of my include search path -- because I want my module to find the new OFED headers before any that might be in my kernel source tree, yes? So now how does a module that wants to conditionally compile code based on a define in the kernel's linux/autoconf.h find the kernel definition of that file? i.e. --- module.c --- ... #include #include ... #ifdef CONFIG_SOMETHING_OR_OTHER #endif } ... ---------------- The compile command for that module has -I/usr/src/ofa_kernel/include/ on it, meaning it will find /usr/src/ofa_kernel/include/linux/autoconf.h, not the one in the kernel's source tree. Thots? b. From chien.tin.tung at intel.com Fri Jan 23 13:24:45 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Fri, 23 Jan 2009 15:24:45 -0600 Subject: [ofa-general] [PATCH v2] RDMA/nes: Account for freed pbl after hw operation Message-ID: <20090123212445.GA6248@ctung-MOBL> From: Don Wood Fix occurrences where the software pbl counts were changed before the hardware was updated. This bug allowed another thread to overallocate the hardware resources. Add proper pbl accounting in case nes_reg_mr failed. Signed-off-by: Don Wood --- V2 change: Split bug fixes from "Improved use of pbls" patch.
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 4cfb4d9..e53f1ea 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -551,6 +551,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) struct nes_device *nesdev = nesvnic->nesdev; struct nes_adapter *nesadapter = nesdev->nesadapter; int i = 0; + int rc; /* free the resources */ if (nesfmr->leaf_pbl_cnt == 0) { @@ -572,6 +573,8 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) nesmr->ibmw.rkey = ibfmr->rkey; nesmr->ibmw.uobject = NULL; + rc = nes_dealloc_mw(&nesmr->ibmw); + if (nesfmr->nesmr.pbls_used != 0) { spin_lock_irqsave(&nesadapter->pbl_lock, flags); if (nesfmr->nesmr.pbl_4k) { @@ -584,7 +587,7 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); } - return nes_dealloc_mw(&nesmr->ibmw); + return rc; } @@ -1993,7 +1996,16 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, stag, ret, cqp_request->major_code, cqp_request->minor_code); major_code = cqp_request->major_code; nes_put_cqp_request(nesdev, cqp_request); - + if ((!ret || major_code) && pbl_count != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if (pbl_count > 1) + nesadapter->free_4kpbl += pbl_count+1; + else if (residual_page_count > 32) + nesadapter->free_4kpbl += pbl_count; + else + nesadapter->free_256pbl += pbl_count; + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } if (!ret) return -ETIME; else if (major_code) @@ -2607,24 +2619,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr) cqp_request->waiting = 1; cqp_wqe = &cqp_request->cqp_wqe; - spin_lock_irqsave(&nesadapter->pbl_lock, flags); - if (nesmr->pbls_used != 0) { - if (nesmr->pbl_4k) { - nesadapter->free_4kpbl += nesmr->pbls_used; - if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) { - printk(KERN_ERR PFX "free 4KB PBLs(%u) has exceeded the max(%u)\n", - nesadapter->free_4kpbl, nesadapter->max_4kpbl); - 
} - } else { - nesadapter->free_256pbl += nesmr->pbls_used; - if (nesadapter->free_256pbl > nesadapter->max_256pbl) { - printk(KERN_ERR PFX "free 256B PBLs(%u) has exceeded the max(%u)\n", - nesadapter->free_256pbl, nesadapter->max_256pbl); - } - } - } - - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); nes_fill_init_cqp_wqe(cqp_wqe, nesdev); set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_OPCODE_IDX, NES_CQP_DEALLOCATE_STAG | NES_CQP_STAG_VA_TO | @@ -2642,11 +2636,6 @@ static int nes_dereg_mr(struct ib_mr *ib_mr) " CQP Major:Minor codes = 0x%04X:0x%04X\n", ib_mr->rkey, ret, cqp_request->major_code, cqp_request->minor_code); - nes_free_resource(nesadapter, nesadapter->allocated_mrs, - (ib_mr->rkey & 0x0fffff00) >> 8); - - kfree(nesmr); - major_code = cqp_request->major_code; minor_code = cqp_request->minor_code; @@ -2662,8 +2651,33 @@ static int nes_dereg_mr(struct ib_mr *ib_mr) " to destroy STag, ib_mr=%p, rkey = 0x%08X\n", major_code, minor_code, ib_mr, ib_mr->rkey); return -EIO; - } else - return 0; + } + + if (nesmr->pbls_used != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if (nesmr->pbl_4k) { + nesadapter->free_4kpbl += nesmr->pbls_used; + if (nesadapter->free_4kpbl > nesadapter->max_4kpbl) + printk(KERN_ERR PFX "free 4KB PBLs(%u) has " + "exceeded the max(%u)\n", + nesadapter->free_4kpbl, + nesadapter->max_4kpbl); + } else { + nesadapter->free_256pbl += nesmr->pbls_used; + if (nesadapter->free_256pbl > nesadapter->max_256pbl) + printk(KERN_ERR PFX "free 256B PBLs(%u) has " + "exceeded the max(%u)\n", + nesadapter->free_256pbl, + nesadapter->max_256pbl); + } + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } + nes_free_resource(nesadapter, nesadapter->allocated_mrs, + (ib_mr->rkey & 0x0fffff00) >> 8); + + kfree(nesmr); + + return 0; } -- 1.5.3.3 From chien.tin.tung at intel.com Fri Jan 23 13:24:48 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Fri, 23 Jan 2009 15:24:48 -0600 Subject: [ofa-general] [PATCH v2] 
RDMA/nes: Improved use of pbls Message-ID: <20090123212448.GA7968@ctung-MOBL> From: Don Wood Two level 256 byte pbls was not implemented so the driver could report out of memory when in fact there were pbls still available. The solution prefers to use 4KB pbls over two level 256B pbls until the number of 4KB pbls falls below a threshold. At this point the 4KB pbl structure is converted to use 256B pbls which prevents the driver from running out of 4KB pbls too quickly. Signed-off-by: Don Wood --- V2 change: Split bug fixes from "Improved use of pbls" patch. diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index e53f1ea..04d2f56 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1887,21 +1887,75 @@ static int nes_destroy_cq(struct ib_cq *ib_cq) return ret; } +/** + * root_256 + */ +static u32 root_256(struct nes_device *nesdev, + struct nes_root_vpbl *root_vpbl, + struct nes_root_vpbl *new_root, + u16 pbl_count_4k, + u16 pbl_count_256) +{ + u64 leaf_pbl; + int i, j, k; + + if (pbl_count_4k == 1) { + new_root->pbl_vbase = pci_alloc_consistent(nesdev->pcidev, + 512, &new_root->pbl_pbase); + + if (new_root->pbl_vbase == NULL) + return 0; + + leaf_pbl = (u64)root_vpbl->pbl_pbase; + for (i = 0; i < 16; i++) { + new_root->pbl_vbase[i].pa_low = + cpu_to_le32((u32)leaf_pbl); + new_root->pbl_vbase[i].pa_high = + cpu_to_le32((u32)((((u64)leaf_pbl) >> 32))); + leaf_pbl += 256; + } + } else { + for (i = 3; i >= 0; i--) { + j = i * 16; + root_vpbl->pbl_vbase[j] = root_vpbl->pbl_vbase[i]; + leaf_pbl = le32_to_cpu(root_vpbl->pbl_vbase[j].pa_low) + + (((u64)le32_to_cpu(root_vpbl->pbl_vbase[j].pa_high)) + << 32); + for (k = 1; k < 16; k++) { + leaf_pbl += 256; + root_vpbl->pbl_vbase[j + k].pa_low = + cpu_to_le32((u32)leaf_pbl); + root_vpbl->pbl_vbase[j + k].pa_high = + cpu_to_le32((u32)((((u64)leaf_pbl) >> 32))); + } + } + } + + return 1; +} + /** * nes_reg_mr */ static int nes_reg_mr(struct 
nes_device *nesdev, struct nes_pd *nespd, u32 stag, u64 region_length, struct nes_root_vpbl *root_vpbl, - dma_addr_t single_buffer, u16 pbl_count, u16 residual_page_count, - int acc, u64 *iova_start) + dma_addr_t single_buffer, u16 pbl_count_4k, + u16 residual_page_count_4k, int acc, u64 *iova_start, + u16 *actual_pbl_cnt, u8 *used_4k_pbls) { struct nes_hw_cqp_wqe *cqp_wqe; struct nes_cqp_request *cqp_request; unsigned long flags; int ret; struct nes_adapter *nesadapter = nesdev->nesadapter; - /* int count; */ + uint pg_cnt = 0; + u16 pbl_count_256; + u16 pbl_count = 0; + u8 use_256_pbls = 0; + u8 use_4k_pbls = 0; + u16 use_two_level = (pbl_count_4k > 1) ? 1 : 0; + struct nes_root_vpbl new_root = {0, 0, 0}; u32 opcode = 0; u16 major_code; @@ -1914,41 +1968,70 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, cqp_request->waiting = 1; cqp_wqe = &cqp_request->cqp_wqe; - spin_lock_irqsave(&nesadapter->pbl_lock, flags); - /* track PBL resources */ - if (pbl_count != 0) { - if (pbl_count > 1) { - /* Two level PBL */ - if ((pbl_count+1) > nesadapter->free_4kpbl) { - nes_debug(NES_DBG_MR, "Out of 4KB Pbls for two level request.\n"); - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); - nes_free_cqp_request(nesdev, cqp_request); - return -ENOMEM; - } else { - nesadapter->free_4kpbl -= pbl_count+1; - } - } else if (residual_page_count > 32) { - if (pbl_count > nesadapter->free_4kpbl) { - nes_debug(NES_DBG_MR, "Out of 4KB Pbls.\n"); - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); - nes_free_cqp_request(nesdev, cqp_request); - return -ENOMEM; - } else { - nesadapter->free_4kpbl -= pbl_count; + if (pbl_count_4k) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + + pg_cnt = ((pbl_count_4k - 1) * 512) + residual_page_count_4k; + pbl_count_256 = (pg_cnt + 31) / 32; + if (pg_cnt <= 32) { + if (pbl_count_256 <= nesadapter->free_256pbl) + use_256_pbls = 1; + else if (pbl_count_4k <= nesadapter->free_4kpbl) + use_4k_pbls = 1; + } else if 
(pg_cnt <= 2048) { + if (((pbl_count_4k + use_two_level) <= nesadapter->free_4kpbl) && + (nesadapter->free_4kpbl > (nesadapter->max_4kpbl >> 1))) { + use_4k_pbls = 1; + } else if ((pbl_count_256 + 1) <= nesadapter->free_256pbl) { + use_256_pbls = 1; + use_two_level = 1; + } else if ((pbl_count_4k + use_two_level) <= nesadapter->free_4kpbl) { + use_4k_pbls = 1; } } else { - if (pbl_count > nesadapter->free_256pbl) { - nes_debug(NES_DBG_MR, "Out of 256B Pbls.\n"); - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); - nes_free_cqp_request(nesdev, cqp_request); - return -ENOMEM; - } else { - nesadapter->free_256pbl -= pbl_count; - } + if ((pbl_count_4k + 1) <= nesadapter->free_4kpbl) + use_4k_pbls = 1; + } + + if (use_256_pbls) { + pbl_count = pbl_count_256; + nesadapter->free_256pbl -= pbl_count + use_two_level; + } else if (use_4k_pbls) { + pbl_count = pbl_count_4k; + nesadapter->free_4kpbl -= pbl_count + use_two_level; + } else { + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + nes_debug(NES_DBG_MR, "Out of Pbls\n"); + nes_free_cqp_request(nesdev, cqp_request); + return -ENOMEM; } + + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); } - spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + if (use_256_pbls && use_two_level) { + if (root_256(nesdev, root_vpbl, &new_root, pbl_count_4k, pbl_count_256) == 1) { + if (new_root.pbl_pbase != 0) + root_vpbl = &new_root; + } else { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + nesadapter->free_256pbl += pbl_count_256 + use_two_level; + use_256_pbls = 0; + + if (pbl_count_4k == 1) + use_two_level = 0; + pbl_count = pbl_count_4k; + + if ((pbl_count_4k + use_two_level) <= nesadapter->free_4kpbl) { + nesadapter->free_4kpbl -= pbl_count + use_two_level; + use_4k_pbls = 1; + } + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + + if (use_4k_pbls == 0) + return -ENOMEM; + } + } opcode = NES_CQP_REGISTER_STAG | NES_CQP_STAG_RIGHTS_LOCAL_READ | NES_CQP_STAG_VA_TO | NES_CQP_STAG_MR; @@ -1977,10 +2060,9 
@@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, } else { set_wqe_64bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PA_LOW_IDX, root_vpbl->pbl_pbase); set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PBL_BLK_COUNT_IDX, pbl_count); - set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PBL_LEN_IDX, - (((pbl_count - 1) * 4096) + (residual_page_count*8))); + set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_STAG_WQE_PBL_LEN_IDX, (pg_cnt * 8)); - if ((pbl_count > 1) || (residual_page_count > 32)) + if (use_4k_pbls) cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] |= cpu_to_le32(NES_CQP_STAG_PBL_BLK_SIZE); } barrier(); @@ -1996,23 +2078,26 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, stag, ret, cqp_request->major_code, cqp_request->minor_code); major_code = cqp_request->major_code; nes_put_cqp_request(nesdev, cqp_request); + if ((!ret || major_code) && pbl_count != 0) { spin_lock_irqsave(&nesadapter->pbl_lock, flags); - if (pbl_count > 1) - nesadapter->free_4kpbl += pbl_count+1; - else if (residual_page_count > 32) - nesadapter->free_4kpbl += pbl_count; - else - nesadapter->free_256pbl += pbl_count; + if (use_256_pbls) + nesadapter->free_256pbl += pbl_count + use_two_level; + else if (use_4k_pbls) + nesadapter->free_4kpbl += pbl_count + use_two_level; spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); } + if (new_root.pbl_pbase) + pci_free_consistent(nesdev->pcidev, 512, new_root.pbl_vbase, + new_root.pbl_pbase); + if (!ret) return -ETIME; else if (major_code) return -EIO; - else - return 0; + *actual_pbl_cnt = pbl_count + use_two_level; + *used_4k_pbls = use_4k_pbls; return 0; } @@ -2177,18 +2262,14 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, pbl_count = root_pbl_index; } ret = nes_reg_mr(nesdev, nespd, stag, region_length, &root_vpbl, - buffer_list[0].addr, pbl_count, (u16)cur_pbl_index, acc, iova_start); + buffer_list[0].addr, pbl_count, (u16)cur_pbl_index, acc, iova_start, + 
&nesmr->pbls_used, &nesmr->pbl_4k); if (ret == 0) { nesmr->ibmr.rkey = stag; nesmr->ibmr.lkey = stag; nesmr->mode = IWNES_MEMREG_TYPE_MEM; ibmr = &nesmr->ibmr; - nesmr->pbl_4k = ((pbl_count > 1) || (cur_pbl_index > 32)) ? 1 : 0; - nesmr->pbls_used = pbl_count; - if (pbl_count > 1) { - nesmr->pbls_used++; - } } else { kfree(nesmr); ibmr = ERR_PTR(-ENOMEM); @@ -2466,8 +2547,9 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, stag, (unsigned int)iova_start, (unsigned int)region_length, stag_index, (unsigned long long)region->length, pbl_count); - ret = nes_reg_mr( nesdev, nespd, stag, region->length, &root_vpbl, - first_dma_addr, pbl_count, (u16)cur_pbl_index, acc, &iova_start); + ret = nes_reg_mr(nesdev, nespd, stag, region->length, &root_vpbl, + first_dma_addr, pbl_count, (u16)cur_pbl_index, acc, + &iova_start, &nesmr->pbls_used, &nesmr->pbl_4k); nes_debug(NES_DBG_MR, "ret=%d\n", ret); @@ -2476,11 +2558,6 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, nesmr->ibmr.lkey = stag; nesmr->mode = IWNES_MEMREG_TYPE_MEM; ibmr = &nesmr->ibmr; - nesmr->pbl_4k = ((pbl_count > 1) || (cur_pbl_index > 32)) ? 1 : 0; - nesmr->pbls_used = pbl_count; - if (pbl_count > 1) { - nesmr->pbls_used++; - } } else { ib_umem_release(region); kfree(nesmr); -- 1.5.3.3 From weiny2 at llnl.gov Fri Jan 23 14:20:16 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 23 Jan 2009 14:20:16 -0800 Subject: [ofa-general] Re: [PATCH V2 1/3] Create a new library libibnetdisc In-Reply-To: <20090121120727.GB3479@sashak.voltaire.com> References: <20081211162031.0c591f54.weiny2@llnl.gov> <20081221152100.GN25208@sashak.voltaire.com> <20081223113449.7d6c629b.weiny2@llnl.gov> <20090121120727.GB3479@sashak.voltaire.com> Message-ID: <20090123142016.320aebdf.weiny2@llnl.gov> Sasha, It sounds like we are in agreement with this architecture. 
> > +--------+ > > -| diag 1 |- > > / +--------+ \ > > +-----------------+ +--------+ +-----------------+ > > | libibnetdisc | -| diag 2 |--| libibmad | > > +-----------------+ +--------+ +-----------------+ > > \ / > > -------------------------- I am going to integrate your comments with a reworked version of the patches. In the mean time I and/or Al will be submitting some patches for the libibmad interface. Ira On Wed, 21 Jan 2009 14:07:27 +0200 Sasha Khapyorsky wrote: > Hi Ira, > > On 11:34 Tue 23 Dec , Ira Weiny wrote: > > > > > > What is the reason to redeclear custom NodeInfo and PortInfo structures? > > > The original are defined by IBA and there are lot of utilities to work > > > with them. Wouldn't it be better to use it as is? > > > > > > > [DISCLAIMER] First I want to answer your questions directly. However, after > > writing this information, discussing with Al, and thinking it > > over. I think I see where you are coming from and I _may_ agree > > with you. So after reading these responses please read my > > thoughts regarding libibmad and this new lib. > > > > > > I have 3 reasons I did it this way: > > > > 1) This is pretty much the way that ibnetdiscover did things > > (By using mad_decode_field into these single fields) > > > > 2) This makes libibnetdisc only dependent on libibmad rather than the OpenSM > > libs. We have had some people complain that the diags require opensm-* > > things to be installed. (This assumes you want to use ib_port_info_t from > > ib_types.h) > > I thought about using libibmad rather than OpenSM stuff (in > libibnetdisc), so PortInfo, NodeInfo and other SM attributes will be > represented just as raw data and libibmad (or any other MAD interpreter) > will be used by diag/whatever tool which uses libibnetdesc for decoding. > > > 3) This structure is in host byte order and calls out each field > > independently rather than having to have intimate knowledge of the > > PortInfo wire packet. 
> > > > For example this is the code used in iblinkinfo with the above structure. > > > > > > snprintf(link_str, 256, > > "(%3s %s %6s/%8s)", > > ibnd_linkwidth_str(port->info.link_width_active), > > ibnd_linkspeed_str(port->info.link_speed_active, 1), > > ibnd_linkstate_str(port->info.link_state), > > ibnd_physstate_str(port->info.phys_state) > > ); > > > > Here is the code if you use the ib_port_info_t from ib_types.h > > > > snprintf(link_str, 256, > > "(%3s %s %6s/%8s)", > > ibnd_linkwidth_str(port->info.link_width_active), > > ibnd_linkspeed_str(IB_PORT_LINK_SPEED_ACTIVE_MASK(port->info.link_speed), 1), > > ibnd_linkstate_str(IB_PORT_STATE_MASK(port->info.state_info1)), > > ibnd_physstate_str(IB_PORT_PHYS_STATE_MASK(port->info.state_info2) >> IB_PORT_PHYS_STATE_SHIFT) > > // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > // This is particularly nasty compared to the above. > > ); > > > > I no longer agree with reason 1 and 2. > > Ok. > > > However, reason 3 I believe is enough > > justification to declare a new type. > > > > [DISCLAIMER] item 3 might be a mute point as well if you redefine what > > libibnetdisc is supposed to be. See below. > > Ok. See below. > > > > > + * Str convert functions > > > > + */ > > > > +char *ibnd_linkwidth_str(int link_width); > > > > +char *ibnd_linkstate_str(int link_state); > > > > +char *ibnd_physstate_str(int phys_state); > > > > +const char *ibnd_node_type_str(ibnd_node_t *node); > > > > +const char *ibnd_node_type_str_short(ibnd_node_t *node); > > > > +char *ibnd_linkspeed_str(int link_speed, int data_rate); > > > > + /* if data_rate == 0 use "SDR", "DDR", etc. */ > > > > + /* if data_rate == 1 use "2.5 Gbps", "5.0 Gbps", etc. */ > > > > > > Similar functions exist in libibmad. Why do we need another set? > > > > 2 reasons. > > > > 1) The strings returned are not compatible with the current output of > > ibnetdiscover and iblinkinfo... 
I was trying to make sure that the > > library returned string which were backwards compatible. That is > > actually the reason for the extra "data_rate" parameter of linkspeed. > > iblinkinfo and ibnetdiscover print this differently. :-( > > > > 2) But more importantly this is an ease of use issue. > > > > This: > > > > snprintf(link_str, 256, > > "(%3s %s %6s/%8s)", > > ibnd_linkwidth_str(port->info.link_width_active), > > ibnd_linkspeed_str(port->info.link_speed_active, 1), > > ibnd_linkstate_str(port->info.link_state), > > ibnd_physstate_str(port->info.phys_state) > > ); > > > > Becomes this: > > > > char buf[256]; > > ... > > snprintf(link_str, 256, > > "(%3s %s %6s/%8s)", > > mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, buf, 256, &port->info.link_width_active); > > mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, buf, 256, &port->info.link_speed_active); > > // ^^^^^^^^^^^^^^^^^ > > // Not backwards compatible with the current ibnetdiscover > > // as it prints the data as "2.5 Gbps" rather than "SDR" > > mad_dump_val(IB_PORT_STATE_F, buf, 256, &port->info.link_state); > > mad_dump_val(IB_PORT_PHYS_STATE_F, buf, 256, > > &port->info.phys_state);); > > > > Users don't need to go look up in mad.h for the field enum to print > > something they already have; "link_width_active". > > > > > > Anyway, I think I am starting to see the difference in what we are thinking... > > > > The ibnd_*_str functions and the ibnd_port_info_t were designed based on > > libibnetdisc being a "one stop shop" for this data. I envisioned this library > > being a wrapper around lower level libraries which would abstract away some > > details, something like this. 
> > > > +----------+ +----------+ > > | diag1 | | diag2 | > > +----------+ +----------+ > > | | > > +-----------------+ > > | libibnetdisc | > > +-----------------+ > > | > > +-----------------+ > > | libibmad | > > +-----------------+ > > > > I think what you had in mind was something like: > > > > +--------+ > > -| diag 1 |- > > / +--------+ \ > > +-----------------+ +--------+ +-----------------+ > > | libibnetdisc | -| diag 2 |--| libibmad | > > +-----------------+ +--------+ +-----------------+ > > \ / > > -------------------------- > > > > In this case users of libibnetdisc might get back something like: > > > > typedef struct port { > > uint64_t guid; > > int portnum; > > int ext_portnum; /* optional if != 0 external port num */ > > ibnd_node_t *node; /* node this port belongs to */ > > struct port *remoteport; /* null if SMA, or does not exist */ > > void *port_info; /* or uint8_t port_info[port_info_size] */ > > } ibnd_port_t; > > > > and decode port_info like this: > > > > uint32_t lid = mad_get_field(port->port_info, 0, IB_PORT_LID_F); > > mad_dump_val(IB_PORT_LID_F, port->port_info, &lid); > > > > Is that what you are thinking? > > Yes. > > > If this is the case I don't think I object. I > > think it makes the end user of libibnetdisc work harder but it does offer some > > advantages, namely less redefinition and a bit more flexibility. > > More flexibility is even more important IMO. > > > That said, I would like to clean up the mad interface at the same time. > > Which is good thing by itself :) . I absolutely agree that we can > improve our helpers. > > > Just > > figuring out the examples to write in this email have taken a lot of time. I > > don't think this is a good thing. 
> > > > Here are some examples: > > > > add something like: > > > > static inline char * > > mad_snprint_field(uint8_t *buf, int base_offs, int field, > > char *print_buf, int print_buf_size) > > > > Therefore the above could be used in a print statement like: > > > > char tmp[256]; > > printf("lid %s\n", mad_snprint_field(port->port_info, 0, IB_PORT_LID_F, tmp, > > 256)); > > > > [Although lid is a bad example since it could be done with "%d"... But you > > get what I mean.] > > Agree. > > > > > And along those lines the difference between mad_dump_field and mad_dump_val > > needs to be made more clear. They have the same signature but one has a lot of > > formating added to it which I don't think is appropriate at this level. > > > > "LinkState:.......................Active" > > vs. > > "Active" > > Why not? I think it was designed to be just different levels. Maybe you > are about unclear naming? > > > Also, I don't think that the following declarations need to be public. > > > > /* dump.c */ > > ib_mad_dump_fn > > mad_dump_int, mad_dump_uint, mad_dump_hex, mad_dump_rhex, > > mad_dump_bitfield, mad_dump_array, mad_dump_string, > > mad_dump_linkwidth, mad_dump_linkwidthsup, mad_dump_linkwidthen, > > mad_dump_linkdowndefstate, > > mad_dump_linkspeed, mad_dump_linkspeedsup, mad_dump_linkspeeden, > > mad_dump_portstate, mad_dump_portstates, > > mad_dump_physportstate, mad_dump_portcapmask, > > mad_dump_mtu, mad_dump_vlcap, mad_dump_opervls, > > mad_dump_node_type, > > mad_dump_sltovl, mad_dump_vlarbitration, > > mad_dump_nodedesc, mad_dump_nodeinfo, mad_dump_portinfo, mad_dump_switchinfo, > > mad_dump_perfcounters, mad_dump_perfcounters_ext; > > Again, why not? Those helpers are widely used now in infiniband-diags. 
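As a rough illustration of the two output levels being discussed (a bare value such as "Active" versus a labelled, dot-padded line such as "LinkState:...Active"), here is a self-contained C sketch. It does not use libibmad: demo_portstate_str(), demo_dump_val() and demo_dump_field() are hypothetical stand-ins, the pad width is a caller-supplied parameter rather than whatever libibmad uses, and both helpers return the number of characters printed, along the lines suggested in this thread.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical decoder for the PortState value (0..4). */
static const char *demo_portstate_str(int state)
{
	static const char *names[] = {
		"NoChange", "Down", "Initialize", "Armed", "Active"
	};

	if (state < 0 || state > 4)
		return "?";
	return names[state];
}

/* Lower level: print only the decoded value, e.g. "Active". */
static int demo_dump_val(char *buf, size_t bufsz, int state)
{
	return snprintf(buf, bufsz, "%s", demo_portstate_str(state));
}

/* Higher level: print "Name:" padded with dots up to `width` columns,
 * followed by the decoded value. */
static int demo_dump_field(char *buf, size_t bufsz, const char *name,
			   int width, int state)
{
	char dots[64];
	int namelen = (int)strlen(name) + 1;	/* name plus ':' */
	int ndots = width > namelen ? width - namelen : 0;

	if (ndots > (int)sizeof(dots) - 1)
		ndots = (int)sizeof(dots) - 1;
	memset(dots, '.', (size_t)ndots);
	dots[ndots] = '\0';

	return snprintf(buf, bufsz, "%s:%s%s", name, dots,
			demo_portstate_str(state));
}
```

With a buffer of 128 bytes, demo_dump_val(buf, 128, 4) fills in the bare value, while demo_dump_field(buf, 128, "LinkState", 33, 4) produces the labelled form; a tool that wants its own layout would call the first, and a tool that wants a ready-made report line would call the second.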
> > > > > int _mad_dump(ib_mad_dump_fn *fn, char *name, void *val, int valsz); > > char * _mad_dump_field(ib_field_t *f, char *name, char *buf, int bufsz, > > void *val); > > int _mad_print_field(ib_field_t *f, char *name, void *val, int valsz); > > char * _mad_dump_val(ib_field_t *f, char *buf, int bufsz, void *val); > > > > They confuse the ibmad layer. > > I think there are two (at least) levels - lower is for more flexibility > and higher is for easy to use. > > Of course I agree that some improvements could be very useful - for > example doing 'ib_mad_dump_fn' to return number of printed characters > (similar to proposed mad_snprint_field()). > > > If this is what you would like I will rework the library. > > If we are agreed. > > > Perhaps starting to > > clean up libibmad along the way? > > In parallel... :) > > Sasha From vlad at lists.openfabrics.org Sat Jan 24 03:12:16 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 24 Jan 2009 03:12:16 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090124-0200 daily build status Message-ID: <20090124111217.0A39DE60ED0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 
with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Sat Jan 24 05:58:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 24 Jan 2009 15:58:08 +0200 Subject: [ofa-general] Re: [PATCH V2 1/3] Create a new library libibnetdisc In-Reply-To: <20090123142016.320aebdf.weiny2@llnl.gov> References: <20081211162031.0c591f54.weiny2@llnl.gov> <20081221152100.GN25208@sashak.voltaire.com> <20081223113449.7d6c629b.weiny2@llnl.gov> <20090121120727.GB3479@sashak.voltaire.com> <20090123142016.320aebdf.weiny2@llnl.gov> Message-ID: <20090124135808.GA18502@sashak.voltaire.com> Hi Ira, On 14:20 Fri 23 Jan , Ira Weiny wrote: > Sasha, > > It sounds like we are in agreement with this architecture. 
> > > > +--------+ > > > -| diag 1 |- > > > / +--------+ \ > > > +-----------------+ +--------+ +-----------------+ > > > | libibnetdisc | -| diag 2 |--| libibmad | > > > +-----------------+ +--------+ +-----------------+ > > > \ / > > > -------------------------- > > I am going to integrate your comments with a reworked version of the patches. > In the mean time I and/or Al will be submitting some patches for the libibmad > interface. Sure. Thanks! Sasha From eli at dev.mellanox.co.il Sun Jan 25 01:43:19 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 25 Jan 2009 11:43:19 +0200 Subject: [ofa-general] Re: [ewg] Re: [PATCH v1] mlx4_ib: Optimize hugetlab pages support In-Reply-To: References: <20090122163105.GA23496@mtls03> Message-ID: <20090125094319.GA19411@mtls03> On Thu, Jan 22, 2009 at 09:07:41PM -0800, Roland Dreier wrote: > > seems this might underestimate by 1 if the region doesn't start/end on a > huge-page aligned boundary (but we would still want to use big pages to > register it).
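The off-by-one mentioned above can be seen with a small worked example. The following C sketch is only an illustration and is not code from the patch: the 2 MiB huge page size and the names demo_hpages_by_length() and demo_hpages_spanned() are assumptions. It compares a count based on the region length alone with a count that also accounts for the offset of the start address inside its huge page.

```c
#include <stdint.h>

/* Assumed huge page size for the example: 2 MiB (1 << 21). */
#define DEMO_HPAGE_SHIFT 21
#define DEMO_HPAGE_SIZE  ((uint64_t)1 << DEMO_HPAGE_SHIFT)

/* Count based on length only: correct only for aligned regions. */
static uint64_t demo_hpages_by_length(uint64_t length)
{
	return (length + DEMO_HPAGE_SIZE - 1) >> DEMO_HPAGE_SHIFT;
}

/* Count of huge pages actually touched by [addr, addr + length). */
static uint64_t demo_hpages_spanned(uint64_t addr, uint64_t length)
{
	uint64_t offset = addr & (DEMO_HPAGE_SIZE - 1);

	return (offset + length + DEMO_HPAGE_SIZE - 1) >> DEMO_HPAGE_SHIFT;
}
```

For a 2 MiB region starting 1 MiB into a huge page, the length-only count gives 1 while the region really touches 2 huge pages, which is why the start address of the mapping has to be known when sizing the page array.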
> Looks like we must pass the virtual address through struct ib_umem to the low level driver. > > I think we could avoid the uninitialized_var() stuff and having restart > at all by just doing cur_size = 0 at the start of the loop, and then > instead of if (restart) just test if cur_size is 0. > initializing cur_size and eliminating restart works fine but cur_addr still needs this trick. I am sending two patches, one for ib_core and one for mlx4. From eli at mellanox.co.il Sun Jan 25 01:45:06 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 25 Jan 2009 11:45:06 +0200 Subject: [ofa-general] [PATCH] ib_core: save process's virtual address in struct ib_umem Message-ID: <20090125094506.GA19444@mtls03> add "address" field to struct ib_umem so low level drivers will have this information which may be needed in order to correctly calculate the number of huge pages. Signed-off-by: Eli Cohen --- drivers/infiniband/core/umem.c | 1 + include/rdma/ib_umem.h | 1 + 2 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 6f7c096..4c076c4 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -102,6 +102,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->context = context; umem->length = size; umem->offset = addr & ~PAGE_MASK; + umem->address = addr; umem->page_size = PAGE_SIZE; /* * We ask for writable memory if any access flags other than diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 9ee0d2e..c385bb6 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -43,6 +43,7 @@ struct ib_umem { struct ib_ucontext *context; size_t length; int offset; + unsigned long address; int page_size; int writable; int hugetlb; -- 1.6.1 From eli at mellanox.co.il Sun Jan 25 01:45:46 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 25 Jan 2009 11:45:46 +0200 Subject: [ofa-general] [PATCH v2] mlx4_ib: Optimize 
hugetlab pages support Message-ID: <20090125094546.GA19466@mtls03> Since Linux may not merge adjacent pages into a single scatter entry through calls to dma_map_sg(), we check the special case of hugetlb pages which are likely to be mapped to contiguous dma addresses and if they are, take advantage of this. This will result in a significantly lower number of MTT segments used for registering hugetlb memory regions. Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/mr.c | 80 ++++++++++++++++++++++++++++++++++---- 1 files changed, 71 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c index 8e4d26d..7748823 100644 --- a/drivers/infiniband/hw/mlx4/mr.c +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -119,6 +119,65 @@ out: return err; } +static int handle_hugetlb_user_mr(struct ib_pd *pd, struct mlx4_ib_mr *mr, + u64 virt_addr, int access_flags) +{ +#ifdef CONFIG_HUGETLB_PAGE + struct mlx4_ib_dev *dev = to_mdev(pd->device); + struct ib_umem_chunk *chunk; + unsigned dsize; + dma_addr_t daddr; + unsigned cur_size = 0; + dma_addr_t uninitialized_var(cur_addr); + int n; + struct ib_umem *umem = mr->umem; + u64 *arr; + int err = 0; + int i; + int j = 0; + + n = PAGE_ALIGN(umem->length + (umem->address & ~HPAGE_MASK)) >> HPAGE_SHIFT; + arr = kmalloc(n * sizeof *arr, GFP_KERNEL); + if (!arr) + return -ENOMEM; + + list_for_each_entry(chunk, &umem->chunk_list, list) + for (i = 0; i < chunk->nmap; ++i) { + daddr = sg_dma_address(&chunk->page_list[i]); + dsize = sg_dma_len(&chunk->page_list[i]); + if (!cur_size) { + cur_addr = daddr; + cur_size = dsize; + } else if (cur_addr + cur_size != daddr) { + err = -EINVAL; + goto out; + } else + cur_size += dsize; + + if (cur_size > HPAGE_SIZE) { + err = -EINVAL; + goto out; + } else if (cur_size == HPAGE_SIZE) { + cur_size = 0; + arr[j++] = cur_addr; + } + } + + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, umem->length, + convert_access(access_flags), n, HPAGE_SHIFT,
&mr->mmr); + if (err) + goto out; + + err = mlx4_write_mtt(dev->dev, &mr->mmr.mtt, 0, n, arr); + +out: + kfree(arr); + return err; +#else + return -ENOSYS; +#endif +} + struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata) @@ -140,17 +199,20 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, goto err_free; } - n = ib_umem_page_count(mr->umem); - shift = ilog2(mr->umem->page_size); + if (!mr->umem->hugetlb || + handle_hugetlb_user_mr(pd, mr, virt_addr, access_flags)) { + n = ib_umem_page_count(mr->umem); + shift = ilog2(mr->umem->page_size); - err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, - convert_access(access_flags), n, shift, &mr->mmr); - if (err) - goto err_umem; + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, + convert_access(access_flags), n, shift, &mr->mmr); + if (err) + goto err_umem; - err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); - if (err) - goto err_mr; + err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); + if (err) + goto err_mr; + } err = mlx4_mr_enable(dev->dev, &mr->mmr); if (err) -- 1.6.1 From sashak at voltaire.com Sun Jan 25 02:07:04 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 Jan 2009 12:07:04 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/dump_lfts.sh: fix -D format parsing In-Reply-To: <20090109154759.3f5d97b2.weiny2@llnl.gov> References: <20090109154759.3f5d97b2.weiny2@llnl.gov> Message-ID: <20090125100704.GA20261@sashak.voltaire.com> DR format was changed slightly - fix format parsing accordingly. Also as Ira suggested use '-s' (show) option with ibnetdiscover instead of -v (verbose) - it produces less output. 
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/scripts/dump_lfts.sh | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/scripts/dump_lfts.sh b/infiniband-diags/scripts/dump_lfts.sh index ebca705..a07c211 100755 --- a/infiniband-diags/scripts/dump_lfts.sh +++ b/infiniband-diags/scripts/dump_lfts.sh @@ -22,8 +22,8 @@ done dump_by_dr_path () { -for sw_dr in `ibnetdiscover $ca_info -v \ - | sed -ne '/^DR path .* switch /s/^DR path \([,|0-9]\+\) ->.*{\([0-9|a-f]\+\)}.*$/\2 \1/p' \ +for sw_dr in `ibnetdiscover $ca_info -s \ + | sed -ne '/^DR path .* switch /s/^DR path .*; \([,|0-9]\+\) ->.*{\([0-9|a-f]\+\)}.*$/\2 \1/p' \ | sort -u \ | awk 'BEGIN {guid=0;} {if ($1 != guid) { guid=$1; print $2; }}'` ; do ibroute $ca_info -D ${sw_dr} -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Sun Jan 25 02:07:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 Jan 2009 12:07:55 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/dump_mfts.sh: fix -D format parsing In-Reply-To: <20090125100704.GA20261@sashak.voltaire.com> References: <20090109154759.3f5d97b2.weiny2@llnl.gov> <20090125100704.GA20261@sashak.voltaire.com> Message-ID: <20090125100755.GB20261@sashak.voltaire.com> The fixes are similar to those done in dump_lfts.sh, plus all previous adaptations and fixes.
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/scripts/dump_mfts.sh | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/scripts/dump_mfts.sh b/infiniband-diags/scripts/dump_mfts.sh index 39fc5fb..cdadba2 100755 --- a/infiniband-diags/scripts/dump_mfts.sh +++ b/infiniband-diags/scripts/dump_mfts.sh @@ -22,10 +22,10 @@ done dump_by_dr_path () { -for sw_dr in `ibnetdiscover $ca_info -v \ - | sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \ - | sed -e 's/\]\[/,/g' \ - | sort -u` ; do +for sw_dr in `ibnetdiscover $ca_info -s \ + | sed -ne '/^DR path .* switch /s/^DR path .*; \([,|0-9]\+\) ->.*{\([0-9|a-f]\+\)}.*$/\2 \1/p' \ + | sort -u \ + | awk 'BEGIN {guid=0;} {if ($1 != guid) { guid=$1; print $2; }}'` ; do ibroute $ca_info -M -D ${sw_dr} done } -- 1.6.0.4.766.g6fc4a From tziporet at dev.mellanox.co.il Sun Jan 25 02:48:48 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 25 Jan 2009 12:48:48 +0200 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F41C6FBC34@orsmsx507.amr.corp.intel.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBC34@orsmsx507.amr.corp.intel.com> Message-ID: <497C4390.4090209@mellanox.co.il> Woodruff, Robert J wrote: > Personally I do not have a problem with including it, since MPI is > an isolated component and does not effect the core stack, > but I thought that we had discussed in Sonoma last year > not including major new features in point releases to > reduce the QA that is needed. And, in general I think that > is the way that kernel.org works, point releases are just for > bug fixes. > > In any case, lets discuss it again in the EWG on Monday. 
> > I will add this to the agenda Note that we will start working to add RH 5.3 backports now to see how much effort it is Tziporet From vlad at lists.openfabrics.org Sun Jan 25 02:54:27 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 25 Jan 2009 02:54:27 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090125-0112 daily build status Message-ID: <20090125105427.49E80E61198@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with 
linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Sun Jan 25 03:51:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 Jan 2009 13:51:36 +0200 Subject: [ofa-general] [PATCH] infiniband-diags: command line option processing framework Message-ID: <20090125115136.GA20419@sashak.voltaire.com> The main motivation of this is to unify infiniband-diags command line options and tools usage. Also it simplifies programming and can remove a lot of duplications. The usage message is also unified over all tools and looks like: Usage: ibaddr [options] [] Options: --gid_show, -g show gid address only --lid_show, -l show lid range only --Lid_show, -L show lid range (in decimal) only --Ca, -C Ca name to use --Port, -P Ca port number to use --Direct, -D use Direct address argument --Guid, -G use GUID address argument --timeout, -t timeout in ms --sm_port, -s SM port lid --errors, -e show send and receive errors --verbose, -v increase verbosity level --debug, -d raise debug level --usage, -u usage message --help, -h help message --version, -V show version Examples: ibaddr # local port's address ibaddr 32 # show lid range and gid of lid 32 ibaddr -G 0x8f1040023 # same but using guid address ibaddr -l 32 # show lid range only ibaddr -L 32 # show decimal lid range only ibaddr -g 32 # show gid address only Custom (per tool) option processing is also supported. 
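As a rough illustration of the custom (per tool) option processing mentioned above, here is a minimal standalone sketch, not the code from this patch. All names in it (`tool_ctx`, `tool_handler`, `common_handler`, `dispatch_opt`, `get_verbose_level`) are hypothetical stand-ins. It assumes the convention that a handler returns 0 when it consumed an option and non-zero otherwise, so the per-tool handler is tried first and the common handler is the fallback.

```c
#include <stddef.h>

/* Hypothetical per-tool state; not the names used by the patch. */
struct tool_ctx {
	int show_gid;
	int show_lid;
};

/* Hypothetical stand-in for state owned by the common option code. */
static int verbose_level;

/* Per-tool handler: returns 0 when it consumed the option, -1 otherwise. */
int tool_handler(void *ctx, int ch, char *arg)
{
	struct tool_ctx *t = ctx;

	(void)arg;	/* these example options take no argument */
	switch (ch) {
	case 'g':
		t->show_gid = 1;
		return 0;
	case 'l':
		t->show_lid++;
		return 0;
	default:
		return -1;
	}
}

/* Common handler: options shared by every tool (only -v in this sketch). */
int common_handler(int ch, char *arg)
{
	(void)arg;
	switch (ch) {
	case 'v':
		verbose_level++;
		return 0;
	default:
		return -1;
	}
}

/* Try the per-tool handler first, then fall back to the common one.
 * A non-zero result means nobody recognised the option, which is where
 * a real tool would print its usage message. */
int dispatch_opt(void *ctx, int ch, char *arg,
		 int (*custom)(void *, int, char *))
{
	if (custom && custom(ctx, ch, arg) == 0)
		return 0;
	return common_handler(ch, arg);
}

int get_verbose_level(void)
{
	return verbose_level;
}
```

Because the per-tool handler is consulted first, a tool can give a letter its own meaning even when the common set also defines it, and a tool with no private options can simply pass no handler.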
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/ibdiag_common.h | 28 +++- infiniband-diags/src/ibdiag_common.c | 248 ++++++++++++++++++++++++++++++ 2 files changed, 274 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 0518579..4304826 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -35,18 +35,42 @@ #ifndef _IBDIAG_COMMON_H_ #define _IBDIAG_COMMON_H_ +#include + extern int ibdebug; +extern int ibverbose; +extern char *ibd_ca; +extern int ibd_ca_port; +extern int ibd_dest_type; +extern ib_portid_t *ibd_sm_id; +extern int ibd_timeout; /*========================================================*/ /* External interface */ /*========================================================*/ #undef DEBUG -#define DEBUG if (ibdebug || verbose) IBWARN -#define VERBOSE if (ibdebug || verbose > 1) IBWARN +#define DEBUG if (ibdebug || ibverbose) IBWARN +#define VERBOSE if (ibdebug || ibverbose > 1) IBWARN #define IBERROR(fmt, args...) 
iberror(__FUNCTION__, fmt, ## args) extern void iberror(const char *fn, char *msg, ...); extern const char *get_build_version(void); +struct ibdiag_opt { + const char *name; + char letter; + unsigned has_arg; + const char *arg_tmpl; + const char *description; +}; + +extern int ibdiag_process_opts(int argc, char * const argv[], void *context, + const char *exclude_common_str, + const struct ibdiag_opt custom_opts[], + int (*custom_handler)(void *cxt, int val, char *optarg), + const char *usage_args, + const char *usage_examples[]); +extern void ibdiag_show_usage(); + #endif /* _IBDIAG_COMMON_H_ */ diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index 3a9d5c2..8dcec5e 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -46,11 +46,259 @@ #include #include #include +#include +#include +#include #include #include int ibdebug; +int ibverbose; +char *ibd_ca; +int ibd_ca_port; +int ibd_dest_type = IB_DEST_LID; +ib_portid_t *ibd_sm_id; +int ibd_timeout; + +static ib_portid_t sm_portid = {0}; + +static const char *prog_name; +static const char *prog_args; +static const char **prog_examples; +static struct option *long_opts; +static const struct ibdiag_opt *opts_map[256]; + +static void pretty_print(int start, int width, const char *str) +{ + int len = width - start; + const char *p, *e; + + while (1) { + while(isspace(*str)) + str++; + p = str; + do { + e = p + 1; + p = strchr(e, ' '); + } while (p && p - str < len); + if (!p) { + fprintf(stderr, "%s", str); + break; + } + if (e - str == 1) + e = p; + fprintf(stderr, "%.*s\n%*s", e - str, str, start, ""); + str = e; + } +} + +void ibdiag_show_usage() +{ + struct option *o = long_opts; + int n; + + fprintf(stderr, "\nUsage: %s [options] %s\n\n", prog_name, + prog_args ? 
prog_args : ""); + + if (long_opts[0].name) + fprintf(stderr, "Options:\n"); + for (o = long_opts; o->name; o++) { + const struct ibdiag_opt *io = opts_map[o->val]; + n = fprintf(stderr, " --%s", io->name); + if (isprint(io->letter)) + n += fprintf(stderr, ", -%c", io->letter); + if (io->has_arg) + n += fprintf(stderr, " %s", + io->arg_tmpl ? io->arg_tmpl : ""); + if (io->description && *io->description) { + n += fprintf(stderr, "%*s ", 24 - n > 0 ? 24 - n : 0, ""); + pretty_print(n, 74, io->description); + } + fprintf(stderr, "\n"); + } + + if (prog_examples) { + const char **p; + fprintf(stderr, "\nExamples:\n"); + for (p = prog_examples; *p && **p; p++) + fprintf(stderr, " %s %s\n", prog_name, *p); + } + + fprintf(stderr, "\n"); + + exit(2); +} + +static int process_opt(int ch, char *optarg) +{ + int val; + + switch (ch) { + case 'h': + case 'u': + ibdiag_show_usage(); + break; + case 'V': + fprintf(stderr, "%s %s\n", prog_name, get_build_version()); + exit(2); + case 'e': + madrpc_show_errors(1); + break; + case 'v': + ibverbose++; + break; + case 'd': + ibdebug++; + madrpc_show_errors(1); + umad_debug(ibdebug - 1); + break; + case 'C': + ibd_ca = optarg; + break; + case 'P': + ibd_ca_port = strtoul(optarg, 0, 0); + break; + case 'D': + ibd_dest_type = IB_DEST_DRPATH; + break; + case 'L': + ibd_dest_type = IB_DEST_LID; + break; + case 'G': + ibd_dest_type = IB_DEST_GUID; + break; + case 't': + val = strtoul(optarg, 0, 0); + madrpc_set_timeout(val); + ibd_timeout = val; + break; + case 's': + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) + IBERROR("cannot resolve SM destination port %s", optarg); + ibd_sm_id = &sm_portid; + break; + default: + return -1; + } + + return 0; +} + +static const struct ibdiag_opt common_opts[] = { + { "Ca", 'C', 1, "", "Ca name to use"}, + { "Port", 'P', 1, "", "Ca port number to use"}, + { "Direct", 'D', 0, NULL, "use Direct address argument"}, + { "Lid", 'L', 0, NULL, "use LID address argument"}, + { "Guid", 
'G', 0, NULL, "use GUID address argument"}, + { "timeout", 't', 1, "", "timeout in ms"}, + { "sm_port", 's', 1, "", "SM port lid" }, + { "errors", 'e', 0, NULL, "show send and receive errors" }, + { "verbose", 'v', 0, NULL, "increase verbosity level" }, + { "debug", 'd', 0, NULL, "raise debug level" }, + { "usage", 'u', 0, NULL, "usage message" }, + { "help", 'h', 0, NULL, "help message" }, + { "version", 'V', 0, NULL, "show version" }, + {} +}; + +static void make_opt(struct option *l, const struct ibdiag_opt *o, + const struct ibdiag_opt *map[]) +{ + l->name = o->name; + l->has_arg = o->has_arg; + l->flag = NULL; + l->val = o->letter; + if (!map[l->val]) + map[l->val] = o; +} + +static struct option *make_long_opts(const char *exclude_str, + const struct ibdiag_opt *custom_opts, + const struct ibdiag_opt *map[]) +{ + struct option *long_opts, *l; + const struct ibdiag_opt *o; + unsigned n = 0; + + if (custom_opts) + for (o = custom_opts; o->name; o++) + n++; + + long_opts = malloc((sizeof(common_opts)/sizeof(common_opts[0]) + n) * + sizeof(*long_opts)); + if (!long_opts) + return NULL; + + l = long_opts; + + if (custom_opts) + for (o = custom_opts; o->name; o++) + make_opt(l++, o, map); + + for (o = common_opts; o->name; o++) { + if (exclude_str && strchr(exclude_str, o->letter)) + continue; + make_opt(l++, o, map); + } + + memset(l, 0, sizeof(*l)); + + return long_opts; +} + +static void make_str_opts(const struct option *o, char *p, unsigned size) +{ + int i, n = 0; + + for (n = 0; o->name && n + 2 + o->has_arg < size; o++) { + p[n++] = o->val; + for (i = 0; i < o->has_arg; i++) + p[n++] = ':'; + } + p[n] = '\0'; +} + +int ibdiag_process_opts(int argc, char * const argv[], void *cxt, + const char *exclude_common_str, + const struct ibdiag_opt custom_opts[], + int (*custom_handler)(void *cxt, int val, char *optarg), + const char *usage_args, const char *usage_examples[]) +{ + char str_opts[1024]; + const struct ibdiag_opt *o; + + memset(opts_map, 0, 
sizeof(opts_map)); + + prog_name = argv[0]; + prog_args = usage_args; + prog_examples = usage_examples; + + long_opts = make_long_opts(exclude_common_str, custom_opts, opts_map); + if (!long_opts) + return -1; + + make_str_opts(long_opts, str_opts, sizeof(str_opts)); + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + o = opts_map[ch]; + if (!o) + ibdiag_show_usage(); + if (custom_handler) { + if (custom_handler(cxt, ch, optarg) && + process_opt(ch, optarg)) + ibdiag_show_usage(); + } else if (process_opt(ch, optarg)) + ibdiag_show_usage(); + } + + free(long_opts); + + return 0; +} extern char *argv0; -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Sun Jan 25 03:52:49 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 Jan 2009 13:52:49 +0200 Subject: [ofa-general] [PATCH] infiniband-diags: using common command line option processing In-Reply-To: <20090125115136.GA20419@sashak.voltaire.com> References: <20090125115136.GA20419@sashak.voltaire.com> Message-ID: <20090125115249.GB20419@sashak.voltaire.com> This converts infiniband-diags tools to use the common command line processing framework - it unifies the usages, options and removes a lot of duplications. 
The tools' functionality is preserved (there should be no problems with backward compatibility); however, the usage message has changed for many tools and now looks like: Usage: ibaddr [options] [] Options: --gid_show, -g show gid address only --lid_show, -l show lid range only --Lid_show, -L show lid range (in decimal) only --Ca, -C Ca name to use --Port, -P Ca port number to use --Direct, -D use Direct address argument --Guid, -G use GUID address argument --timeout, -t timeout in ms --sm_port, -s SM port lid --errors, -e show send and receive errors --verbose, -v increase verbosity level --debug, -d raise debug level --usage, -u usage message --help, -h help message --version, -V show version Examples: ibaddr # local port's address ibaddr 32 # show lid range and gid of lid 32 ibaddr -G 0x8f1040023 # same but using guid address ibaddr -l 32 # show lid range only ibaddr -L 32 # show decimal lid range only ibaddr -g 32 # show gid address only Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/ibaddr.c | 133 ++++--------- infiniband-diags/src/ibnetdiscover.c | 151 +++++--------- infiniband-diags/src/ibping.c | 147 ++++---------- infiniband-diags/src/ibportstate.c | 117 ++--------- infiniband-diags/src/ibroute.c | 159 +++++---------- infiniband-diags/src/ibsendtrap.c | 66 ++----- infiniband-diags/src/ibstat.c | 78 +++----- infiniband-diags/src/ibsysstat.c | 124 +++--------- infiniband-diags/src/ibtracert.c | 173 +++++----------- infiniband-diags/src/perfquery.c | 178 ++++++----------- infiniband-diags/src/saquery.c | 378 +++++++++++++++------------------- infiniband-diags/src/sminfo.c | 123 ++++-------- infiniband-diags/src/smpdump.c | 109 ++++------ infiniband-diags/src/smpquery.c | 152 ++++---------- infiniband-diags/src/vendstat.c | 118 +++-------- 15 files changed, 706 insertions(+), 1500 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index 688972b..4890da3 100644 --- a/infiniband-diags/src/ibaddr.c +++
b/infiniband-diags/src/ibaddr.c @@ -86,108 +86,53 @@ ib_resolve_addr(ib_portid_t *portid, int portnum, int show_lid, int show_gid) return 0; } -static void -usage(void) +static int show_lid, show_gid; + +static int process_opt(void *context, int ch, char *optarg) { - char *basename; - - if (!(basename = strrchr(argv0, '/'))) - basename = argv0; - else - basename++; - - fprintf(stderr, "Usage: %s [-d(ebug) -D(irect) -G(uid) -l(id_show) -g(id_show) -s(m_port) sm_lid -C ca_name -P ca_port " - "-t(imeout) timeout_ms -V(ersion) -h(elp)] []\n", - basename); - fprintf(stderr, "\tExamples:\n"); - fprintf(stderr, "\t\t%s\t\t\t# local port's address\n", basename); - fprintf(stderr, "\t\t%s 32\t\t# show lid range and gid of lid 32\n", basename); - fprintf(stderr, "\t\t%s -G 0x8f1040023\t# same but using guid address\n", basename); - fprintf(stderr, "\t\t%s -l 32\t\t# show lid range only\n", basename); - fprintf(stderr, "\t\t%s -L 32\t\t# show decimal lid range only\n", basename); - fprintf(stderr, "\t\t%s -g 32\t\t# show gid address only\n", basename); - exit(-1); + switch (ch) { + case 'g': + show_gid = 1; + break; + case 'l': + show_lid++; + break; + case 'L': + show_lid = -100; + break; + default: + return -1; + } + return 0; } -int -main(int argc, char **argv) +int main(int argc, char **argv) { int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; - ib_portid_t *sm_id = 0, sm_portid = {0}; ib_portid_t portid = {0}; - int dest_type = IB_DEST_LID; - int timeout = 0; /* use default */ - int show_lid = 0, show_gid = 0; int port = 0; - char *ca = 0; - int ca_port = 0; - - static char const str_opts[] = "C:P:t:s:dDGglLVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "debug", 0, 0, 'd'}, - { "Direct", 0, 0, 'D'}, - { "Guid", 0, 0, 'G'}, - { "gid_show", 0, 0, 'g'}, - { "lid_show", 0, 0, 'l'}, - { "Lid_show", 0, 0, 'L'}, - { "timeout", 1, 0, 't'}, - { "sm_port", 1, 0, 's'}, - { "Version", 0, 0, 'V'}, - { "help", 0, 
0, 'h'}, - { "usage", 0, 0, 'u'}, - { } + + const struct ibdiag_opt opts[] = { + { "gid_show", 'g', 0, NULL, "show gid address only"}, + { "lid_show", 'l', 0, NULL, "show lid range only"}, + { "Lid_show", 'L', 0, NULL, "show lid range (in decimal) only"}, + {} + }; + char usage_args[] = "[]"; + const char *usage_examples[] = { + "\t\t# local port's address", + "32\t\t# show lid range and gid of lid 32", + "-G 0x8f1040023\t# same but using guid address", + "-l 32\t\t# show lid range only", + "-L 32\t\t# show decimal lid range only", + "-g 32\t\t# show gid address only", + NULL }; - argv0 = argv[0]; + ibdiag_process_opts(argc, argv, NULL, "L", opts, process_opt, + usage_args, usage_examples); - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - break; - switch(ch) { - case 'C': - ca = optarg; - break; - case 'P': - ca_port = strtoul(optarg, 0, 0); - break; - case 'd': - ibdebug++; - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'g': - show_gid++; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'l': - show_lid++; - break; - case 'L': - show_lid = -100; - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", optarg); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(optarg, 0, 0); - madrpc_set_timeout(timeout); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - break; - } - } + argv0 = argv[0]; argc -= optind; argv += optind; @@ -197,10 +142,10 @@ main(int argc, char **argv) if (!show_lid && !show_gid) show_lid = show_gid = 1; - madrpc_init(ca, ca_port, mgmt_classes, 3); + madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0) + if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) IBERROR("can't resolve destination port %s", argv[0]); 
} else { if (ib_resolve_self(&portid, &port, 0) < 0) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 296cb07..04a250f 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -85,7 +85,6 @@ static char *linkspeed_str[] = { static int timeout = 2000; /* ms */ static int dumplevel = 0; -static int verbose; static FILE *f; char *argv0 = "ibnetdiscover"; @@ -919,119 +918,79 @@ void dump_ports_report () } } -void -usage(void) +static int list, group, ports_report; + +static int process_opt(void *context, int ch, char *optarg) { - fprintf(stderr, "Usage: %s [-d(ebug)] -e(rr_show) -v(erbose) -s(how) -l(ist) -g(rouping) -H(ca_list) -S(witch_list) -R(outer_list) -V(ersion) -C ca_name -P ca_port " - "-t(imeout) timeout_ms --node-name-map node-name-map] -p(orts) []\n", - argv0); - fprintf(stderr, " --node-name-map specify a node name map file\n"); - exit(-1); + switch (ch) { + case 1: + node_name_map_file = strdup(optarg); + break; + case 's': + dumplevel = 1; + break; + case 'l': + list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; + break; + case 'g': + group = 1; + break; + case 'S': + list = LIST_SWITCH_NODE; + break; + case 'H': + list = LIST_CA_NODE; + break; + case 'R': + list = LIST_ROUTER_NODE; + break; + case 'p': + ports_report = 1; + break; + default: + return -1; + } + + return 0; } -int -main(int argc, char **argv) +int main(int argc, char **argv) { int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; ib_portid_t my_portid = {0}; - int udebug = 0, list = 0; - char *ca = 0; - int ca_port = 0; - int group = 0; - int ports_report = 0; - - static char const str_opts[] = "C:P:t:devslgHSRpVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "debug", 0, 0, 'd'}, - { "err_show", 0, 0, 'e'}, - { "verbose", 0, 0, 'v'}, - { "show", 0, 0, 's'}, - { "list", 0, 0, 'l'}, - { "grouping", 0, 0, 'g'}, - { "Hca_list", 0, 0, 'H'}, - { 
"Switch_list", 0, 0, 'S'}, - { "Router_list", 0, 0, 'R'}, - { "timeout", 1, 0, 't'}, - { "node-name-map", 1, 0, 1}, - { "ports", 0, 0, 'p'}, - { "Version", 0, 0, 'V'}, - { "help", 0, 0, 'h'}, - { "usage", 0, 0, 'u'}, + + const struct ibdiag_opt opts[] = { + { "show", 's', 0, NULL, "show more information" }, + { "list", 'l', 0, NULL, "list of connected nodes" }, + { "grouping", 'g', 0, NULL, "show grouping" }, + { "Hca_list", 'H', 0, NULL, "list of connected CAs" }, + { "Switch_list", 'S', 0, NULL, "list of connected switches" }, + { "Router_list", 'R', 0, NULL, "list of connected routers" }, + { "node-name-map", 1, 1, "", "node name map file" }, + { "ports", 'p', 0, NULL, "obtain a ports report" }, { } }; + char usage_args[] = "[topology-file]"; + + ibdiag_process_opts(argc, argv, NULL, "sGDL", opts, process_opt, + usage_args, NULL); f = stdout; argv0 = argv[0]; - - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - break; - switch(ch) { - case 1: - node_name_map_file = strdup(optarg); - break; - case 'C': - ca = optarg; - break; - case 'P': - ca_port = strtoul(optarg, 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 't': - timeout = strtoul(optarg, 0, 0); - break; - case 'v': - verbose++; - dumplevel++; - break; - case 's': - dumplevel = 1; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'l': - list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; - break; - case 'g': - group = 1; - break; - case 'S': - list = LIST_SWITCH_NODE; - break; - case 'H': - list = LIST_CA_NODE; - break; - case 'R': - list = LIST_ROUTER_NODE; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - case 'p': - ports_report = 1; - break; - default: - usage(); - break; - } - } argc -= optind; argv += optind; + if (ibd_timeout) + timeout = ibd_timeout; + + if (ibverbose) + dumplevel = 1; + if (argc && !(f = fopen(argv[0], 
"w"))) IBERROR("can't open file %s for writing", argv[0]); - madrpc_init(ca, ca_port, mgmt_classes, 2); + madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2); node_name_map = open_node_name_map(node_name_map_file); if (discover(&my_portid) < 0) diff --git a/infiniband-diags/src/ibping.c b/infiniband-diags/src/ibping.c index 7e51210..241d0ad 100644 --- a/infiniband-diags/src/ibping.c +++ b/infiniband-diags/src/ibping.c @@ -50,11 +50,6 @@ #include "ibdiag_common.h" -#undef DEBUG -#define DEBUG if (verbose) IBWARN - -static int dest_type = IB_DEST_LID; -static int verbose; static char host_and_domain[IB_VENDOR_RANGE2_DATA_SIZE]; static char last_host[IB_VENDOR_RANGE2_DATA_SIZE]; @@ -152,28 +147,11 @@ ibping(ib_portid_t *portid, int quiet) return rtt; } -static void -usage(void) -{ - char *basename; - - if (!(basename = strrchr(argv0, '/'))) - basename = argv0; - else - basename++; - - fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -G(uid) -s smlid -V(ersion) -C ca_name -P ca_port " - "-t(imeout) timeout_ms -c ping_count -f(lood) -o oui -S(erver)] \n", - basename); - exit(-1); -} - static uint64_t minrtt = ~0ull, maxrtt, total_rtt; static uint64_t start, total_time, replied, lost, ntrans; static ib_portid_t portid = {0}; -void -report(int sig) +void report(int sig) { total_time = getcurrenttime() - start; @@ -193,104 +171,57 @@ report(int sig) exit(0); } -int -main(int argc, char **argv) +static int server = 0, flood = 0, oui = IB_OPENIB_OUI; +static unsigned count = ~0; + +static int process_opt(void *context, int ch, char *optarg) +{ + switch (ch) { + case 'c': + count = strtoul(optarg, 0, 0); + break; + case 'f': + flood++; + break; + case 'o': + oui = strtoul(optarg, 0, 0); + break; + case 'S': + server++; + break; + default: + return -1; + } + return 0; +} + +int main(int argc, char **argv) { int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; int ping_class = IB_VENDOR_OPENIB_PING_CLASS; - ib_portid_t *sm_id = 0, sm_portid = {0}; - 
int timeout = 0, udebug = 0, server = 0, flood = 0; - int oui = IB_OPENIB_OUI; uint64_t rtt; - unsigned count = ~0; char *err; - char *ca = 0; - int ca_port = 0; - - static char const str_opts[] = "C:P:t:s:c:o:devGfSVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "debug", 0, 0, 'd'}, - { "err_show", 0, 0, 'e'}, - { "verbose", 0, 0, 'v'}, - { "Guid", 0, 0, 'G'}, - { "s", 1, 0, 's'}, - { "timeout", 1, 0, 't'}, - { "c", 1, 0, 'c'}, - { "flood", 0, 0, 'f'}, - { "o", 1, 0, 'o'}, - { "Server", 0, 0, 'S'}, - { "Version", 0, 0, 'V'}, - { "help", 0, 0, 'h'}, - { "usage", 0, 0, 'u'}, + + const struct ibdiag_opt opts[] = { + { "count", 'c', 1, "", "stop after count packets" }, + { "flood", 'f', 0, NULL, "flood destination" }, + { "oui", 'o', 1, NULL, "use specified OUI number" }, + { "Server", 'S', 0, NULL, "start in server mode" }, { } }; + char usage_args[] = ""; - argv0 = argv[0]; + ibdiag_process_opts(argc, argv, NULL, "D", opts, process_opt, + usage_args, NULL); - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - break; - switch(ch) { - case 'C': - ca = optarg; - break; - case 'P': - ca_port = strtoul(optarg, 0, 0); - break; - case 'c': - count = strtoul(optarg, 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'f': - flood++; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'o': - oui = strtoul(optarg, 0, 0); - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", optarg); - sm_id = &sm_portid; - break; - case 'S': - server++; - break; - case 't': - timeout = strtoul(optarg, 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - break; - } 
-	}
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
 	if (!argc && !server)
-		usage();
+		ibdiag_show_usage();
-	madrpc_init(ca, ca_port, mgmt_classes, 3);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (server) {
 		if (mad_register_server(ping_class, 0, 0, oui) < 0)
@@ -306,7 +237,7 @@ main(int argc, char **argv)
 	if (mad_register_client(ping_class, 0) < 0)
 		IBERROR("can't register ping class %d on this port", ping_class);
-	if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
+	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 	signal(SIGINT, report);
diff --git a/infiniband-diags/src/ibportstate.c b/infiniband-diags/src/ibportstate.c
index 89675ed..c2e4028 100644
--- a/infiniband-diags/src/ibportstate.c
+++ b/infiniband-diags/src/ibportstate.c
@@ -48,12 +48,6 @@
 #include "ibdiag_common.h"
-#undef DEBUG
-#define DEBUG if (verbose>1) IBWARN
-
-static int dest_type = IB_DEST_LID;
-static int verbose;
-
 char *argv0 = "ibportstate";
 /*******************************************/
@@ -195,39 +189,11 @@ validate_speed(int speed, int peerspeed, int lsa)
 	}
 }
-void
-usage(void)
-{
-	char *basename;
-
-	if (!(basename = strrchr(argv0, '/')))
-		basename = argv0;
-	else
-		basename++;
-
-	fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -D(irect) -G(uid) -s smlid -V(ersion) -C ca_name -P ca_port "
-			"-t(imeout) timeout_ms] []\n",
-			basename);
-	fprintf(stderr, "\tsupported ops: enable, disable, reset, speed, query\n");
-	fprintf(stderr, "\n\texamples:\n");
-	fprintf(stderr, "\t\t%s 3 1 disable\t\t\t# by lid\n", basename);
-	fprintf(stderr, "\t\t%s -G 0x2C9000100D051 1 enable\t# by guid\n", basename);
-	fprintf(stderr, "\t\t%s -D 0 1\t\t\t# (query) by direct route\n", basename);
-	fprintf(stderr, "\t\t%s 3 1 reset\t\t\t# by lid\n", basename);
-	fprintf(stderr, "\t\t%s 3 1 speed 1\t\t\t# by lid\n", basename);
-	exit(-1);
-}
-
-int
-main(int argc, char **argv)
+int main(int argc, char **argv)
 {
 	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
 	ib_portid_t portid = {0};
-	ib_portid_t *sm_id = 0, sm_portid = {0};
 	int err;
-	int timeout = 0, udebug = 0;
-	char *ca = 0;
-	int ca_port = 0;
 	int port_op = 0; /* default to query */
 	int speed = 15;
 	int is_switch = 1;
@@ -240,80 +206,31 @@ main(int argc, char **argv)
 	ib_portid_t selfportid = {0};
 	int selfport = 0;
-	static char const str_opts[] = "C:P:t:s:devDGVhu";
-	static const struct option long_opts[] = {
-		{ "C", 1, 0, 'C'},
-		{ "P", 1, 0, 'P'},
-		{ "debug", 0, 0, 'd'},
-		{ "err_show", 0, 0, 'e'},
-		{ "verbose", 0, 0, 'v'},
-		{ "Direct", 0, 0, 'D'},
-		{ "Guid", 0, 0, 'G'},
-		{ "timeout", 1, 0, 't'},
-		{ "s", 1, 0, 's'},
-		{ "Version", 0, 0, 'V'},
-		{ "help", 0, 0, 'h'},
-		{ "usage", 0, 0, 'u'},
-		{ }
+	char usage_args[] = " []\n"
+		"\nSupported ops: enable, disable, reset, speed, query";
+	const char *usage_examples[] = {
+		"3 1 disable\t\t\t# by lid",
+		"-G 0x2C9000100D051 1 enable\t# by guid",
+		"-D 0 1\t\t\t# (query) by direct route",
+		"3 1 reset\t\t\t# by lid",
+		"3 1 speed 1\t\t\t# by lid",
+		NULL
 	};
-	argv0 = argv[0];
-	while (1) {
-		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
-		if ( ch == -1 )
-			break;
-		switch(ch) {
-		case 'd':
-			ibdebug++;
-			madrpc_show_errors(1);
-			umad_debug(udebug);
-			udebug++;
-			break;
-		case 'e':
-			madrpc_show_errors(1);
-			break;
-		case 'D':
-			dest_type = IB_DEST_DRPATH;
-			break;
-		case 'G':
-			dest_type = IB_DEST_GUID;
-			break;
-		case 'C':
-			ca = optarg;
-			break;
-		case 'P':
-			ca_port = strtoul(optarg, 0, 0);
-			break;
-		case 's':
-			if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0)
-				IBERROR("can't resolve SM destination port %s", optarg);
-			sm_id = &sm_portid;
-			break;
-		case 't':
-			timeout = strtoul(optarg, 0, 0);
-			madrpc_set_timeout(timeout);
-			break;
-		case 'v':
-			verbose++;
-			break;
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
-			exit(-1);
-		default:
-			usage();
-			break;
-		}
-	}
+	ibdiag_process_opts(argc, argv, NULL, NULL, NULL, NULL,
+			    usage_args, usage_examples);
+
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
 	if (argc < 2)
-		usage();
+		ibdiag_show_usage();
-	madrpc_init(ca, ca_port, mgmt_classes, 3);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
-	if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
+	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 	/* First, make sure it is a switch port if it is a "set" */
diff --git a/infiniband-diags/src/ibroute.c b/infiniband-diags/src/ibroute.c
index 921b5dd..bcc8d99 100644
--- a/infiniband-diags/src/ibroute.c
+++ b/infiniband-diags/src/ibroute.c
@@ -52,10 +52,7 @@
 #include "ibdiag_common.h"
-static int dest_type = IB_DEST_LID;
-static int brief;
-static int verbose;
-static int dump_all;
+static int brief, dump_all, multicast;
 char *argv0 = "ibroute";
@@ -191,7 +188,7 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid)
 		printf(" Ports: %s\n", str);
 		printf(" MLid\n");
 	}
-	if (verbose)
+	if (ibverbose)
 		printf("Switch muticast mlids capability is 0x%d\n", cap);
 	chunks = ALIGN(nports + 1, 16) / 16;
@@ -349,138 +346,76 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 	return 0;
 }
-void
-usage(void)
+static int process_opt(void *context, int ch, char *optarg)
 {
-	char *basename;
-
-	if (!(basename = strrchr(argv0, '/')))
-		basename = argv0;
-	else
-		basename++;
-
-	fprintf(stderr, "Usage: %s [-d(ebug)] -a(ll) -n(o_dests) -v(erbose) -D(irect) -G(uid) -M(ulticast) -s smlid -V(ersion) -C ca_name -P ca_port "
-			"-t(imeout) timeout_ms] [ [ []]]\n",
-			basename);
-	fprintf(stderr, "\n\tUnicast examples:\n");
-	fprintf(stderr, "\t\t%s 4\t# dump all lids with valid out ports of switch with lid 4\n", basename);
-	fprintf(stderr, "\t\t%s -a 4\t# same, but dump all lids, even with invalid out ports\n", basename);
-	fprintf(stderr, "\t\t%s -n 4\t# simple dump format - no destination resolving\n", basename);
-	fprintf(stderr, "\t\t%s 4 10\t# dump lids starting from 10\n", basename);
-	fprintf(stderr, "\t\t%s 4 0x10 0x20\t# dump lid range\n", basename);
-	fprintf(stderr, "\t\t%s -G 0x08f1040023\t# resolve switch by GUID\n", basename);
-	fprintf(stderr, "\t\t%s -D 0,1\t# resolve switch by direct path\n", basename);
-
-	fprintf(stderr, "\n\tMulticast examples:\n");
-	fprintf(stderr, "\t\t%s -M 4\t# dump all non empty mlids of switch with lid 4\n", basename);
-	fprintf(stderr, "\t\t%s -M 4 0xc010 0xc020\t# same, but with range\n", basename);
-	fprintf(stderr, "\t\t%s -M -n 4\t# simple dump format\n", basename);
-	exit(-1);
+	switch (ch) {
+	case 'a':
+		dump_all++;
+		break;
+	case 'M':
+		multicast++;
+		break;
+	case 'n':
+		brief++;
+		break;
+	default:
+		return -1;
+	}
+	return 0;
 }
-int
-main(int argc, char **argv)
+int main(int argc, char **argv)
 {
 	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
 	ib_portid_t portid = {0};
-	ib_portid_t *sm_id = 0, sm_portid = {0};
-	int timeout;
-	int multicast = 0, startlid = 0, endlid = 0;
+	int startlid = 0, endlid = 0;
 	char *err;
-	char *ca = 0;
-	int ca_port = 0;
-
-	static char const str_opts[] = "C:P:t:s:danvDGMVhu";
-	static const struct option long_opts[] = {
-		{ "C", 1, 0, 'C'},
-		{ "P", 1, 0, 'P'},
-		{ "debug", 0, 0, 'd'},
-		{ "all", 0, 0, 'a'},
-		{ "no_dests", 0, 0, 'n'},
-		{ "verbose", 0, 0, 'v'},
-		{ "Direct", 0, 0, 'D'},
-		{ "Guid", 0, 0, 'G'},
-		{ "Multicast", 0, 0, 'M'},
-		{ "timeout", 1, 0, 't'},
-		{ "s", 1, 0, 's'},
-		{ "Version", 0, 0, 'V'},
-		{ "help", 0, 0, 'h'},
-		{ "usage", 0, 0, 'u'},
+
+	const struct ibdiag_opt opts[] = {
+		{ "all", 'a', 0, NULL, "show all lids, even invalid entries" },
+		{ "no_dests", 'n', 0, NULL, "do not try to resolve destinations" },
+		{ "Multicast", 'M', 0, NULL, "show multicast forwarding tables" },
 		{ }
 	};
+	char usage_args[] = "[ [ []]]";
+	const char *usage_examples[] = {
+		" -- Unicast examples:",
+		"4\t# dump all lids with valid out ports of switch with lid 4",
+		"-a 4\t# same, but dump all lids, even with invalid out ports",
+		"-n 4\t# simple dump format - no destination resolving",
+		"4 10\t# dump lids starting from 10",
+		"4 0x10 0x20\t# dump lid range",
+		"-G 0x08f1040023\t# resolve switch by GUID",
+		"-D 0,1\t# resolve switch by direct path",
+		" -- Multicast examples:",
+		"-M 4\t# dump all non empty mlids of switch with lid 4",
+		"-M 4 0xc010 0xc020\t# same, but with range",
+		"-M -n 4\t# simple dump format",
+		NULL,
+	};
-	argv0 = argv[0];
+	ibdiag_process_opts(argc, argv, NULL, NULL, opts, process_opt,
+			    usage_args, usage_examples);
-	while (1) {
-		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
-		if ( ch == -1 )
-			break;
-		switch(ch) {
-		case 'C':
-			ca = optarg;
-			break;
-		case 'P':
-			ca_port = strtoul(optarg, 0, 0);
-			break;
-		case 'a':
-			dump_all++;
-			break;
-		case 'd':
-			ibdebug++;
-			break;
-		case 'D':
-			dest_type = IB_DEST_DRPATH;
-			break;
-		case 'G':
-			dest_type = IB_DEST_GUID;
-			break;
-		case 'M':
-			multicast++;
-			break;
-		case 'n':
-			brief++;
-			break;
-		case 's':
-			if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0)
-				IBERROR("can't resolve SM destination port %s", optarg);
-			sm_id = &sm_portid;
-			break;
-		case 't':
-			timeout = strtoul(optarg, 0, 0);
-			madrpc_set_timeout(timeout);
-			break;
-		case 'v':
-			madrpc_show_errors(1);
-			verbose++;
-			break;
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
-			exit(-1);
-		default:
-			usage();
-			break;
-		}
-	}
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
 	if (!argc)
-		usage();
+		ibdiag_show_usage();
 	if (argc > 1)
 		startlid = strtoul(argv[1], 0, 0);
 	if (argc > 2)
 		endlid = strtoul(argv[2], 0, 0);
-	madrpc_init(ca, ca_port, mgmt_classes, 3);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (!argc) {
 		if (ib_resolve_self(&portid, 0, 0) < 0)
 			IBERROR("can't resolve self addr");
-	} else {
-		if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
-			IBERROR("can't resolve destination port %s", argv[1]);
-	}
+	} else if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
+		IBERROR("can't resolve destination port %s", argv[1]);
 	if (multicast)
 		err = dump_multicast_tables(&portid, startlid, endlid);
diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c
index 66620de..a733a2c 100644
--- a/infiniband-diags/src/ibsendtrap.c
+++ b/infiniband-diags/src/ibsendtrap.c
@@ -47,7 +47,7 @@
 #include "ibdiag_common.h"
-char *argv0 = "";
+char *argv0 = "ibsendtrap";
 static int send_144_node_desc_update(void)
 {
@@ -95,23 +95,6 @@ trap_def_t traps[2] = {
 	{NULL, NULL}
 };
-static void usage(void)
-{
-	int i;
-
-	fprintf(stderr, "Usage: %s [-hV]"
-		" [-C ] [-P ] []\n", argv0);
-	fprintf(stderr, " -V print version\n");
-	fprintf(stderr, " can be one of the following\n");
-	for (i = 0; traps[i].trap_name; i++) {
-		fprintf(stderr, " %s\n", traps[i].trap_name);
-	}
-	fprintf(stderr, " default behavior is to send \"%s\"\n",
-		traps[0].trap_name);
-
-	exit(-1);
-}
-
 int send_trap(char *trap_name)
 {
 	int i;
@@ -121,45 +104,32 @@ int send_trap(char *trap_name)
 			return (traps[i].send_func());
 		}
 	}
-	usage();
+	ibdiag_show_usage();
 	exit(1);
 }
 int main(int argc, char **argv)
 {
+	char usage_args[1024];
 	int mgmt_classes[2] = { IB_SMI_CLASS, IB_SMI_DIRECT_CLASS };
-	int ch = 0;
 	char *trap_name = NULL;
-	char *ca = NULL;
-	int ca_port = 0;
-
-	static char const str_opts[] = "hVP:C:";
-	static const struct option long_opts[] = {
-		{"Version", 0, 0, 'V'},
-		{"P", 1, 0, 'P'},
-		{"C", 1, 0, 'C'},
-		{"help", 0, 0, 'h'},
-		{}
-	};
+	int i, n;
-	argv0 = argv[0];
-
-	while ((ch = getopt_long(argc, argv, str_opts, long_opts, NULL)) != -1) {
-		switch (ch) {
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version());
+	n = sprintf(usage_args, "[]\n"
+		    "\nArgument can be one of the following:\n");
+	for (i = 0; traps[i].trap_name; i++) {
+		n += snprintf(usage_args + n, sizeof(usage_args) - n,
+			      " %s\n", traps[i].trap_name);
+		if (n >= sizeof(usage_args))
 			exit(-1);
-		case 'C':
-			ca = optarg;
-			break;
-		case 'P':
-			ca_port = strtoul(optarg, NULL, 0);
-			break;
-		case 'h':
-		default:
-			usage();
-		}
 	}
+	snprintf(usage_args + n, sizeof(usage_args) - n,
+		 "\n default behavior is to send \"%s\"", traps[0].trap_name);
+
+	ibdiag_process_opts(argc, argv, NULL, "DLG", NULL, NULL,
+			    usage_args, NULL);
+
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
@@ -170,7 +140,7 @@ int main(int argc, char **argv)
 	}
 	madrpc_show_errors(1);
-	madrpc_init(ca, ca_port, mgmt_classes, 2);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 2);
 	return (send_trap(trap_name));
 }
diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 6bd3c8a..c4f965e 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -49,8 +49,6 @@
 #include
-static int debug;
-
 char *argv0 = "ibstat";
 static char *node_type_str[] = {
@@ -174,63 +172,49 @@ ports_list(char names[][UMAD_CA_NAME_LEN], int n)
 	return found;
 }
-void
-usage(void)
+static int list_only, short_format, list_ports;
+
+static int process_opt(void *context, int ch, char *optarg)
 {
-	fprintf(stderr, "Usage: %s [-d(ebug) -l(ist_of_cas) -s(hort) -p(ort_list) -V(ersion)] [portnum]\n", argv0);
-	fprintf(stderr, "\tExamples:\n");
-	fprintf(stderr, "\t\t%s -l # list all IB devices\n", argv0);
-	fprintf(stderr, "\t\t%s mthca0 2 # stat port 2 of 'mthca0'\n", argv0);
-	exit(-1);
+	switch (ch) {
+	case 'l':
+		list_only++;
+		break;
+	case 's':
+		short_format++;
+		break;
+	case 'p':
+		list_ports++;
+		break;
+	default:
+		return -1;
+	}
+	return 0;
 }
-int
-main(int argc, char *argv[])
+int main(int argc, char *argv[])
 {
 	char names[UMAD_MAX_DEVICES][UMAD_CA_NAME_LEN];
 	int dev_port = -1;
-	int list_only = 0, short_format = 0, list_ports = 0;
 	int n, i;
-	static char const str_opts[] = "dlspVhu";
-	static const struct option long_opts[] = {
-		{ "debug", 0, 0, 'd'},
-		{ "list_of_cas", 0, 0, 'l'},
-		{ "short", 0, 0, 's'},
-		{ "port_list", 0, 0, 'p'},
-		{ "Version", 0, 0, 'V'},
-		{ "help", 0, 0, 'h'},
-		{ "usage", 0, 0, 'u'},
+	const struct ibdiag_opt opts[] = {
+		{ "list_of_cas", 'l', 0, NULL, "list all IB devices" },
+		{ "short", 's', 0, NULL, "short output" },
+		{ "port_list", 'p', 0, NULL, "show port list" },
 		{ }
 	};
+	char usage_args[] = " [portnum]";
+	const char *usage_examples[] = {
+		"-l # list all IB devices",
+		"mthca0 2 # stat port 2 of 'mthca0'",
+		NULL
+	};
-	argv0 = argv[0];
+	ibdiag_process_opts(argc, argv, NULL, "sDGLCPte", opts, process_opt,
+			    usage_args, usage_examples);
-	while (1) {
-		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
-		if ( ch == -1 )
-			break;
-		switch(ch) {
-		case 'd':
-			debug++;
-			break;
-		case 'l':
-			list_only++;
-			break;
-		case 's':
-			short_format++;
-			break;
-		case 'p':
-			list_ports++;
-			break;
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
-			exit(-1);
-		default:
-			usage();
-			break;
-		}
-	}
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index 29d6327..792e8f4 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -48,12 +48,6 @@
 #include "ibdiag_common.h"
-#undef DEBUG
-#define DEBUG if (verbose) IBWARN
-
-static int dest_type = IB_DEST_LID;
-static int verbose;
-
 #define MAX_CPUS 8
 enum ib_sysstat_attr_t {
@@ -219,114 +213,52 @@ build_cpuinfo(void)
 	return ncpu;
 }
-static void
-usage(void)
-{
-	char *basename;
-
-	if (!(basename = strrchr(argv0, '/')))
-		basename = argv0;
-	else
-		basename++;
+static int server = 0, oui = IB_OPENIB_OUI;
-	fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -G(uid) -s smlid -V(ersion) -C ca_name -P ca_port "
-			"-t(imeout) timeout_ms -o oui -S(erver)] []\n",
-			basename);
-	exit(-1);
+static int process_opt(void *context, int ch, char *optarg)
+{
+	switch (ch) {
+	case 'o':
+		oui = strtoul(optarg, 0, 0);
+		break;
+	case 'S':
+		server++;
+		break;
+	default:
+		return -1;
+	}
+	return 0;
 }
-int
-main(int argc, char **argv)
+int main(int argc, char **argv)
 {
 	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
 	int sysstat_class = IB_VENDOR_OPENIB_SYSSTAT_CLASS;
 	ib_portid_t portid = {0};
-	ib_portid_t *sm_id = 0, sm_portid = {0};
-	int timeout = 0, udebug = 0, server = 0;
-	int oui = IB_OPENIB_OUI, attr = IB_PING_ATTR;
+	int attr = IB_PING_ATTR;
 	char *err;
-	char *ca = 0;
-	int ca_port = 0;
-
-	static char const str_opts[] = "C:P:t:s:o:devGSVhu";
-	static const struct option long_opts[] = {
-		{ "C", 1, 0, 'C'},
-		{ "P", 1, 0, 'P'},
-		{ "debug", 0, 0, 'd'},
-		{ "err_show", 0, 0, 'e'},
-		{ "verbose", 0, 0, 'v'},
-		{ "Guid", 0, 0, 'G'},
-		{ "timeout", 1, 0, 't'},
-		{ "s", 1, 0, 's'},
-		{ "o", 1, 0, 'o'},
-		{ "Server", 0, 0, 'S'},
-		{ "Version", 0, 0, 'V'},
-		{ "help", 0, 0, 'h'},
-		{ "usage", 0, 0, 'u'},
+
+	const struct ibdiag_opt opts[] = {
+		{ "oui", 'o', 1, NULL, "use specified OUI number" },
+		{ "Server", 'S', 0, NULL, "start in server mode" },
 		{ }
 	};
+	char usage_args[] = " []";
-	argv0 = argv[0];
+	ibdiag_process_opts(argc, argv, NULL, "D", opts, process_opt,
+			    usage_args, NULL);
-	while (1) {
-		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
-		if ( ch == -1 )
-			break;
-		switch(ch) {
-		case 'C':
-			ca = optarg;
-			break;
-		case 'P':
-			ca_port = strtoul(optarg, 0, 0);
-			break;
-		case 'd':
-			ibdebug++;
-			madrpc_show_errors(1);
-			umad_debug(udebug);
-			udebug++;
-			break;
-		case 'e':
-			madrpc_show_errors(1);
-			break;
-		case 'G':
-			dest_type = IB_DEST_GUID;
-			break;
-		case 'o':
-			oui = strtoul(optarg, 0, 0);
-			break;
-		case 's':
-			if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0)
-				IBERROR("can't resolve SM destination port %s", optarg);
-			sm_id = &sm_portid;
-			break;
-		case 'S':
-			server++;
-			break;
-		case 't':
-			timeout = strtoul(optarg, 0, 0);
-			madrpc_set_timeout(timeout);
-			break;
-		case 'v':
-			verbose++;
-			break;
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
-			exit(-1);
-		default:
-			usage();
-			break;
-		}
-	}
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
 	if (!argc && !server)
-		usage();
+		ibdiag_show_usage();
 	if (argc > 1 && (attr = match_attr(argv[1])) < 0)
-		usage();
+		ibdiag_show_usage();
-	madrpc_init(ca, ca_port, mgmt_classes, 3);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	if (server) {
 		if (mad_register_server(sysstat_class, 0, 0, oui) < 0)
@@ -342,7 +274,7 @@ main(int argc, char **argv)
 	if (mad_register_client(sysstat_class, 0) < 0)
 		IBERROR("can't register to sysstat class %d", sysstat_class);
-	if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
+	if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
 		IBERROR("can't resolve destination port %s", argv[0]);
 	if ((err = ibsystat(&portid, attr)))
diff --git a/infiniband-diags/src/ibtracert.c b/infiniband-diags/src/ibtracert.c
index 7a28940..5b9a210 100644
--- a/infiniband-diags/src/ibtracert.c
+++ b/infiniband-diags/src/ibtracert.c
@@ -63,7 +63,6 @@ static char *node_type_str[] = {
 };
 static int timeout = 0; /* ms */
-static int verbose;
 static int force;
 static FILE *f;
@@ -220,7 +219,7 @@ dump_route(int dump, Node *node, int outport, Port *port)
 {
 	char *nodename = NULL;
-	if (!dump && !verbose)
+	if (!dump && !ibverbose)
 		return;
 	nodename = remap_node_name(node_name_map, node->nodeguid, node->nodedesc);
@@ -413,7 +412,7 @@ new_node(Node *node, Port *port, ib_portid_t *path, int dist)
 	if (port->portguid == target_portguid) {
 		node->dist = -1; /* tag as target */
 		link_port(port, node);
-		dump_endnode(verbose, "found target", node, port);
+		dump_endnode(ibverbose, "found target", node, port);
 		return 1; /* found; */
 	}
@@ -522,7 +521,7 @@ find_mcpath(ib_portid_t *from, int mlid)
 		path = &node->path;
 		VERBOSE("dist %d node %p", dist, node);
-		dump_endnode(verbose, "processing", node, node->ports);
+		dump_endnode(ibverbose, "processing", node, node->ports);
 		memset(map, 0, sizeof(map));
@@ -599,7 +598,7 @@ find_mcpath(ib_portid_t *from, int mlid)
 				return remotenode;
 			if (r == 0)
-				dump_endnode(verbose, "new remote",
+				dump_endnode(ibverbose, "new remote",
 					remotenode, remoteport);
 			else if (remotenode->type == IB_NODE_SWITCH)
 				dump_endnode(2, "ERR: circle discovered at",
@@ -686,140 +685,82 @@ static int resolve_lid(ib_portid_t *portid, const void *srcport)
 	return 0;
 }
-static void
-usage(void)
-{
-	char *basename;
+static int dumplevel = 2, multicast, mlid;
-	if (!(basename = strrchr(argv0, '/')))
-		basename = argv0;
-	else
-		basename++;
-
-	fprintf(stderr, "Usage: %s [-d(ebug) -v(erbose) -D(irect) -G(uids) -n(o_info) -C ca_name -P ca_port "
-			"-s smlid -t(imeout) timeout_ms -m mlid --node-name-map node-name-map ] \n",
-			basename);
-	fprintf(stderr, "\n\tUnicast examples:\n");
-	fprintf(stderr, "\t\t%s 4 16\t\t\t# show path between lids 4 and 16\n", basename);
-	fprintf(stderr, "\t\t%s -n 4 16\t\t# same, but using simple output format\n", basename);
-	fprintf(stderr, "\t\t%s -G 0x8f1040396522d 0x002c9000100d051\t# use guid addresses\n", basename);
-
-	fprintf(stderr, "\n\tMulticast example:\n");
-	fprintf(stderr, "\t\t%s -m 0xc000 4 16\t# show multicast path of mlid 0xc000 between lids 4 and 16\n", basename);
-	exit(-1);
+static int process_opt(void *context, int ch, char *optarg)
+{
+	switch (ch) {
+	case 1:
+		node_name_map_file = strdup(optarg);
+		break;
+	case 'm':
+		multicast++;
+		mlid = strtoul(optarg, 0, 0);
+		break;
+	case 'f':
+		force++;
+		break;
+	case 'n':
+		dumplevel = 1;
+		break;
+	default:
+		return -1;
+	}
+	return 0;
 }
-int
-main(int argc, char **argv)
+int main(int argc, char **argv)
 {
 	int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS};
 	ib_portid_t my_portid = {0};
 	ib_portid_t src_portid = {0};
 	ib_portid_t dest_portid = {0};
-	ib_portid_t *sm_id = 0, sm_portid = {0};
-	int dumplevel = 2, dest_type = IB_DEST_LID, multicast = 0, mlid = 0;
 	Node *endnode;
-	int udebug = 0;
-	char *ca = 0;
-	int ca_port = 0;
-
-	static char const str_opts[] = "C:P:t:s:m:dvfDGnVhu";
-	static const struct option long_opts[] = {
-		{ "C", 1, 0, 'C'},
-		{ "P", 1, 0, 'P'},
-		{ "debug", 0, 0, 'd'},
-		{ "verbose", 0, 0, 'v'},
-		{ "force", 0, 0, 'f'},
-		{ "Direct", 0, 0, 'D'},
-		{ "Guids", 0, 0, 'G'},
-		{ "no_info", 0, 0, 'n'},
-		{ "timeout", 1, 0, 't'},
-		{ "s", 1, 0, 's'},
-		{ "m", 1, 0, 'm'},
-		{ "Version", 0, 0, 'V'},
-		{ "help", 0, 0, 'h'},
-		{ "usage", 0, 0, 'u'},
-		{ "node-name-map", 1, 0, 1},
+
+	const struct ibdiag_opt opts[] = {
+		{ "force", 'f', 0, NULL, "force" },
+		{ "no_info", 'n', 0, NULL, "simple format" },
+		{ "mlid", 'm', 1, "", "multicast trace of the mlid" },
+		{ "node-name-map", 1, 1, "", "node name map file" },
 		{ }
 	};
+	char usage_args[] = " ";
+	const char *usage_examples[] = {
+		"- Unicast examples:",
+		"4 16\t\t\t# show path between lids 4 and 16",
+		"-n 4 16\t\t# same, but using simple output format",
+		"-G 0x8f1040396522d 0x002c9000100d051\t# use guid addresses",
+
+		" - Multicast examples:",
+		"-m 0xc000 4 16\t# show multicast path of mlid 0xc000 between lids 4 and 16",
+		NULL,
+	};
-	argv0 = argv[0];
-	f = stdout;
+	ibdiag_process_opts(argc, argv, NULL, NULL, opts, process_opt,
+			    usage_args, usage_examples);
-	while (1) {
-		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
-		if ( ch == -1 )
-			break;
-		switch(ch) {
-		case 1:
-			node_name_map_file = strdup(optarg);
-			break;
-		case 'C':
-			ca = optarg;
-			break;
-		case 'P':
-			ca_port = strtoul(optarg, 0, 0);
-			break;
-		case 'd':
-			ibdebug++;
-			madrpc_show_errors(1);
-			umad_debug(udebug);
-			udebug++;
-			break;
-		case 'D':
-			dest_type = IB_DEST_DRPATH;
-			break;
-		case 'G':
-			dest_type = IB_DEST_GUID;
-			break;
-		case 'm':
-			multicast++;
-			mlid = strtoul(optarg, 0, 0);
-			break;
-		case 'f':
-			force++;
-			break;
-		case 'n':
-			dumplevel = 1;
-			break;
-		case 's':
-			if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0)
-				IBERROR("can't resolve SM destination port %s", optarg);
-			sm_id = &sm_portid;
-			break;
-		case 't':
-			timeout = strtoul(optarg, 0, 0);
-			madrpc_set_timeout(timeout);
-			break;
-		case 'v':
-			madrpc_show_errors(1);
-			verbose++;
-			break;
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
-			exit(-1);
-		default:
-			usage();
-			break;
-		}
-	}
+	f = stdout;
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
 	if (argc < 2)
-		usage();
+		ibdiag_show_usage();
+
+	if (ibd_timeout)
+		timeout = ibd_timeout;
-	madrpc_init(ca, ca_port, mgmt_classes, 3);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3);
 	node_name_map = open_node_name_map(node_name_map_file);
-	if (ib_resolve_portid_str(&src_portid, argv[0], dest_type, sm_id) < 0)
+	if (ib_resolve_portid_str(&src_portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
 		IBERROR("can't resolve source port %s", argv[0]);
-	if (ib_resolve_portid_str(&dest_portid, argv[1], dest_type, sm_id) < 0)
+	if (ib_resolve_portid_str(&dest_portid, argv[1], ibd_dest_type, ibd_sm_id) < 0)
 		IBERROR("can't resolve destination port %s", argv[1]);
-	if (dest_type == IB_DEST_DRPATH) {
+	if (ibd_dest_type == IB_DEST_DRPATH) {
 		if (resolve_lid(&src_portid, NULL) < 0)
 			IBERROR("cannot resolve lid for port \'%s\'",
 				portid2str(&src_portid));
@@ -830,10 +771,10 @@ main(int argc, char **argv)
 	if (dest_portid.lid == 0 || src_portid.lid == 0) {
 		IBWARN("bad src/dest lid");
-		usage();
+		ibdiag_show_usage();
 	}
-	if (dest_type != IB_DEST_DRPATH) {
+	if (ibd_dest_type != IB_DEST_DRPATH) {
 		/* first find a direct path to the src port */
 		if (find_route(&my_portid, &src_portid, 0) < 0)
 			IBERROR("can't find a route to the src port");
diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 6662926..5bf15c5 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -92,34 +92,6 @@ char *argv0 = "perfquery";
 #define ALL_PORTS 0xFF
-static void
-usage(void)
-{
-	char *basename;
-
-	if (!(basename = strrchr(argv0, '/')))
-		basename = argv0;
-	else
-		basename++;
-
-	fprintf(stderr, "Usage: %s [-d(ebug) -G(uid) -a(ll_ports) -l(oop_ports) -r(eset_after_read) -C ca_name -P ca_port "
-			"-R(eset_only) -t(imeout) timeout_ms -V(ersion) -h(elp)] [ [[port] [reset_mask]]]\n",
-			basename);
-	fprintf(stderr, "\tExamples:\n");
-	fprintf(stderr, "\t\t%s\t\t# read local port's performance counters\n", basename);
-	fprintf(stderr, "\t\t%s 32 1\t\t# read performance counters from lid 32, port 1\n", basename);
-	fprintf(stderr, "\t\t%s -e 32 1\t# read extended performance counters from lid 32, port 1\n", basename);
-	fprintf(stderr, "\t\t%s -a 32\t\t# read performance counters from lid 32, all ports\n", basename);
-	fprintf(stderr, "\t\t%s -r 32 1\t# read performance counters and reset\n", basename);
-	fprintf(stderr, "\t\t%s -e -r 32 1\t# read extended performance counters and reset\n", basename);
-	fprintf(stderr, "\t\t%s -R 0x20 1\t# reset performance counters of port 1 only\n", basename);
-	fprintf(stderr, "\t\t%s -e -R 0x20 1\t# reset extended performance counters of port 1 only\n", basename);
-	fprintf(stderr, "\t\t%s -R -a 32\t# reset performance counters of all ports\n", basename);
-	fprintf(stderr, "\t\t%s -R 32 2 0x0fff\t# reset only error counters of port 2\n", basename);
-	fprintf(stderr, "\t\t%s -R 32 2 0xf000\t# reset only non-error counters of port 2\n", basename);
-	exit(-1);
-}
-
 /* Notes: IB semantics is to cap counters if count has exceeded limits.
  * Therefore we must check for overflows and cap the counters if necessary.
  *
@@ -338,104 +310,74 @@ static void reset_counters(int extended, int timeout, int mask, ib_portid_t *por
 	}
 }
-int
-main(int argc, char **argv)
+static int reset, reset_only, all_ports, loop_ports, port, extended;
+
+static int process_opt(void *context, int ch, char *optarg)
+{
+	switch (ch) {
+	case 'e':
+		extended = 1;
+		break;
+	case 'a':
+		all_ports++;
+		port = ALL_PORTS;
+		break;
+	case 'l':
+		loop_ports++;
+		break;
+	case 'r':
+		reset++;
+		break;
+	case 'R':
+		reset_only++;
+		break;
+	default:
+		return -1;
+	}
+	return 0;
+}
+
+int main(int argc, char **argv)
 {
 	int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS};
-	ib_portid_t *sm_id = 0, sm_portid = {0};
 	ib_portid_t portid = {0};
-	int dest_type = IB_DEST_LID;
-	int timeout = 0; /* use default */
-	int mask = 0xffff, all_ports = 0;
-	int reset = 0, reset_only = 0;
-	int port = 0;
-	int udebug = 0;
-	char *ca = 0;
-	int ca_port = 0;
-	int extended = 0;
+	int mask = 0xffff;
 	uint16_t cap_mask;
 	int all_ports_loop = 0;
-	int loop_ports = 0;
 	int node_type, num_ports = 0;
 	uint8_t data[IB_SMP_DATA_SIZE];
 	int start_port = 1;
 	int enhancedport0;
 	int i;
-	static char const str_opts[] = "C:P:s:t:dGealrRVhu";
-	static const struct option long_opts[] = {
-		{ "C", 1, 0, 'C'},
-		{ "P", 1, 0, 'P'},
-		{ "debug", 0, 0, 'd'},
-		{ "Guid", 0, 0, 'G'},
-		{ "extended", 0, 0, 'e'},
-		{ "all_ports", 0, 0, 'a'},
-		{ "loop_ports", 0, 0, 'l'},
-		{ "reset_after_read", 0, 0, 'r'},
-		{ "Reset_only", 0, 0, 'R'},
-		{ "sm_portid", 1, 0, 's'},
-		{ "timeout", 1, 0, 't'},
-		{ "Version", 0, 0, 'V'},
-		{ "help", 0, 0, 'h'},
-		{ "usage", 0, 0, 'u'},
+	const struct ibdiag_opt opts[] = {
+		{ "extended", 'e', 0, NULL, "show extended port counters" },
+		{ "all_ports", 'a', 0, NULL, "show aggregated counters" },
+		{ "loop_ports", 'l', 0, NULL, "iterate through each port" },
+		{ "reset_after_read", 'r', 0, NULL, "reset counters after read" },
+		{ "Reset_only", 'R', 0, NULL, "only reset counters" },
 		{ }
 	};
+	char usage_args[] = " [ [[port] [reset_mask]]]";
+	const char *usage_examples[] = {
+		"\t\t# read local port's performance counters",
+		"32 1\t\t# read performance counters from lid 32, port 1",
+		"-e 32 1\t# read extended performance counters from lid 32, port 1",
+		"-a 32\t\t# read performance counters from lid 32, all ports",
+		"-r 32 1\t# read performance counters and reset",
+		"-e -r 32 1\t# read extended performance counters and reset",
+		"-R 0x20 1\t# reset performance counters of port 1 only",
+		"-e -R 0x20 1\t# reset extended performance counters of port 1 only",
+		"-R -a 32\t# reset performance counters of all ports",
+		"-R 32 2 0x0fff\t# reset only error counters of port 2",
+		"-R 32 2 0xf000\t# reset only non-error counters of port 2",
+		NULL,
+	};
-	argv0 = argv[0];
+	ibdiag_process_opts(argc, argv, NULL, "De", opts, process_opt,
+			    usage_args, usage_examples);
-	while (1) {
-		int ch = getopt_long(argc, argv, str_opts, long_opts, NULL);
-		if ( ch == -1 )
-			break;
-		switch(ch) {
-		case 'C':
-			ca = optarg;
-			break;
-		case 'P':
-			ca_port = strtoul(optarg, 0, 0);
-			break;
-		case 'e':
-			extended = 1;
-			break;
-		case 'a':
-			all_ports++;
-			port = ALL_PORTS;
-			break;
-		case 'l':
-			loop_ports++;
-			break;
-		case 'd':
-			ibdebug++;
-			madrpc_show_errors(1);
-			umad_debug(udebug);
-			udebug++;
-			break;
-		case 'G':
-			dest_type = IB_DEST_GUID;
-			break;
-		case 's':
-			if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0)
-				IBERROR("can't resolve SM destination port %s", optarg);
-			sm_id = &sm_portid;
-			break;
-		case 'r':
-			reset++;
-			break;
-		case 'R':
-			reset_only++;
-			break;
-		case 't':
-			timeout = strtoul(optarg, 0, 0);
-			madrpc_set_timeout(timeout);
-			break;
-		case 'V':
-			fprintf(stderr, "%s %s\n", argv0, get_build_version() );
-			exit(-1);
-		default:
-			usage();
-			break;
-		}
-	}
+	argv0 = argv[0];
 	argc -= optind;
 	argv += optind;
@@ -444,10 +386,10 @@ main(int argc, char **argv)
 	if (argc > 2)
 		mask = strtoul(argv[2], 0, 0);
-	madrpc_init(ca, ca_port, mgmt_classes, 4);
+	madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 4);
 	if (argc) {
-		if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
+		if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0)
 			IBERROR("can't resolve destination port %s", argv[0]);
 	} else {
 		if (ib_resolve_self(&portid, &port, 0) < 0)
@@ -455,7 +397,7 @@ main(int argc, char **argv)
 	}
 	/* PerfMgt ClassPortInfo is a required attribute */
-	if (!perf_classportinfo_query(pc, &portid, port, timeout))
+	if (!perf_classportinfo_query(pc, &portid, port, ibd_timeout))
 		IBERROR("classportinfo query");
 	/* ClassPortInfo should be supported as part of libibmad */
 	memcpy(&cap_mask, pc + 2, sizeof(cap_mask)); /* CapabilityMask */
@@ -491,7 +433,7 @@ main(int argc, char **argv)
 	if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) {
 		for (i = start_port; i <= num_ports; i++)
-			dump_perfcounters(extended, timeout, cap_mask, &portid, i,
+			dump_perfcounters(extended, ibd_timeout, cap_mask, &portid, i,
 					  (all_ports_loop && !loop_ports));
 		if (all_ports_loop && !loop_ports) {
 			if (extended != 1)
@@ -501,7 +443,7 @@ main(int argc, char **argv)
 		}
 	}
 	else
-		dump_perfcounters(extended, timeout, cap_mask, &portid, port, 0);
+		dump_perfcounters(extended, ibd_timeout, cap_mask, &portid, port, 0);
 	if (!reset)
 		exit(0);
@@ -513,10 +455,10 @@ do_reset:
 	if (all_ports_loop || (loop_ports && (all_ports || port == ALL_PORTS))) {
 		for (i = start_port; i <= num_ports; i++)
-			reset_counters(extended, timeout, mask, &portid, i);
+			reset_counters(extended, ibd_timeout, mask, &portid, i);
 	}
 	else
-		reset_counters(extended, timeout, mask, &portid, port);
+		reset_counters(extended, ibd_timeout, mask, &portid, port);
 	exit(0);
 }
diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c
index 22d3186..c86d8b4 100644
--- a/infiniband-diags/src/saquery.c
+++ b/infiniband-diags/src/saquery.c
@@ -82,8 +82,6 @@ osmv_query_res_t result;
 osm_log_t log_osm;
 osm_mad_pool_t mad_pool;
 osm_vendor_t *vendor = NULL;
-int osm_debug = 0;
-uint32_t sa_timeout_ms = DEFAULT_SA_TIMEOUT_MS;
 char *sa_hca_name = NULL;
 uint32_t sa_port_num = 0;
@@ -691,7 +689,7 @@ get_any_records(osm_bind_handle_t h,
 	user.p_attr = attr;
 	req.query_type = OSMV_QUERY_USER_DEFINED;
-	req.timeout_ms = sa_timeout_ms;
+	req.timeout_ms = ibd_timeout;
 	req.retry_cnt = 1;
 	req.flags = OSM_SA_FLAGS_SYNC;
 	req.query_context = NULL;
@@ -888,7 +886,7 @@ get_print_path_rec_lid(osm_bind_handle_t h,
 	memset(&req, 0, sizeof(req));
 	req.query_type = OSMV_QUERY_PATH_REC_BY_LIDS;
-	req.timeout_ms = sa_timeout_ms;
+	req.timeout_ms = ibd_timeout;
 	req.retry_cnt = 1;
 	req.flags = OSM_SA_FLAGS_SYNC;
 	req.query_context = NULL;
@@ -926,7 +924,7 @@ get_print_path_rec_gid(osm_bind_handle_t h,
 	memset(&req, 0, sizeof(req));
 	req.query_type = OSMV_QUERY_PATH_REC_BY_GIDS;
-	req.timeout_ms = sa_timeout_ms;
+	req.timeout_ms = ibd_timeout;
 	req.retry_cnt = 1;
 	req.flags = OSM_SA_FLAGS_SYNC;
 	req.query_context = NULL;
@@ -958,7 +956,7 @@ static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h)
 	memset(&req, 0, sizeof(req));
 	req.query_type = OSMV_QUERY_CLASS_PORT_INFO;
-	req.timeout_ms = sa_timeout_ms;
+	req.timeout_ms = ibd_timeout;
 	req.retry_cnt = 1;
 	req.flags = OSM_SA_FLAGS_SYNC;
 	req.query_context = NULL;
@@ -1401,10 +1399,10 @@ static osm_bind_handle_t get_bind_handle(void)
 		exit(-1);
 	}
 	osm_log_set_level(&log_osm, OSM_LOG_NONE);
-	if (osm_debug)
+	if (ibdebug)
 		osm_log_set_level(&log_osm, OSM_LOG_DEFAULT_LEVEL);
-	vendor = osm_vendor_new(&log_osm, sa_timeout_ms);
+	vendor = osm_vendor_new(&log_osm, ibd_timeout);
 	osm_mad_pool_construct(&mad_pool);
 	if ((status = osm_mad_pool_init(&mad_pool)) != IB_SUCCESS) {
 		fprintf(stderr, "Failed to init mad pool: %s\n",
@@ -1511,60 +1509,6 @@ static const struct query_cmd *find_query_by_type(ib_net16_t type)
 	return NULL;
 }
-static void usage(void)
-{
-	const struct query_cmd *q;
-
-	fprintf(stderr, "Usage: %s [-h -d -p -N] [--list | -D] [-S -I -L -l -G"
-		" -O -U -c -s -g -m --src-to-dst 
--sgid-to-dgid " - "-C -P -t(imeout) ] [query-name] [ | | ]\n", - argv0); - fprintf(stderr, " Queries node records by default\n"); - fprintf(stderr, " -d enable debugging\n"); - fprintf(stderr, " -p get PathRecord info\n"); - fprintf(stderr, " -N get NodeRecord info\n"); - fprintf(stderr, " --list | -D the node desc of the CA's\n"); - fprintf(stderr, " -S get ServiceRecord info\n"); - fprintf(stderr, " -I get InformInfoRecord (subscription) info\n"); - fprintf(stderr, " -L return the Lids of the name specified\n"); - fprintf(stderr, " -l return the unique Lid of the name specified\n"); - fprintf(stderr, " -G return the Guids of the name specified\n"); - fprintf(stderr, " -O return name for the Lid specified\n"); - fprintf(stderr, " -U return name for the Guid specified\n"); - fprintf(stderr, " -c get the SA's class port info\n"); - fprintf(stderr, " -s return the PortInfoRecords with isSM or " - "isSMdisabled capability mask bit on\n"); - fprintf(stderr, " -g get multicast group info\n"); - fprintf(stderr, " -m get multicast member info\n"); - fprintf(stderr, " (if multicast group specified, list member GIDs" - " only for group specified\n"); - fprintf(stderr, " specified, for example 'saquery -m 0xC000')\n"); - fprintf(stderr, " -x get LinkRecord info\n"); - fprintf(stderr, " --src-to-dst get a PathRecord for \n" - " where src and dst are either node " - "names or LIDs\n"); - fprintf(stderr, " --sgid-to-dgid get a PathRecord for \n" - " where sgid and dgid are addresses in " - "IPv6 format\n"); - fprintf(stderr, " -C specify the SA query HCA\n"); - fprintf(stderr, " -P specify the SA query port\n"); - fprintf(stderr, " --smkey specify SM_Key value for the query." 
- " If non-numeric value \n" - " (like 'x') is specified then " - "saquery will prompt for a value\n"); - fprintf(stderr, " -t | --timeout specify the SA query " - "response timeout (default %u msec)\n", DEFAULT_SA_TIMEOUT_MS); - fprintf(stderr, - " --node-name-map specify a node name map\n"); - fprintf(stderr, "\n Supported query names (and aliases):\n"); - for (q = query_cmds; q->name; q++) - fprintf(stderr, " %s (%s) %s\n", q->name, - q->alias ? q->alias : "", q->usage ? q->usage : ""); - fprintf(stderr, "\n"); - - exit(-1); -} - enum saquery_command { SAQUERY_CMD_QUERY, SAQUERY_CMD_NODE_RECORD, @@ -1575,164 +1519,170 @@ enum saquery_command { SAQUERY_CMD_MCMEMBERS, }; +static enum saquery_command command = SAQUERY_CMD_QUERY; +static ib_net16_t query_type; +static char *src, *dst, *sgid, *dgid; + +static int process_opt(void *context, int ch, char *optarg) +{ + switch (ch) { + case 1: + { + char *opt = strdup(optarg); + char *ch = strchr(opt, ':'); + if (!ch) { + fprintf(stderr, + "ERROR: --src-to-dst :\n"); + ibdiag_show_usage(); + } + *ch++ = '\0'; + if (*opt) + src = strdup(opt); + if (*ch) + dst = strdup(ch); + free(opt); + command = SAQUERY_CMD_PATH_RECORD; + break; + } + case 2: + { + char *opt = strdup(optarg); + char *tok1 = strtok(opt, "-"); + char *tok2 = strtok(NULL, "\0"); + + if (tok1 && tok2) { + sgid = strdup(tok1); + dgid = strdup(tok2); + } else { + fprintf(stderr, + "ERROR: --sgid-to-dgid -\n"); + ibdiag_show_usage(); + } + free(opt); + command = SAQUERY_CMD_PATH_RECORD; + break; + } + case 3: + node_name_map_file = strdup(optarg); + break; + case 4: + if (!isxdigit(*optarg) && + !(optarg = getpass("SM_Key: "))) { + fprintf(stderr, "cannot get SM_Key\n"); + ibdiag_show_usage(); + } + smkey = cl_hton64(strtoull(optarg, NULL, 0)); + break; + case 'p': + command = SAQUERY_CMD_PATH_RECORD; + break; + case 'D': + node_print_desc = ALL_DESC; + break; + case 'c': + command = SAQUERY_CMD_CLASS_PORT_INFO; + break; + case 'S': + query_type = 
IB_MAD_ATTR_SERVICE_RECORD; + break; + case 'I': + query_type = IB_MAD_ATTR_INFORM_INFO_RECORD; + break; + case 'N': + command = SAQUERY_CMD_NODE_RECORD; + break; + case 'L': + node_print_desc = LID_ONLY; + break; + case 'l': + node_print_desc = UNIQUE_LID_ONLY; + break; + case 'G': + node_print_desc = GUID_ONLY; + break; + case 'O': + node_print_desc = NAME_OF_LID; + break; + case 'U': + node_print_desc = NAME_OF_GUID; + break; + case 's': + command = SAQUERY_CMD_ISSM; + break; + case 'g': + command = SAQUERY_CMD_MCGROUPS; + break; + case 'm': + command = SAQUERY_CMD_MCMEMBERS; + break; + case 'x': + query_type = IB_MAD_ATTR_LINK_RECORD; + break; + default: + return -1; + } + return 0; +} + int main(int argc, char **argv) { - int ch = 0; + char usage_args[1024]; osm_bind_handle_t h; - enum saquery_command command = SAQUERY_CMD_QUERY; - const struct query_cmd *q = NULL; - char *src = NULL, *dst = NULL; - char *sgid = NULL, *dgid = NULL; - ib_net16_t query_type = 0; + const struct query_cmd *q; ib_net16_t src_lid, dst_lid; ib_api_status_t status; - - static char const str_opts[] = "pVNDLlGOUcSIsgmxdhP:C:t:"; - static const struct option long_opts[] = { - {"p", 0, 0, 'p'}, - {"Version", 0, 0, 'V'}, - {"N", 0, 0, 'N'}, - {"L", 0, 0, 'L'}, - {"l", 0, 0, 'l'}, - {"G", 0, 0, 'G'}, - {"O", 0, 0, 'O'}, - {"U", 0, 0, 'U'}, - {"s", 0, 0, 's'}, - {"g", 0, 0, 'g'}, - {"m", 0, 0, 'm'}, - {"x", 0, 0, 'x'}, - {"d", 0, 0, 'd'}, - {"c", 0, 0, 'c'}, - {"S", 0, 0, 'S'}, - {"I", 0, 0, 'I'}, - {"P", 1, 0, 'P'}, - {"C", 1, 0, 'C'}, - {"help", 0, 0, 'h'}, - {"list", 0, 0, 'D'}, - {"src-to-dst", 1, 0, 1}, - {"sgid-to-dgid", 1, 0, 2}, - {"timeout", 1, 0, 't'}, - {"node-name-map", 1, 0, 3}, - {"smkey", 1, 0, 4}, + int n; + + const struct ibdiag_opt opts[] = { + {"p", 'p', 0, NULL, "get PathRecord info"}, + {"N", 'N', 0, NULL, "get NodeRecord info"}, + {"L", 'L', 0, NULL, "return the Lids of the name specified"}, + {"l", 'l', 0, NULL, "return the unique Lid of the name specified"}, + {"G", 
'G', 0, NULL, "return the Guids of the name specified"}, + {"O", 'O', 0, NULL, "return name for the Lid specified"}, + {"U", 'U', 0, NULL, "return name for the Guid specified"}, + {"s", 's', 0, NULL, "return the PortInfoRecords with isSM or" + " isSMdisabled capability mask bit on"}, + {"g", 'g', 0, NULL, "get multicast group info"}, + {"m", 'm', 0, NULL, "get multicast member info (if multicast" + " group specified, list member GIDs only for group specified," + " for example 'saquery -m 0xC000')"}, + {"x", 'x', 0, NULL, "get LinkRecord info"}, + {"c", 'c', 0, NULL, "get the SA's class port info"}, + {"S", 'S', 0, NULL, "get ServiceRecord info"}, + {"I", 'I', 0, NULL, "get InformInfoRecord (subscription) info"}, + {"list", 'D', 0, NULL, "the node desc of the CA's"}, + {"src-to-dst", 1, 1, "", "get a PathRecord for" + " where src and dst are either node names or LIDs"}, + {"sgid-to-dgid", 2, 1, "", "get a PathRecord for" + " where sgid and dgid are addresses in IPv6 format"}, + {"node-name-map", 3, 1, "", "specify a node name map file"}, + {"smkey", 4, 1, "", "SA SM_Key value for the query." 
+ " If non-numeric value (like 'x') is specified then" + " saquery will prompt for a value"}, {} }; - argv0 = argv[0]; - - while ((ch = getopt_long(argc, argv, str_opts, long_opts, NULL)) != -1) { - switch (ch) { - case 1: - { - char *opt = strdup(optarg); - char *ch = strchr(opt, ':'); - if (!ch) { - fprintf(stderr, - "ERROR: --src-to-dst :\n"); - usage(); - } - *ch++ = '\0'; - if (*opt) - src = strdup(opt); - if (*ch) - dst = strdup(ch); - free(opt); - command = SAQUERY_CMD_PATH_RECORD; - break; - } - case 2: - { - char *opt = strdup(optarg); - char *tok1 = strtok(opt, "-"); - char *tok2 = strtok(NULL, "\0"); - - if (tok1 && tok2) { - sgid = strdup(tok1); - dgid = strdup(tok2); - } else { - fprintf(stderr, - "ERROR: --sgid-to-dgid -\n"); - usage(); - } - free(opt); - command = SAQUERY_CMD_PATH_RECORD; - break; - } - case 3: - node_name_map_file = strdup(optarg); - break; - case 4: - if (!isxdigit(*optarg) && - !(optarg = getpass("SM_Key: "))) { - fprintf(stderr, "cannot get SM_Key\n"); - usage(); - } - smkey = cl_hton64(strtoull(optarg, NULL, 0)); - break; - case 'p': - command = SAQUERY_CMD_PATH_RECORD; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version()); + n = sprintf(usage_args, "[query-name] [ | | ]\n" + "\nSupported query names (and aliases):\n"); + for (q = query_cmds; q->name; q++) { + n += snprintf(usage_args + n, sizeof(usage_args) - n, + " %s (%s) %s\n", q->name, + q->alias ? q->alias : "", + q->usage ? 
q->usage : ""); + if (n >= sizeof(usage_args)) exit(-1); - case 'D': - node_print_desc = ALL_DESC; - break; - case 'c': - command = SAQUERY_CMD_CLASS_PORT_INFO; - break; - case 'S': - query_type = IB_MAD_ATTR_SERVICE_RECORD; - break; - case 'I': - query_type = IB_MAD_ATTR_INFORM_INFO_RECORD; - break; - case 'N': - command = SAQUERY_CMD_NODE_RECORD; - break; - case 'L': - node_print_desc = LID_ONLY; - break; - case 'l': - node_print_desc = UNIQUE_LID_ONLY; - break; - case 'G': - node_print_desc = GUID_ONLY; - break; - case 'O': - node_print_desc = NAME_OF_LID; - break; - case 'U': - node_print_desc = NAME_OF_GUID; - break; - case 's': - command = SAQUERY_CMD_ISSM; - break; - case 'g': - command = SAQUERY_CMD_MCGROUPS; - break; - case 'm': - command = SAQUERY_CMD_MCMEMBERS; - break; - case 'x': - query_type = IB_MAD_ATTR_LINK_RECORD; - break; - case 'd': - osm_debug = 1; - break; - case 'C': - sa_hca_name = optarg; - break; - case 'P': - sa_port_num = strtoul(optarg, NULL, 0); - break; - case 't': - sa_timeout_ms = strtoul(optarg, NULL, 0); - break; - case 'h': - default: - usage(); - } } + snprintf(usage_args + n, sizeof(usage_args) - n, + "\n Queries node records by default."); + + q = NULL; + ibd_timeout = DEFAULT_SA_TIMEOUT_MS; + + ibdiag_process_opts(argc, argv, NULL, "DLGs", opts, process_opt, + usage_args, NULL); + + argv0 = argv[0]; argc -= optind; argv += optind; @@ -1762,23 +1712,23 @@ int main(int argc, char **argv) node_print_desc == UNIQUE_LID_ONLY || node_print_desc == GUID_ONLY) && !requested_name) { fprintf(stderr, "ERROR: name not specified\n"); - usage(); + ibdiag_show_usage(); } if (node_print_desc == NAME_OF_LID && !requested_lid_flag) { fprintf(stderr, "ERROR: lid not specified\n"); - usage(); + ibdiag_show_usage(); } if (node_print_desc == NAME_OF_GUID && !requested_guid_flag) { fprintf(stderr, "ERROR: guid not specified\n"); - usage(); + ibdiag_show_usage(); } /* Note: lid cannot be 0; see infiniband spec 4.1.3 */ if (node_print_desc == 
NAME_OF_LID && !requested_lid) { fprintf(stderr, "ERROR: lid invalid\n"); - usage(); + ibdiag_show_usage(); } h = get_bind_handle(); diff --git a/infiniband-diags/src/sminfo.c b/infiniband-diags/src/sminfo.c index f7d09db..4cb7a6b 100644 --- a/infiniband-diags/src/sminfo.c +++ b/infiniband-diags/src/sminfo.c @@ -51,15 +51,6 @@ static uint8_t sminfo[1024]; char *argv0 = "sminfo"; -static void -usage(void) -{ - fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -s state -p prio -a activity -D(irect) -G(uid) -V(ersion) -C ca_name -P ca_port " - "-t(imeout) timeout_ms] [modifier]\n", - argv0); - exit(-1); -} - int strdata, xdata=1, bindata; enum { SMINFO_NOTACT, @@ -79,102 +70,60 @@ char *statestr[] = { #define STATESTR(s) (((unsigned)(s)) < SMINFO_STATE_LAST ? statestr[s] : "???") -int -main(int argc, char **argv) +static unsigned act; +static int prio, state = SMINFO_STANDBY; + +static int process_opt(void *context, int ch, char *optarg) +{ + switch (ch) { + case 'a': + act = strtoul(optarg, 0, 0); + break; + case 's': + state = strtoul(optarg, 0, 0); + break; + case 'p': + prio = strtoul(optarg, 0, 0); + break; + default: + return -1; + } + return 0; +} + +int main(int argc, char **argv) { int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; int mod = 0; ib_portid_t portid = {0}; - int timeout = 0; /* use default */ uint8_t *p; - unsigned act = 0; - int prio = 0, state = SMINFO_STANDBY; uint64_t guid = 0, key = 0; - int dest_type = IB_DEST_LID; - int udebug = 0; - char *ca = 0; - int ca_port = 0; - - static char const str_opts[] = "C:P:t:s:p:a:deDGVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "debug", 0, 0, 'd'}, - { "err_show", 0, 0, 'e'}, - { "s", 1, 0, 's'}, - { "p", 1, 0, 'p'}, - { "a", 1, 0, 'a'}, - { "Direct", 0, 0, 'D'}, - { "Guid", 0, 0, 'G'}, - { "Version", 0, 0, 'V'}, - { "timeout", 1, 0, 't'}, - { "help", 0, 0, 'h'}, - { "usage", 0, 0, 'u'}, + + const struct ibdiag_opt opts[] = { + { 
"state", 's', 1, "<0-3>", "set SM state"}, + { "priority", 'p', 1, "<0-15>", "set SM priority"}, + { "activity", 'a', 1, NULL, "set activity count"}, { } }; + char usage_args[] = " [modifier]"; - argv0 = argv[0]; + ibdiag_process_opts(argc, argv, NULL, "s", opts, process_opt, + usage_args, NULL); - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - break; - switch(ch) { - case 'C': - ca = optarg; - break; - case 'P': - ca_port = strtoul(optarg, 0, 0); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 't': - timeout = strtoul(optarg, 0, 0); - madrpc_set_timeout(timeout); - break; - case 'a': - act = strtoul(optarg, 0, 0); - break; - case 's': - state = strtoul(optarg, 0, 0); - break; - case 'p': - prio = strtoul(optarg, 0, 0); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - break; - } - } + argv0 = argv[0]; argc -= optind; argv += optind; if (argc > 1) mod = atoi(argv[1]); - madrpc_init(ca, ca_port, mgmt_classes, 3); + madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], dest_type, 0) < 0) + if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, 0) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { - if (ib_resolve_smlid(&portid, timeout) < 0) + if (ib_resolve_smlid(&portid, ibd_timeout) < 0) IBERROR("can't resolve sm port %s", argv[0]); } @@ -185,10 +134,10 @@ main(int argc, char **argv) mad_encode_field(sminfo, IB_SMINFO_STATE_F, &state); if (mod) { - if (!(p = smp_set(sminfo, &portid, IB_ATTR_SMINFO, mod, timeout))) + if (!(p = smp_set(sminfo, &portid, IB_ATTR_SMINFO, mod, ibd_timeout))) IBERROR("query"); } else - if (!(p = smp_query(sminfo, &portid, IB_ATTR_SMINFO, 0, 
timeout))) + if (!(p = smp_query(sminfo, &portid, IB_ATTR_SMINFO, 0, ibd_timeout))) IBERROR("query"); mad_decode_field(sminfo, IB_SMINFO_GUID_F, &guid); diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c index a1a83c4..6299dba 100644 --- a/infiniband-diags/src/smpdump.c +++ b/infiniband-diags/src/smpdump.c @@ -53,8 +53,6 @@ static int mad_agent; static int drmad_tid = 0x123; -static int debug, verbose; - char *argv0 = "smpdump"; typedef struct { @@ -193,26 +191,29 @@ str2DRPath(char *str, DRPath *path) return path->hop_cnt; } -void -usage(void) +static int dump_char, mgmt_class = IB_SMI_CLASS; + +static int process_opt(void *context, int ch, char *optarg) { - fprintf(stderr, "Usage: %s [-s(ring) -D(irect) -V(ersion) -C ca_name -P ca_port -t(imeout) timeout_ms] [mod]\n", argv0); - fprintf(stderr, "\tDR examples:\n"); - fprintf(stderr, "\t\t%s -D 0,1,2,3,5 16 # NODE DESC\n", argv0); - fprintf(stderr, "\t\t%s -D 0,1,2 0x15 2 # PORT INFO, port 2\n", argv0); - fprintf(stderr, "\n\tLID routed examples:\n"); - fprintf(stderr, "\t\t%s 3 0x15 2 # PORT INFO, lid 3 port 2\n", argv0); - fprintf(stderr, "\t\t%s 0xa0 0x11 # NODE INFO, lid 0xa0\n", argv0); - fprintf(stderr, "\n"); - exit(-1); + switch (ch) { + case 's': + dump_char++; + break; + case 'D': + mgmt_class = IB_SMI_DIRECT_CLASS; + break; + case 'L': + mgmt_class = IB_SMI_CLASS; + break; + default: + return -1; + } + return 0; } -int -main(int argc, char *argv[]) +int main(int argc, char *argv[]) { - int dump_char = 0, timeout_ms = 1000; - int dev_port = 0, mgmt_class = IB_SMI_CLASS, dlid = 0; - char *dev_name = 0; + int dlid = 0; void *umad; struct drsmp *smp; int i, portid, mod = 0, attr; @@ -220,60 +221,32 @@ main(int argc, char *argv[]) uint8_t *desc; int length; - static char const str_opts[] = "C:P:t:dsDVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "debug", 0, 0, 'd'}, - { "sring", 0, 0, 's'}, - { "Direct", 0, 0, 'D'}, - { "timeout", 1, 
0, 't'}, - { "Version", 0, 0, 'V'}, - { "help", 0, 0, 'h'}, - { "usage", 0, 0, 'u'}, + const struct ibdiag_opt opts[] = { + { "sring", 's', 0, NULL, ""}, { } }; + char usage_args[] = " [mod]"; + const char *usage_examples[] = { + " -- DR routed examples:", + "%s -D 0,1,2,3,5 16 # NODE DESC", + "%s -D 0,1,2 0x15 2 # PORT INFO, port 2", + " -- LID routed examples:", + "%s 3 0x15 2 # PORT INFO, lid 3 port 2", + "%s 0xa0 0x11 # NODE INFO, lid 0xa0", + NULL + }; - argv0 = argv[0]; + ibd_timeout = 1000; - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - break; - switch(ch) { - case 's': - dump_char++; - break; - case 'd': - debug++; - if (debug > 1) - umad_debug(debug-1); - break; - case 'D': - mgmt_class = IB_SMI_DIRECT_CLASS; - break; - case 'C': - dev_name = optarg; - break; - case 'P': - dev_port = atoi(optarg); - break; - case 't': - timeout_ms = strtoul(optarg, 0, 0); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - break; - } - } + ibdiag_process_opts(argc, argv, NULL, "Gs", opts, process_opt, + usage_args, usage_examples); + + argv0 = argv[0]; argc -= optind; argv += optind; if (argc < 2) - usage(); + ibdiag_show_usage(); if (mgmt_class == IB_SMI_DIRECT_CLASS && str2DRPath(strdupa(argv[0]), &path) < 0) @@ -289,8 +262,8 @@ main(int argc, char *argv[]) if (umad_init() < 0) IBPANIC("can't init UMAD library"); - if ((portid = umad_open_port(dev_name, dev_port)) < 0) - IBPANIC("can't open UMAD port (%s:%d)", dev_name, dev_port); + if ((portid = umad_open_port(ibd_ca, ibd_ca_port)) < 0) + IBPANIC("can't open UMAD port (%s:%d)", ibd_ca, ibd_ca_port); if ((mad_agent = umad_register(portid, mgmt_class, 1, 0, 0)) < 0) IBPANIC("Couldn't register agent for SMPs"); @@ -305,11 +278,11 @@ main(int argc, char *argv[]) else smp_get_init(umad, dlid, attr, mod); - if (debug > 1) + if (ibdebug > 1) xdump(stderr, "before send:\n", smp, 256); length = IB_MAD_SIZE; - if 
(umad_send(portid, mad_agent, umad, length, timeout_ms, 0) < 0) + if (umad_send(portid, mad_agent, umad, length, ibd_timeout, 0) < 0) IBPANIC("send failed"); if (umad_recv(portid, umad, &length, -1) != mad_agent) diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 7a7dddf..c5916b8 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -53,12 +53,6 @@ #include "ibdiag_common.h" -#undef DEBUG -#define DEBUG if (verbose>1) IBWARN - -static int dest_type = IB_DEST_LID; -static int verbose; - typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc); typedef struct match_rec { @@ -392,132 +386,72 @@ match_op(char *name) return 0; } -static void -usage(void) +static int process_opt(void *context, int ch, char *optarg) { - char *basename; - const match_rec_t *r; - - if (!(basename = strrchr(argv0, '/'))) - basename = argv0; - else - basename++; - - fprintf(stderr, "Usage: %s [-d(ebug) -e(rr_show) -v(erbose) -D(irect) -G(uid) -s smlid -V(ersion) -C ca_name -P ca_port " - "-t(imeout) timeout_ms --node-name-map node-name-map] [op params]\n", - basename); - fprintf(stderr, "\tsupported ops:\n"); - for (r = match_tbl ; r->name ; r++) { - fprintf(stderr, "\t\t%s %s\n", r->name, - r->opt_portnum ? 
" []" : ""); + switch (ch) { + case 1: + node_name_map_file = strdup(optarg); + break; + case 'c': + ibd_dest_type = IB_DEST_DRSLID; + break; + default: + return -1; } - fprintf(stderr, "\n\texamples:\n"); - fprintf(stderr, "\t\t%s portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier\n", basename); - fprintf(stderr, "\t\t%s -G switchinfo 0x2C9000100D051 1\t# switchinfo by guid\n", basename); - fprintf(stderr, "\t\t%s -D nodeinfo 0\t\t\t\t# nodeinfo by direct route\n", basename); - fprintf(stderr, "\t\t%s -c nodeinfo 6 0,12\t\t\t# nodeinfo by combined route\n", basename); - exit(-1); + return 0; } -int -main(int argc, char **argv) +int main(int argc, char **argv) { + char usage_args[1024]; int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; ib_portid_t portid = {0}; - ib_portid_t *sm_id = 0, sm_portid = {0}; - int timeout = 0, udebug = 0; - char *ca = 0; - int ca_port = 0; char *err; op_fn_t *fn; + const match_rec_t *r; + int n; - static char const str_opts[] = "C:P:t:s:devDcGVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "debug", 0, 0, 'd'}, - { "err_show", 0, 0, 'e'}, - { "verbose", 0, 0, 'v'}, - { "Direct", 0, 0, 'D'}, - { "combined", 0, 0, 'c'}, - { "Guid", 0, 0, 'G'}, - { "smlid", 1, 0, 's'}, - { "timeout", 1, 0, 't'}, - { "node-name-map", 1, 0, 1}, - { "Version", 0, 0, 'V'}, - { "help", 0, 0, 'h'}, - { "usage", 0, 0, 'u'}, - { } + const struct ibdiag_opt opts[] = { + { "combined", 'c', 0, NULL, "use Combined route address argument"}, + { "node-name-map", 1, 1, "", "node name map file"}, + {} + }; + const char *usage_examples[] = { + "portinfo 3 1\t\t\t\t# portinfo by lid, with port modifier", + "-G switchinfo 0x2C9000100D051 1\t# switchinfo by guid", + "-D nodeinfo 0\t\t\t\t# nodeinfo by direct route", + "-c nodeinfo 6 0,12\t\t\t# nodeinfo by combined route", + NULL }; - argv0 = argv[0]; - - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - 
break; - switch(ch) { - case 1: - node_name_map_file = strdup(optarg); - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - break; - case 'e': - madrpc_show_errors(1); - break; - case 'D': - dest_type = IB_DEST_DRPATH; - break; - case 'c': - dest_type = IB_DEST_DRSLID; - break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 'C': - ca = optarg; - break; - case 'P': - ca_port = strtoul(optarg, 0, 0); - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", optarg); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(optarg, 0, 0); - madrpc_set_timeout(timeout); - break; - case 'v': - verbose++; - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); + n = sprintf(usage_args, " [op params]\n" + "\nSupported ops:\n"); + for (r = match_tbl ; r->name ; r++) { + n += snprintf(usage_args + n, sizeof(usage_args) - n, + " %s %s\n", r->name, + r->opt_portnum ? 
" []" : ""); + if (n >= sizeof(usage_args)) exit(-1); - default: - usage(); - break; - } } + + ibdiag_process_opts(argc, argv, NULL, NULL, opts, process_opt, + usage_args, usage_examples); + + argv0 = argv[0]; argc -= optind; argv += optind; if (argc < 2) - usage(); + ibdiag_show_usage(); if (!(fn = match_op(argv[0]))) IBERROR("operation '%s' not supported", argv[0]); - madrpc_init(ca, ca_port, mgmt_classes, 3); + madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); node_name_map = open_node_name_map(node_name_map_file); - if (dest_type != IB_DEST_DRSLID) { - if (ib_resolve_portid_str(&portid, argv[1], dest_type, sm_id) < 0) + if (ibd_dest_type != IB_DEST_DRSLID) { + if (ib_resolve_portid_str(&portid, argv[1], ibd_dest_type, ibd_sm_id) < 0) IBERROR("can't resolve destination port %s", argv[1]); if ((err = fn(&portid, argv+2, argc-2))) IBERROR("operation %s: %s", argv[0], err); @@ -526,7 +460,7 @@ main(int argc, char **argv) memset(concat, 0, 64); snprintf(concat, sizeof(concat), "%s %s", argv[1], argv[2]); - if (ib_resolve_portid_str(&portid, concat, dest_type, sm_id) < 0) + if (ib_resolve_portid_str(&portid, concat, ibd_dest_type, ibd_sm_id) < 0) IBERROR("can't resolve destination port %s", concat); if ((err = fn(&portid, argv+3, argc-3))) IBERROR("operation %s: %s", argv[0], err); diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index 61f9501..93b32c0 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -106,116 +106,60 @@ typedef struct { is3_record_t record[18]; } is3_config_space_t; -static void -usage(void) +static int general_info, xmit_wait = 0; + +static int process_opt(void *context, int ch, char *optarg) { - char *basename; - - if (!(basename = strrchr(argv0, '/'))) - basename = argv0; - else - basename++; - - fprintf(stderr, "Usage: %s [-d(ebug) -N -w -G(uid) -C ca_name -P ca_port " - "-t(imeout) timeout_ms -V(ersion) -h(elp)] \n", - basename); - fprintf(stderr, "\tExamples:\n"); - 
fprintf(stderr, "\t\t%s -N 6\t\t# read IS3 general information\n", basename); - fprintf(stderr, "\t\t%s -w 6\t\t# read IS3 port xmit wait counters\n", basename); - exit(-1); + switch (ch) { + case 'N': + general_info = 1; + break; + case 'w': + xmit_wait = 1; + break; + default: + return -1; + } + return 0; } -int -main(int argc, char **argv) +int main(int argc, char **argv) { int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_MLX_VENDOR_CLASS}; - ib_portid_t *sm_id = 0, sm_portid = {0}; ib_portid_t portid = {0}; - int dest_type = IB_DEST_LID; - int timeout = 0; /* use default */ int port = 0; char buf[1024]; - int udebug = 0; - char *ca = 0; - int ca_port = 0; ib_vendor_call_t call; is3_general_info_t *gi; is3_config_space_t *cs; - int general_info = 0; - int xmit_wait = 0; int i; - static char const str_opts[] = "C:P:s:t:dNwGVhu"; - static const struct option long_opts[] = { - { "C", 1, 0, 'C'}, - { "P", 1, 0, 'P'}, - { "N", 1, 0, 'N'}, - { "w", 1, 0, 'w'}, - { "debug", 0, 0, 'd'}, - { "Guid", 0, 0, 'G'}, - { "sm_portid", 1, 0, 's'}, - { "timeout", 1, 0, 't'}, - { "Version", 0, 0, 'V'}, - { "help", 0, 0, 'h'}, - { "usage", 0, 0, 'u'}, - { } + const struct ibdiag_opt opts[] = { + { "N", 'N', 0, NULL, "show IS3 general information"}, + { "w", 'w', 0, NULL, "show IS3 port xmit wait counters"}, + {} + }; + char usage_args[] = ""; + const char *usage_examples[] = { + "-N 6\t\t# read IS3 general information", + "-w 6\t\t# read IS3 port xmit wait counters", + NULL }; - argv0 = argv[0]; + ibdiag_process_opts(argc, argv, NULL, "D", opts, process_opt, + usage_args, usage_examples); - while (1) { - int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); - if ( ch == -1 ) - break; - switch(ch) { - case 'C': - ca = optarg; - break; - case 'P': - ca_port = strtoul(optarg, 0, 0); - break; - case 'N': - general_info = 1; - break; - case 'w': - xmit_wait = 1; - break; - case 'd': - ibdebug++; - madrpc_show_errors(1); - umad_debug(udebug); - udebug++; - 
break; - case 'G': - dest_type = IB_DEST_GUID; - break; - case 's': - if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) - IBERROR("can't resolve SM destination port %s", optarg); - sm_id = &sm_portid; - break; - case 't': - timeout = strtoul(optarg, 0, 0); - madrpc_set_timeout(timeout); - break; - case 'V': - fprintf(stderr, "%s %s\n", argv0, get_build_version() ); - exit(-1); - default: - usage(); - break; - } - } + argv0 = argv[0]; argc -= optind; argv += optind; if (argc > 1) port = strtoul(argv[1], 0, 0); - madrpc_init(ca, ca_port, mgmt_classes, 4); + madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 4); if (argc) { - if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0) + if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) IBERROR("can't resolve destination port %s", argv[0]); } else { if (ib_resolve_self(&portid, &port, 0) < 0) @@ -235,7 +179,7 @@ main(int argc, char **argv) memset(&call, 0, sizeof(call)); call.mgmt_class = IB_MLX_VENDOR_CLASS; call.method = IB_MAD_METHOD_GET; - call.timeout = timeout; + call.timeout = ibd_timeout; memset(&buf, 0, sizeof(buf)); /* vendor ClassPortInfo is required attribute if class supported */ -- 1.6.0.4.766.g6fc4a From vlad at lists.openfabrics.org Sun Jan 25 03:59:03 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 25 Jan 2009 03:59:03 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090125-0200 daily build status Message-ID: <20090125115903.A582BE60DE1@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 
Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From tziporet at mellanox.co.il Sun Jan 25 05:10:48 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 25 Jan 2009 15:10:48 +0200 Subject: [ofa-general] OFED (EWG) meeting agenda for tomorrow (Jan 26) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com> These are the agenda items for the meeting tomorrow: 1. 
Decide on 1.4.1 release
If yes, what is the scope of the release? (Suggestions: RH 5.3, SLES 11, RDS with iWARP, Open MPI 1.3.)
According to the decisions we took on point releases, we add only new OSes and critical bug fixes.
Since some of the above features do not meet these criteria, we need to decide.

2. OFED 1.5 kernel base
In the last meeting we decided on 2.6.29. However, there is a concern: 2.6.29 is already in the release phase (RC2 is out), so any new kernel code that we develop will be posted for 2.6.30 only, and we would then need to back-port it if we want to take it into 1.5.
Thus it seems more reasonable to use 2.6.30 as the kernel base.

3. OFED 1.5 schedule
Betsy from Qlogic suggested moving the release earlier.
On the other hand, Olga from Voltaire asked to stay with the July time frame.
Based on the decisions in 1 & 2 we should decide on the release schedule.

Tziporet

From olga.shern at gmail.com Sun Jan 25 05:17:28 2009
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Sun, 25 Jan 2009 15:17:28 +0200
Subject: [ofa-general] ***SPAM*** Re: [ewg] OFED (EWG) meeting agenda for tomorrow (Jan 26)
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com>
References: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com>
Message-ID:

> > 3. OFED 1.5 schedule
> > Betsy from Qlogic suggested moving the release earlier.
> > On the other hand, Olga from Voltaire asked to stay with the July time frame.
> > Based on the decisions in 1 & 2 we should decide on the release schedule.

We should decide whether we want to have one or two OFED releases per year. If we decide to go for one OFED release per year, I think we should postpone the OFED 1.5 release to October and have a dot release in the middle.
From dorfman.eli at gmail.com Sun Jan 25 07:13:54 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 25 Jan 2009 17:13:54 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH 1/5] opensm/osm_opensm.[ch] make setup and destroy routing engines functions global. In-Reply-To: <20090122083930.GQ3479@sashak.voltaire.com> References: <4975D824.6020607@gmail.com> <4975D964.4020901@gmail.com> <20090122083930.GQ3479@sashak.voltaire.com> Message-ID: <497C81B2.4060204@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 16:02 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: >> make setup and destroy routing engines functions global. >> change setup_routing_engines() and destroy_routing_engines() >> declaration > > How is it related to configuration update? Is it? > > I cannot see where it is used, if so why to make it global? it is used in osm_subnet.c (next patch in this set) after update of routing engine. it is needed in order to apply the new routing engine. > >> Signed-off-by: Eli Dorfman >> --- >> opensm/include/opensm/osm_opensm.h | 53 ++++++++++++++++++++++++++++++++++++ >> opensm/opensm/osm_opensm.c | 5 ++- >> 2 files changed, 56 insertions(+), 2 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h >> index c121be4..5b0a1dd 100644 >> --- a/opensm/include/opensm/osm_opensm.h >> +++ b/opensm/include/opensm/osm_opensm.h >> @@ -458,6 +458,59 @@ osm_opensm_wait_for_subnet_up(IN osm_opensm_t * const p_osm, >> * SEE ALSO >> *********/ >> >> +/****f* OpenSM: OpenSM/setup_routing_engines >> +* NAME >> +* setup_routing_engines >> +* >> +* DESCRIPTION >> +* This function constructs an routing engines. >> +* >> +* SYNOPSIS >> +*/ >> +void setup_routing_engines(osm_opensm_t *osm, const char *name); > > For public function names we are using 'osm_' prefix. > >> +/* >> +* PARAMETERS >> +* p_osm >> +* [in] Pointer to a OpenSM object to construct. >> +* >> +* name >> +* [in] Routing engine names. 
>> +* >> +* RETURN VALUE >> +* This function does not return a value. >> +* >> +* NOTES >> +* Setup of routing engines >> +* >> +* SEE ALSO >> +* destroy_routing_engines >> +*********/ >> + >> +/****f* OpenSM: OpenSM/destroy_routing_engines >> +* NAME >> +* destroy_routing_engines >> +* >> +* DESCRIPTION >> +* This function constructs an routing engines. >> +* >> +* SYNOPSIS >> +*/ >> +void destroy_routing_engines(osm_opensm_t *osm); > > Ditto. > > Sasha > >> +/* >> +* PARAMETERS >> +* p_osm >> +* [in] Pointer to a OpenSM object to construct. >> +* >> +* RETURN VALUE >> +* This function does not return a value. >> +* >> +* NOTES >> +* Setup of routing engines >> +* >> +* SEE ALSO >> +* setup_routing_engines >> +*********/ >> + >> /****f* OpenSM: OpenSM/osm_routing_engine_type_str >> * NAME >> * osm_routing_engine_type_str >> diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c >> index 7de2e5b..8ecb942 100644 >> --- a/opensm/opensm/osm_opensm.c >> +++ b/opensm/opensm/osm_opensm.c >> @@ -186,7 +186,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) >> "cannot find or setup routing engine \'%s\'", name); >> } >> >> -static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) >> +void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) >> { >> char *name, *str, *p; >> >> @@ -224,7 +224,7 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) >> >> /********************************************************************** >> **********************************************************************/ >> -static void destroy_routing_engines(osm_opensm_t *osm) >> +void destroy_routing_engines(osm_opensm_t *osm) >> { >> struct osm_routing_engine *r, *next; >> >> @@ -236,6 +236,7 @@ static void destroy_routing_engines(osm_opensm_t *osm) >> r->delete(r->context); >> free(r); >> } >> + osm->routing_engine_list = NULL; >> } >> >> /********************************************************************** >> -- 
>> 1.5.5 >> From dorfman.eli at gmail.com Sun Jan 25 07:15:52 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 25 Jan 2009 17:15:52 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH 3/5] opensm/osm_subnet.h put qos options flat below subnet opt In-Reply-To: <20090122102343.GT3479@sashak.voltaire.com> References: <4975D824.6020607@gmail.com> <4975D9E7.4070604@gmail.com> <20090122102343.GT3479@sashak.voltaire.com> Message-ID: <497C8228.6020602@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 16:04 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: >> put qos options flat below subnet opt >> put all qos option parameters (default, ca, sw, router) flat below subnet opt >> >> Signed-off-by: Eli Dorfman >> --- >> opensm/include/opensm/osm_subnet.h | 40 +++++++++++++++++++++++++++--------- >> 1 files changed, 30 insertions(+), 10 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h >> index 8863e47..692e449 100644 >> --- a/opensm/include/opensm/osm_subnet.h >> +++ b/opensm/include/opensm/osm_subnet.h >> @@ -99,11 +99,11 @@ struct osm_qos_policy; >> * SYNOPSIS >> */ >> typedef struct osm_qos_options { >> - unsigned max_vls; >> - int high_limit; >> - char *vlarb_high; >> - char *vlarb_low; >> - char *sl2vl; >> + unsigned qos_max_vls; >> + int qos_high_limit; >> + char *qos_vlarb_high; >> + char *qos_vlarb_low; >> + char *qos_sl2vl; >> } osm_qos_options_t; >> /* >> * FIELDS >> @@ -199,11 +199,31 @@ typedef struct osm_subn_opt { >> boolean_t daemon; >> boolean_t sm_inactive; >> boolean_t babbling_port_policy; >> - osm_qos_options_t qos_options; >> - osm_qos_options_t qos_ca_options; >> - osm_qos_options_t qos_sw0_options; >> - osm_qos_options_t qos_swe_options; >> - osm_qos_options_t qos_rtr_options; >> + unsigned qos_max_vls; >> + int qos_high_limit; >> + char *qos_vlarb_high; >> + char *qos_vlarb_low; >> + char *qos_sl2vl; >> + unsigned qos_ca_max_vls; >> + int qos_ca_high_limit; >> + char 
*qos_ca_vlarb_high; >> + char *qos_ca_vlarb_low; >> + char *qos_ca_sl2vl; >> + unsigned qos_sw0_max_vls; >> + int qos_sw0_high_limit; >> + char *qos_sw0_vlarb_high; >> + char *qos_sw0_vlarb_low; >> + char *qos_sw0_sl2vl; >> + unsigned qos_swe_max_vls; >> + int qos_swe_high_limit; >> + char *qos_swe_vlarb_high; >> + char *qos_swe_vlarb_low; >> + char *qos_swe_sl2vl; >> + unsigned qos_rtr_max_vls; >> + int qos_rtr_high_limit; >> + char *qos_rtr_vlarb_high; >> + char *qos_rtr_vlarb_low; >> + char *qos_rtr_sl2vl; > > Looking on patch 5 I think that I understand your motivation. However > I'm not sure that it is a good idea - sooner or later we will need to > support QoS port parameters setup configurable per port (and not just > per port type as now), so it would be desirable to preserve QoS port > parameter processing as whole block in general. > > Also I think you can use something like: > > { "qos_ca_max_vls", OPT_OFFSET(qos_ca_options.max_vls), ... }, > > in your array in patch 5 and preserve QoS configuration unchanged. ok, i'll fix that and resend the patch. > > Sasha > >> boolean_t enable_quirks; >> boolean_t no_clients_rereg; >> #ifdef ENABLE_OSM_PERF_MGR >> -- >> 1.5.5 >> From sashak at voltaire.com Sun Jan 25 07:17:18 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 Jan 2009 17:17:18 +0200 Subject: [ofa-general] [PATCH 1/5] opensm/osm_opensm.[ch] make setup and destroy routing engines functions global. In-Reply-To: <497C81B2.4060204@gmail.com> References: <4975D824.6020607@gmail.com> <4975D964.4020901@gmail.com> <20090122083930.GQ3479@sashak.voltaire.com> <497C81B2.4060204@gmail.com> Message-ID: <20090125151711.GA14755@sashak.voltaire.com> On 17:13 Sun 25 Jan , Eli Dorfman (Voltaire) wrote: > > > > On 16:02 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: > >> make setup and destroy routing engines functions global. > >> change setup_routing_engines() and destroy_routing_engines() > >> declaration > > > > How is it related to configuration update? 
Is it? > > > > I cannot see where it is used, if so why to make it global? > > it is used in osm_subnet.c (next patch in this set) after update of routing engine. > it is needed in order to apply the new routing engine. Ok (I found this already - sorry, forgot to mention). Thanks for explanation. Sasha From dorfman.eli at gmail.com Sun Jan 25 07:18:24 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 25 Jan 2009 17:18:24 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/5] subnet configuration update In-Reply-To: <20090122084433.GR3479@sashak.voltaire.com> References: <4975D824.6020607@gmail.com> <20090122084433.GR3479@sashak.voltaire.com> Message-ID: <497C82C0.3020702@gmail.com> Sasha Khapyorsky wrote: > On 15:56 Tue 20 Jan , Eli Dorfman (Voltaire) wrote: >> The following patches are handling subnet configuration update. >> Subnet configuration parameters are rescanned every heavy sweep and if possible are updated. > > Patches 3 and 4 doesn't compile. Patch 5 doesn't apply and after fixing > doesn't compile too (don't resend yet, I want to look at this first). > very strange, i have compiled and tested it before sending to the list. will check again vs. that latest git. From sashak at voltaire.com Sun Jan 25 07:21:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 25 Jan 2009 17:21:03 +0200 Subject: [ofa-general] [PATCH 0/5] subnet configuration update In-Reply-To: <497C82C0.3020702@gmail.com> References: <4975D824.6020607@gmail.com> <20090122084433.GR3479@sashak.voltaire.com> <497C82C0.3020702@gmail.com> Message-ID: <20090125152103.GB14755@sashak.voltaire.com> On 17:18 Sun 25 Jan , Eli Dorfman (Voltaire) wrote: > > very strange, i have compiled and tested it before sending to the list. > will check again vs. that latest git. Patches 3-5 were not rebased to the current master. Patch 5 has also a syntax error (missed comma somewhere in opt_rec table). 
Sasha From sashak at voltaire.com Mon Jan 26 01:14:19 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jan 2009 11:14:19 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/smpquery: usage improvement In-Reply-To: <20090125115249.GB20419@sashak.voltaire.com> References: <20090125115136.GA20419@sashak.voltaire.com> <20090125115249.GB20419@sashak.voltaire.com> Message-ID: <20090126091419.GA5814@sashak.voltaire.com> This makes usage of smpquery operations more user friendly - similar to saquery each operation now has a shorter alias, string matching is case insensitive and abbreviations are allowed for both operation name and alias. And it is how this looks: Usage: smpquery [options] [op params] Supported operations (and aliases, case insensitive): NodeInfo (NI) NodeDesc (ND) PortInfo (PI) [] SwitchInfo (SI) PKeyTable (PKeys) [] SL2VLTable (SL2VL) [] VLArbitration (VLArb) [] GUIDInfo (GI) Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/smpquery.c | 32 +++++++++++++++++--------------- 1 files changed, 17 insertions(+), 15 deletions(-) diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c index 7dcf888..44280e1 100644 --- a/infiniband-diags/src/smpquery.c +++ b/infiniband-diags/src/smpquery.c @@ -54,7 +54,7 @@ typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc); typedef struct match_rec { - char *name; + const char *name, *alias; op_fn_t *fn; unsigned opt_portnum; } match_rec_t; @@ -63,14 +63,14 @@ static op_fn_t node_desc, node_info, port_info, switch_info, pkey_table, sl2vl_table, vlarb_table, guid_info; static const match_rec_t match_tbl[] = { - { "nodeinfo", node_info }, - { "nodedesc", node_desc }, - { "portinfo", port_info, 1 }, - { "switchinfo", switch_info }, - { "pkeys", pkey_table, 1 }, - { "sl2vl", sl2vl_table, 1 }, - { "vlarb", vlarb_table, 1 }, - { "guids", guid_info }, + { "NodeInfo", "NI", node_info }, + { "NodeDesc", "ND", node_desc }, + { "PortInfo", "PI", port_info, 1 }, + { 
"SwitchInfo", "SI", switch_info }, + { "PKeyTable", "PKeys", pkey_table, 1 }, + { "SL2VLTable", "SL2VL", sl2vl_table, 1 }, + { "VLArbitration", "VLArb", vlarb_table, 1 }, + { "GUIDInfo", "GI", guid_info }, {0} }; @@ -373,14 +373,15 @@ guid_info(ib_portid_t *dest, char **argv, int argc) return 0; } -static op_fn_t * -match_op(char *name) +static op_fn_t *match_op(char *name) { const match_rec_t *r; + unsigned len = strlen(name); for (r = match_tbl; r->name; r++) - if (!strcmp(r->name, name)) + if (!strncasecmp(r->name, name, len) || + (r->alias && !strncasecmp(r->alias, name, len))) return r->fn; - return 0; + return NULL; } static int process_opt(void *context, int ch, char *optarg) @@ -422,10 +423,11 @@ int main(int argc, char **argv) }; n = sprintf(usage_args, " [op params]\n" - "\nSupported ops:\n"); + "\nSupported ops (and aliases, case insensitive):\n"); for (r = match_tbl ; r->name ; r++) { n += snprintf(usage_args + n, sizeof(usage_args) - n, - " %s %s\n", r->name, + " %s (%s) %s\n", r->name, + r->alias ? r->alias : "", r->opt_portnum ? " []" : ""); if (n >= sizeof(usage_args)) exit(-1); -- 1.6.0.4.766.g6fc4a From tziporet at dev.mellanox.co.il Mon Jan 26 01:27:44 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 26 Jan 2009 11:27:44 +0200 Subject: [ofa-general] Re: [ewg] OFED (EWG) meeting agenda for tomorrow (Jan 26) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD018B89D5@mtlexch01.mtl.com> Message-ID: <497D8210.5040203@mellanox.co.il> Tziporet Koren wrote: > > These are the agenda items for the meeting tomorrow: > > 1. Decide on 1.4.1 release > > If yes - what is the scope of the release (suggestions: RH 5.3, SLES > 11, RDS with iWARP, Open MPI 1.3) > > According to decisions we took in point releases we add only new OSes > and critical bug fixes. > > Since some of the above features are not standing in this criteria we > need to decide. > > 2. 
OFED 1.5 kernel base > > In last meeting we decided on 2.6.29. However there is a concern: > 2.6.29 is already in release phase (RC2 is out), thus any new kernel > code that we will develop will be posted for 2.6.30 only, and then we > will need to back-port it if we will want to take it to 1.5. > > Thus it seems it is more reasonable to have the kernel base 2.6.30 > > 3. OFED 1.5 schedule > > Betsy from Qlogic suggested to early the release. > > From the other hand Olga from Voltaire asked to stay with the July > time frame. > > Based on the decisions in 1 & 2 we should decide on the release schedule. > One more item: Move to the new OFA server > > Tziporet > > ------------------------------------------------------------------------ > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From vlad at lists.openfabrics.org Mon Jan 26 03:16:32 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 26 Jan 2009 03:16:32 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090126-0200 daily build status Message-ID: <20090126111632.8012BE60F19@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with 
linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From dorfman.eli at gmail.com Mon Jan 26 06:28:15 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 26 Jan 2009 16:28:15 +0200 Subject: [ofa-general] ***SPAM*** [PATCH 0/4] support subnet configuration update Message-ID: <497DC87F.2090308@gmail.com> The following patches are handling subnet configuration update. Subnet configuration parameters are rescanned every heavy sweep and if possible are updated. From hartlch14 at gmail.com Mon Jan 26 06:24:28 2009 From: hartlch14 at gmail.com (Chuck Hartley) Date: Mon, 26 Jan 2009 09:24:28 -0500 Subject: [ofa-general] ***SPAM*** Unable to get IPoIB working Message-ID: Hello, I just brought up two new machines with ConnectX adapters and have been unable to get them to work using IPoIB. 
They are the only two hosts on the network and are connected via a Mellanox MTS3600 switch. All are running OFED 1.4 on Fedora 9, and the latest firmware (2.6.0) on the HCAs.

These are our first ConnectX HCAs. We have a number of hosts using InfiniHost HCAs, and IPoIB "just worked" without further configuration after OFED installation.

I am unable to ping from one adapter to the other:

# ping 172.16.0.70
PING 172.16.0.70 (172.16.0.70) 56(84) bytes of data.
From 172.16.0.71 icmp_seq=2 Destination Host Unreachable

# arp -n 172.16.0.70
Address HWtype HWaddress Flags Mask Iface
172.16.0.70 (incomplete) ib0

# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
    inet addr:172.16.0.71 Bcast:172.16.255.255 Mask:255.255.0.0
    UP BROADCAST MULTICAST MTU:65520 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:256
    RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

The ifconfig output looks ok, but this output from ip seems to contradict it(?):

# ip addr show ib0
4: ib0: mtu 65520 qdisc pfifo_fast state DOWN qlen 256
    link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:00:00:30:48:c6:4c:18:00:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.16.0.71/16 brd 172.16.255.255 scope global ib0

# saquery
NodeRecord dump:
lid.....................0x5
reserved................0x0
base_version............0x1
class_version...........0x1
node_type...............Channel Adapter
num_ports...............0x2
sys_guid................0x0002c9030003360f
node_guid...............0x0002c9030003360c
port_guid...............0x0002c9030003360d
partition_cap...........0x80
device_id...............0x673C
revision................0xA0
port_num................0x1
vendor_id...............0x2C9
NodeDescription.........linux71 HCA-1
NodeRecord dump:
lid.....................0x2
reserved................0x0
base_version............0x1
class_version...........0x1
node_type...............Switch
num_ports...............0x24
sys_guid................0x0002c9020040536b
node_guid...............0x0002c90200405368
port_guid...............0x0002c90200405368
partition_cap...........0x8
device_id...............0xBD36
revision................0xA0
port_num................0x1
vendor_id...............0x2C9
NodeDescription.........Infiniscale-IV Mellanox Technologies
NodeRecord dump:
lid.....................0x4
reserved................0x0
base_version............0x1
class_version...........0x1
node_type...............Channel Adapter
num_ports...............0x2
sys_guid................0x0002c90300032de3
node_guid...............0x0002c90300032de0
port_guid...............0x0002c90300032de1
partition_cap...........0x80
device_id...............0x673C
revision................0xA0
port_num................0x1
vendor_id...............0x2C9
NodeDescription.........linux70 HCA-1

Any ideas on what is going on here? Thanks for any help you can provide.

Chuck

From dorfman.eli at gmail.com Mon Jan 26 06:31:19 2009
From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire))
Date: Mon, 26 Jan 2009 16:31:19 +0200
Subject: ***SPAM*** [ofa-general] [PATCH 1/4] opensm/osm_opensm.[ch] make setup and destroy routing engines functions global
In-Reply-To: <497DC87F.2090308@gmail.com>
References: <497DC87F.2090308@gmail.com>
Message-ID: <497DC937.7020102@gmail.com>

make setup and destroy routing engines functions global.
change setup_routing_engines() and destroy_routing_engines() declaration Signed-off-by: Eli Dorfman --- opensm/include/opensm/osm_opensm.h | 53 ++++++++++++++++++++++++++++++++++++ opensm/opensm/osm_opensm.c | 5 ++- 2 files changed, 56 insertions(+), 2 deletions(-) diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h index c121be4..5b0a1dd 100644 --- a/opensm/include/opensm/osm_opensm.h +++ b/opensm/include/opensm/osm_opensm.h @@ -458,6 +458,59 @@ osm_opensm_wait_for_subnet_up(IN osm_opensm_t * const p_osm, * SEE ALSO *********/ +/****f* OpenSM: OpenSM/setup_routing_engines +* NAME +* setup_routing_engines +* +* DESCRIPTION +* This function constructs an routing engines. +* +* SYNOPSIS +*/ +void setup_routing_engines(osm_opensm_t *osm, const char *name); +/* +* PARAMETERS +* p_osm +* [in] Pointer to a OpenSM object to construct. +* +* name +* [in] Routing engine names. +* +* RETURN VALUE +* This function does not return a value. +* +* NOTES +* Setup of routing engines +* +* SEE ALSO +* destroy_routing_engines +*********/ + +/****f* OpenSM: OpenSM/destroy_routing_engines +* NAME +* destroy_routing_engines +* +* DESCRIPTION +* This function constructs an routing engines. +* +* SYNOPSIS +*/ +void destroy_routing_engines(osm_opensm_t *osm); +/* +* PARAMETERS +* p_osm +* [in] Pointer to a OpenSM object to construct. +* +* RETURN VALUE +* This function does not return a value. 
+* +* NOTES +* Setup of routing engines +* +* SEE ALSO +* setup_routing_engines +*********/ + /****f* OpenSM: OpenSM/osm_routing_engine_type_str * NAME * osm_routing_engine_type_str diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c index 7de2e5b..8ecb942 100644 --- a/opensm/opensm/osm_opensm.c +++ b/opensm/opensm/osm_opensm.c @@ -186,7 +186,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) "cannot find or setup routing engine \'%s\'", name); } -static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) +void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) { char *name, *str, *p; @@ -224,7 +224,7 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) /********************************************************************** **********************************************************************/ -static void destroy_routing_engines(osm_opensm_t *osm) +void destroy_routing_engines(osm_opensm_t *osm) { struct osm_routing_engine *r, *next; @@ -236,6 +236,7 @@ static void destroy_routing_engines(osm_opensm_t *osm) r->delete(r->context); free(r); } + osm->routing_engine_list = NULL; } /********************************************************************** -- 1.5.5 From dorfman.eli at gmail.com Mon Jan 26 06:32:15 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 26 Jan 2009 16:32:15 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 2/4] opensm/main.c rescan subnet configuration after SIGHUP In-Reply-To: <497DC87F.2090308@gmail.com> References: <497DC87F.2090308@gmail.com> Message-ID: <497DC96F.3000902@gmail.com> rescan subnet configuration after SIGHUP call osm_subn_rescan_conf_files() after SIGHUP. this is important when priority is changed and SM is in standby. in that case it will not send capability mask trap and will not become master. 
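[Editorial illustration, not part of the patch.] The behaviour described above (a HUP signal sets a flag, and the manager loop then rescans the configuration before starting a heavy sweep) might be sketched as below. This is a minimal sketch under stated assumptions: `hup_flag`, `on_hup`, `rescan_conf`, `heavy_sweep`, `install_hup_handler` and `manager_step` are hypothetical stand-ins, not the OpenSM functions, and the counters exist only so the effect is observable.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-ins for the OpenSM state; not the real API. */
static volatile sig_atomic_t hup_flag;  /* set by the signal handler */
static int rescan_count;                /* how often the config was rescanned */
static int sweep_count;                 /* how often a heavy sweep was started */

/* Signal handler: only records that SIGHUP arrived. */
static void on_hup(int sig)
{
	(void)sig;
	hup_flag = 1;
}

/* Placeholder for rereading the configuration files. */
static void rescan_conf(void)
{
	rescan_count++;
}

/* Placeholder for forcing and running a heavy sweep. */
static void heavy_sweep(void)
{
	sweep_count++;
}

static void install_hup_handler(void)
{
	signal(SIGHUP, on_hup);
}

/* One iteration of the manager loop: on HUP, rescan first, then sweep. */
static void manager_step(void)
{
	if (hup_flag) {
		hup_flag = 0;
		rescan_conf();
		heavy_sweep();
	}
}
```

Rescanning before the sweep means a changed value (for example a new priority) is already in place when the sweep runs, which is the ordering the one-line change in the patch aims for.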
Signed-off-by: Eli Dorfman --- opensm/opensm/main.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index f786192..0f7b822 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -507,6 +507,7 @@ int osm_manager_loop(osm_subn_opt_t * p_opt, osm_opensm_t * p_osm) osm_hup_flag = 0; /* a HUP signal should only start a new heavy sweep */ p_osm->subn.force_heavy_sweep = TRUE; + osm_subn_rescan_conf_files(&p_osm->subn); osm_opensm_sweep(p_osm); } } -- 1.5.5 From dorfman.eli at gmail.com Mon Jan 26 06:33:26 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 26 Jan 2009 16:33:26 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 3/4] opensm/osm_log.c save log_max_size in subnet opt in MB In-Reply-To: <497DC87F.2090308@gmail.com> References: <497DC87F.2090308@gmail.com> Message-ID: <497DC9B6.5010200@gmail.com> save log_max_size in subnet opt in MB the max_size in the log object is converted to bytes. Signed-off-by: Eli Dorfman --- opensm/opensm/main.c | 5 ++--- opensm/opensm/osm_log.c | 2 +- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 0f7b822..de38056 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -778,9 +778,8 @@ int main(int argc, char *argv[]) break; case 'L': - opt.log_max_size = - strtoul(optarg, NULL, 0) * (1024 * 1024); - printf(" Log file max size is %lu bytes\n", + opt.log_max_size = strtoul(optarg, NULL, 0); + printf(" Log file max size is %lu MBytes\n", opt.log_max_size); break; diff --git a/opensm/opensm/osm_log.c b/opensm/opensm/osm_log.c index 88633ab..d5e1af6 100644 --- a/opensm/opensm/osm_log.c +++ b/opensm/opensm/osm_log.c @@ -306,7 +306,7 @@ ib_api_status_t osm_log_init_v2(IN osm_log_t * const p_log, p_log->level = log_flags; p_log->flush = flush; p_log->count = 0; - p_log->max_size = max_size; + p_log->max_size = max_size << 20; /* convert size in MB to bytes */ p_log->accum_log_file 
= accum_log_file; p_log->log_file_name = (char *)log_file; -- 1.5.5 From Robert at saq.co.uk Mon Jan 26 06:24:09 2009 From: Robert at saq.co.uk (Robert Dunkley) Date: Mon, 26 Jan 2009 14:24:09 -0000 Subject: [ofa-general] ***SPAM*** Unable to get IPoIB working References: Message-ID: Please post the results of ibstat. Rob -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Chuck Hartley Sent: 26 January 2009 14:24 To: OpenFabrics General Subject: [ofa-general] ***SPAM*** Unable to get IPoIB working Hello, I just brought up two new machines with ConnectX adapters and have been unable to get them to work using IPoIB. They are the only two hosts on the network and are connected via a Mellanox MTS3600 switch. All are running OFED 1.4 on Fedora 9, and the latest firmware (2.6.0) on the HCAs. I am unable to ping from one adapter to the other: These are our first ConnectX HCAs. We have a number of hosts using InfiniHost HCAs, and IPoIB "just worked" without further configuration after OFED installation. # ping 172.16.0.70 PING 172.16.0.70 (172.16.0.70) 56(84) bytes of data. 
>From 172.16.0.71 icmp_seq=2 Destination Host Unreachable # arp -n 172.16.0.70 Address HWtype HWaddress Flags Mask Iface 172.16.0.70 (incomplete) ib0 # ifconfig ib0 ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 inet addr:172.16.0.71 Bcast:172.16.255.255 Mask:255.255.0.0 UP BROADCAST MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) The ifconfig output looks ok, but this output from ip seems to contradict it(?): # ip addr show ib0 4: ib0: mtu 65520 qdisc pfifo_fast state DOWN qlen 256 link/infiniband 80:00:04:04:fe:80:00:00:00:00:00:00:00:30:48:c6:4c:18:00:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff inet 172.16.0.71/16 brd 172.16.255.255 scope global ib0 # saquery NodeRecord dump: lid.....................0x5 reserved................0x0 base_version............0x1 class_version...........0x1 node_type...............Channel Adapter num_ports...............0x2 sys_guid................0x0002c9030003360f node_guid...............0x0002c9030003360c port_guid...............0x0002c9030003360d partition_cap...........0x80 device_id...............0x673C revision................0xA0 port_num................0x1 vendor_id...............0x2C9 NodeDescription.........linux71 HCA-1 NodeRecord dump: lid.....................0x2 reserved................0x0 base_version............0x1 class_version...........0x1 node_type...............Switch num_ports...............0x24 sys_guid................0x0002c9020040536b node_guid...............0x0002c90200405368 port_guid...............0x0002c90200405368 partition_cap...........0x8 device_id...............0xBD36 revision................0xA0 port_num................0x1 vendor_id...............0x2C9 NodeDescription.........Infiniscale-IV Mellanox Technologies NodeRecord dump: lid.....................0x4 reserved................0x0 
base_version............0x1 class_version...........0x1 node_type...............Channel Adapter num_ports...............0x2 sys_guid................0x0002c90300032de3 node_guid...............0x0002c90300032de0 port_guid...............0x0002c90300032de1 partition_cap...........0x80 device_id...............0x673C revision................0xA0 port_num................0x1 vendor_id...............0x2C9 NodeDescription.........linux70 HCA-1 Any ideas on what is going on here? Thanks for any help you can provide. Chuck _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From dorfman.eli at gmail.com Mon Jan 26 06:34:36 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 26 Jan 2009 16:34:36 +0200 Subject: ***SPAM*** [ofa-general] [PATCH 4/4] opensm/osm_subnet.c support subnet configuration rescan and update In-Reply-To: <497DC87F.2090308@gmail.com> References: <497DC87F.2090308@gmail.com> Message-ID: <497DC9FC.2050907@gmail.com> Support subnet configuration rescan and update. Subnet configuration parameters are rescanned on every heavy sweep. Each parameter is parsed by a parse function according to its type. Some parameters require a special post-update function to set them up. Each parameter also has a flag that specifies whether it can be updated at runtime. 
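For readers skimming the archive, the table-driven scheme described above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the code from the patch: the demo_* names, the three-field options struct, and the demo_apply() helper are all hypothetical, and only the general idea (an offsetof() table, one parse function per type, an optional setup hook, and a can_update flag) is taken from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical options struct; it only stands in for osm_subn_opt_t. */
typedef struct demo_opts {
	uint32_t sweep_interval;
	uint8_t sm_priority;
	int daemon;
} demo_opts_t;

/* Post-update hook, called before the new value is stored. */
typedef void (demo_setup_fn_t)(void *ctx, const void *p_val);
/* Type-specific parser: converts val_str and stores it into p_field. */
typedef void (demo_parse_fn_t)(void *ctx, const char *val_str,
			       void *p_field, demo_setup_fn_t *setup);

typedef struct demo_rec {
	const char *name;		/* key as written in the config file */
	size_t offset;			/* offset of the field in demo_opts_t */
	demo_parse_fn_t *parse_fn;
	demo_setup_fn_t *setup_fn;	/* may be NULL */
	int can_update;			/* non-zero: may change at runtime */
} demo_rec_t;

static int demo_setup_calls;	/* counts hook invocations, for the demo only */

static void demo_setup_priority(void *ctx, const void *p_val)
{
	(void)ctx;
	(void)p_val;
	demo_setup_calls++;
}

static void demo_parse_uint32(void *ctx, const char *val_str,
			      void *p_field, demo_setup_fn_t *setup)
{
	uint32_t *p = p_field;
	uint32_t val = (uint32_t)strtoul(val_str, NULL, 0);

	if (val != *p) {
		if (setup)
			setup(ctx, &val);
		*p = val;
	}
}

static void demo_parse_uint8(void *ctx, const char *val_str,
			     void *p_field, demo_setup_fn_t *setup)
{
	uint8_t *p = p_field;
	uint8_t val = (uint8_t)strtoul(val_str, NULL, 0);

	if (val != *p) {
		if (setup)
			setup(ctx, &val);
		*p = val;
	}
}

static void demo_parse_bool(void *ctx, const char *val_str,
			    void *p_field, demo_setup_fn_t *setup)
{
	int *p = p_field;
	int val = strcmp(val_str, "TRUE") == 0;

	if (val != *p) {
		if (setup)
			setup(ctx, &val);
		*p = val;
	}
}

static const demo_rec_t demo_tbl[] = {
	{ "sweep_interval", offsetof(demo_opts_t, sweep_interval), demo_parse_uint32, NULL, 1 },
	{ "sm_priority", offsetof(demo_opts_t, sm_priority), demo_parse_uint8, demo_setup_priority, 1 },
	{ "daemon", offsetof(demo_opts_t, daemon), demo_parse_bool, NULL, 0 },
	{ NULL, 0, NULL, NULL, 0 }
};

/*
 * Apply one key/value pair. 'runtime' is zero for the initial parse and
 * non-zero for a rescan. Returns 1 if the key was found and applied,
 * 0 if it is unknown or not updatable at runtime.
 */
static int demo_apply(demo_opts_t *opts, const char *key, const char *val,
		      int runtime)
{
	const demo_rec_t *r;

	assert(opts != NULL && key != NULL && val != NULL);

	for (r = demo_tbl; r->name; r++) {
		if (strcmp(r->name, key))
			continue;
		if (runtime && !r->can_update)
			return 0;
		/* setup hooks are skipped on the initial parse */
		r->parse_fn(opts, val, (char *)opts + r->offset,
			    runtime ? r->setup_fn : NULL);
		return 1;
	}
	return 0;
}
```

In this sketch a caller would run demo_apply(&opts, key, val, 0) for each line of the file at startup and demo_apply(&opts, key, val, 1) on a rescan, so adding an option means adding one table row instead of another call in each parsing function.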
Signed-off-by: Eli Dorfman --- opensm/opensm/osm_subnet.c | 685 +++++++++++++++++++++----------------------- 1 files changed, 330 insertions(+), 355 deletions(-) diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index 94b6332..39e989b 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -71,6 +71,188 @@ static const char null_str[] = "(null)"; +#define OPT_OFFSET(opt) offsetof(osm_subn_opt_t, opt) + +typedef void (setup_fn_t)(osm_subn_t *p_subn, void *p_val); +typedef void (parse_fn_t)(osm_subn_t *p_subn, char *p_key, char *p_val_str, void *p_val, setup_fn_t *f); + +typedef struct opt_rec { + const char *name; + uint32_t opt_offset; + parse_fn_t *parse_fn; + setup_fn_t *setup_fn; + int can_update; +} opt_rec_t; + +static parse_fn_t opts_parse_uint8, opts_parse_uint16, opts_parse_net16, opts_parse_uint32, + opts_parse_int32, opts_parse_net64, opts_parse_charp, opts_parse_boolean; + +static setup_fn_t opts_setup_log_flags, opts_setup_log_max_size, + opts_setup_force_log_flush, opts_setup_accum_log_file, + opts_setup_sminfo_polling_timeout, opts_setup_routing_engine, + opts_setup_sm_priority; + +static const opt_rec_t opt_tbl[] = { + { "guid", OPT_OFFSET(guid), opts_parse_net64, NULL, 0 }, + { "m_key", OPT_OFFSET(m_key), opts_parse_net64, NULL, 1 }, + { "sm_key", OPT_OFFSET(sm_key), opts_parse_net64, NULL, 1 }, + { "sa_key", OPT_OFFSET(sa_key), opts_parse_net64, NULL, 1 }, + { "subnet_prefix", OPT_OFFSET(subnet_prefix), opts_parse_net64, NULL, 1 }, + { "m_key_lease_period", OPT_OFFSET(m_key_lease_period), opts_parse_net16, NULL, 1 }, + { "sweep_interval", OPT_OFFSET(sweep_interval), opts_parse_uint32, NULL, 1 }, + { "max_wire_smps", OPT_OFFSET(max_wire_smps), opts_parse_uint32, NULL, 1 }, + { "console", OPT_OFFSET(console), opts_parse_charp, NULL, 0 }, + { "console_port", OPT_OFFSET(console_port), opts_parse_uint16, NULL, 0 }, + { "transaction_timeout", OPT_OFFSET(transaction_timeout), opts_parse_uint32, NULL, 1 }, + { 
"max_msg_fifo_timeout", OPT_OFFSET(max_msg_fifo_timeout), opts_parse_uint32, NULL, 1 }, + { "sm_priority", OPT_OFFSET(sm_priority), opts_parse_uint8, opts_setup_sm_priority, 1 }, + { "lmc", OPT_OFFSET(lmc), opts_parse_uint8, NULL, 1 }, + { "lmc_esp0", OPT_OFFSET(lmc_esp0), opts_parse_boolean, NULL, 1 }, + { "max_op_vls", OPT_OFFSET(max_op_vls), opts_parse_uint8, NULL, 1 }, + { "force_link_speed", OPT_OFFSET(force_link_speed), opts_parse_uint8, NULL, 1 }, + { "reassign_lids", OPT_OFFSET(reassign_lids), opts_parse_boolean, NULL, 1 }, + { "ignore_other_sm", OPT_OFFSET(ignore_other_sm), opts_parse_boolean, NULL, 1 }, + { "single_thread", OPT_OFFSET(single_thread), opts_parse_boolean, NULL, 0 }, + { "disable_multicast", OPT_OFFSET(disable_multicast), opts_parse_boolean, NULL, 1 }, + { "subnet_timeout", OPT_OFFSET(subnet_timeout), opts_parse_uint8, NULL, 1 }, + { "packet_life_time", OPT_OFFSET(packet_life_time), opts_parse_uint8, NULL, 1 }, + { "vl_stall_count", OPT_OFFSET(vl_stall_count), opts_parse_uint8, NULL, 1 }, + { "leaf_vl_stall_count", OPT_OFFSET(leaf_vl_stall_count), opts_parse_uint8, NULL, 1 }, + { "head_of_queue_lifetime", OPT_OFFSET(head_of_queue_lifetime), opts_parse_uint8, NULL, 1 }, + { "leaf_head_of_queue_lifetime", OPT_OFFSET(leaf_head_of_queue_lifetime), opts_parse_uint8, NULL, 1 }, + { "local_phy_errors_threshold", OPT_OFFSET(local_phy_errors_threshold), opts_parse_uint8, NULL, 1 }, + { "overrun_errors_threshold", OPT_OFFSET(overrun_errors_threshold), opts_parse_uint8, NULL, 1 }, + { "sminfo_polling_timeout", OPT_OFFSET(sminfo_polling_timeout), opts_parse_uint32, opts_setup_sminfo_polling_timeout, 1 }, + { "polling_retry_number", OPT_OFFSET(polling_retry_number), opts_parse_uint32, NULL, 1 }, + { "force_heavy_sweep", OPT_OFFSET(force_heavy_sweep), opts_parse_boolean, NULL, 1 }, + { "port_prof_ignore_file", OPT_OFFSET(port_prof_ignore_file), opts_parse_charp, NULL, 1 }, + { "port_profile_switch_nodes", OPT_OFFSET(port_profile_switch_nodes), 
opts_parse_boolean, NULL, 1 }, + { "sweep_on_trap", OPT_OFFSET(sweep_on_trap), opts_parse_boolean, NULL, 1 }, + { "routing_engine", OPT_OFFSET(routing_engine_names), opts_parse_charp, opts_setup_routing_engine, 1 }, + { "connect_roots", OPT_OFFSET(connect_roots), opts_parse_boolean, NULL, 1 }, + { "use_ucast_cache", OPT_OFFSET(use_ucast_cache), opts_parse_boolean, NULL, 1 }, + { "log_file", OPT_OFFSET(log_file), opts_parse_charp, NULL, 0 }, + { "log_max_size", OPT_OFFSET(log_max_size), opts_parse_uint32, opts_setup_log_max_size }, + { "log_flags", OPT_OFFSET(log_flags), opts_parse_uint8, opts_setup_log_flags, 1 }, + { "force_log_flush", OPT_OFFSET(force_log_flush), opts_parse_boolean, opts_setup_force_log_flush, 1 }, + { "accum_log_file", OPT_OFFSET(accum_log_file), opts_parse_boolean, opts_setup_accum_log_file, 1 }, + { "partition_config_file", OPT_OFFSET(partition_config_file), opts_parse_charp, NULL, 1 }, + { "no_partition_enforcement", OPT_OFFSET(no_partition_enforcement), opts_parse_boolean, NULL, 1 }, + { "qos", OPT_OFFSET(qos), opts_parse_boolean, NULL, 1 }, + { "qos_policy_file", OPT_OFFSET(qos_policy_file), opts_parse_charp, NULL, 1 }, + { "dump_files_dir", OPT_OFFSET(dump_files_dir), opts_parse_charp, NULL, 1 }, + { "lid_matrix_dump_file", OPT_OFFSET(lid_matrix_dump_file), opts_parse_charp, NULL, 1 }, + { "lfts_file", OPT_OFFSET(lfts_file), opts_parse_charp, NULL, 1 }, + { "root_guid_file", OPT_OFFSET(root_guid_file), opts_parse_charp, NULL, 1 }, + { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 1 }, + { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 1 }, + { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 1 }, + { "sa_db_file", OPT_OFFSET(sa_db_file), opts_parse_charp, NULL, 1 }, + { "do_mesh_analysis", OPT_OFFSET(do_mesh_analysis), opts_parse_boolean, NULL, 1 }, + { "exit_on_fatal", OPT_OFFSET(exit_on_fatal), opts_parse_boolean, NULL, 1 }, + { "honor_guid2lid_file", 
OPT_OFFSET(honor_guid2lid_file), opts_parse_boolean, NULL, 1 }, + { "daemon", OPT_OFFSET(daemon), opts_parse_boolean, NULL, 0 }, + { "sm_inactive", OPT_OFFSET(sm_inactive), opts_parse_boolean, NULL, 1 }, + { "babbling_port_policy", OPT_OFFSET(babbling_port_policy), opts_parse_boolean, NULL, 1 }, + +#ifdef ENABLE_OSM_PERF_MGR + { "perfmgr", OPT_OFFSET(perfmgr), opts_parse_boolean, NULL, 0 }, + { "perfmgr_redir", OPT_OFFSET(perfmgr_redir), opts_parse_boolean, NULL, 0 }, + { "perfmgr_sweep_time_s", OPT_OFFSET(perfmgr_sweep_time_s), opts_parse_uint16, NULL, 0 }, + { "perfmgr_max_outstanding_queries", OPT_OFFSET(perfmgr_max_outstanding_queries), opts_parse_uint32, NULL, 0 }, + { "event_db_dump_file", OPT_OFFSET(event_db_dump_file), opts_parse_charp, NULL, 0 }, +#endif /* ENABLE_OSM_PERF_MGR */ + + { "event_plugin_name", OPT_OFFSET(event_plugin_name), opts_parse_charp, NULL, 0 }, + { "node_name_map_name", OPT_OFFSET(node_name_map_name), opts_parse_charp, NULL, 0 }, + + { "qos_max_vls", OPT_OFFSET(qos_options.max_vls), opts_parse_uint32, NULL, 1 }, + { "qos_high_limit", OPT_OFFSET(qos_options.high_limit), opts_parse_int32, NULL, 1 }, + { "qos_vlarb_high", OPT_OFFSET(qos_options.vlarb_high), opts_parse_charp, NULL, 1 }, + { "qos_vlarb_low", OPT_OFFSET(qos_options.vlarb_low), opts_parse_charp, NULL, 1 }, + { "qos_sl2vl", OPT_OFFSET(qos_options.sl2vl), opts_parse_charp, NULL, 1 }, + + { "qos_ca_max_vls", OPT_OFFSET(qos_ca_options.max_vls), opts_parse_uint32, NULL, 1 }, + { "qos_ca_high_limit", OPT_OFFSET(qos_ca_options.high_limit), opts_parse_int32, NULL, 1 }, + { "qos_ca_vlarb_high", OPT_OFFSET(qos_ca_options.vlarb_high), opts_parse_charp, NULL, 1 }, + { "qos_ca_vlarb_low", OPT_OFFSET(qos_ca_options.vlarb_low), opts_parse_charp, NULL, 1 }, + { "qos_ca_sl2vl", OPT_OFFSET(qos_ca_options.sl2vl), opts_parse_charp, NULL, 1 }, + + { "qos_sw0_max_vls", OPT_OFFSET(qos_sw0_options.max_vls), opts_parse_uint32, NULL, 1 }, + { "qos_sw0_high_limit", 
OPT_OFFSET(qos_sw0_options.high_limit), opts_parse_int32, NULL, 1 }, + { "qos_sw0_vlarb_high", OPT_OFFSET(qos_sw0_options.vlarb_high), opts_parse_charp, NULL, 1 }, + { "qos_sw0_vlarb_low", OPT_OFFSET(qos_sw0_options.vlarb_low), opts_parse_charp, NULL, 1 }, + { "qos_sw0_sl2vl", OPT_OFFSET(qos_sw0_options.sl2vl), opts_parse_charp, NULL, 1 }, + + { "qos_swe_max_vls", OPT_OFFSET(qos_swe_options.max_vls), opts_parse_uint32, NULL, 1 }, + { "qos_swe_high_limit", OPT_OFFSET(qos_swe_options.high_limit), opts_parse_int32, NULL, 1 }, + { "qos_swe_vlarb_high", OPT_OFFSET(qos_swe_options.vlarb_high), opts_parse_charp, NULL, 1 }, + { "qos_swe_vlarb_low", OPT_OFFSET(qos_swe_options.vlarb_low), opts_parse_charp, NULL, 1 }, + { "qos_swe_sl2vl", OPT_OFFSET(qos_swe_options.sl2vl), opts_parse_charp, NULL, 1 }, + + { "qos_rtr_max_vls", OPT_OFFSET(qos_rtr_options.max_vls), opts_parse_uint32, NULL, 1 }, + { "qos_rtr_high_limit", OPT_OFFSET(qos_rtr_options.high_limit), opts_parse_int32, NULL, 1 }, + { "qos_rtr_vlarb_high", OPT_OFFSET(qos_rtr_options.vlarb_high), opts_parse_charp, NULL, 1 }, + { "qos_rtr_vlarb_low", OPT_OFFSET(qos_rtr_options.vlarb_low), opts_parse_charp, NULL, 1 }, + { "qos_rtr_sl2vl", OPT_OFFSET(qos_rtr_options.sl2vl), opts_parse_charp, NULL, 1 }, + + { "enable_quirks", OPT_OFFSET(enable_quirks), opts_parse_boolean, NULL, 1 }, + { "no_clients_rereg", OPT_OFFSET(no_clients_rereg), opts_parse_boolean, NULL, 1 }, + { "prefix_routes_file", OPT_OFFSET(prefix_routes_file), opts_parse_charp, NULL, 1 }, + { "consolidate_ipv6_snm_req", OPT_OFFSET(consolidate_ipv6_snm_req), opts_parse_boolean, NULL, 1 }, + {0} +}; + + +static void opts_setup_log_flags(osm_subn_t *p_subn, void *p_val) +{ + p_subn->p_osm->log.level = *((uint8_t *) p_val); +} + +static void opts_setup_force_log_flush(osm_subn_t *p_subn, void *p_val) +{ + p_subn->p_osm->log.flush = *((boolean_t *) p_val); +} + +static void opts_setup_accum_log_file(osm_subn_t *p_subn, void *p_val) +{ + 
p_subn->p_osm->log.accum_log_file = *((boolean_t *) p_val); +} + +static void opts_setup_log_max_size(osm_subn_t *p_subn, void *p_val) +{ + uint32_t log_max_size = *((uint32_t *) p_val); + + p_subn->p_osm->log.max_size = log_max_size << 20; /* convert from MB to Bytes */ +} + +static void opts_setup_sminfo_polling_timeout(osm_subn_t *p_subn, void *p_val) +{ + osm_sm_t *p_sm; + uint32_t sminfo_polling_timeout = *((uint32_t *) p_val); + + p_sm = &p_subn->p_osm->sm; + cl_timer_stop(&p_sm->polling_timer); + cl_timer_start(&p_sm->polling_timer, sminfo_polling_timeout); +} + +static void opts_setup_routing_engine(osm_subn_t *p_subn, void *p_val) +{ + char *routing_engine_names = (char *) p_val; + + destroy_routing_engines(p_subn->p_osm); + setup_routing_engines(p_subn->p_osm, routing_engine_names); +} + +static void opts_setup_sm_priority(osm_subn_t *p_subn, void *p_val) +{ + osm_sm_t *p_sm; + uint8_t sm_priority = *((uint8_t *) p_val); + + p_sm = &p_subn->p_osm->sm; + osm_set_sm_priority(p_sm, sm_priority); +} + /********************************************************************** **********************************************************************/ void osm_subn_construct(IN osm_subn_t * const p_subn) @@ -470,137 +652,167 @@ static void log_config_value(char *name, const char *fmt, ...) 
} static void -opts_unpack_net64(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint64_t * p_val) +opts_parse_net64(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint64_t val = strtoull(p_val_str, NULL, 0); - if (cl_hton64(val) != *p_val) { - log_config_value(p_key, "0x%016" PRIx64, val); - *p_val = cl_ntoh64(val); - } + uint64_t *p_val = (uint64_t *) p_v; + uint64_t val = strtoull(p_val_str, NULL, 0); + + if (cl_hton64(val) != *p_val) { + log_config_value(p_key, "0x%016" PRIx64, val); + if (pfn) + pfn(p_subn, &val); + *p_val = cl_ntoh64(val); } } /********************************************************************** **********************************************************************/ static void -opts_unpack_uint32(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint32_t * p_val) +opts_parse_uint32(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint32_t val = strtoul(p_val_str, NULL, 0); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - *p_val = val; - } + uint32_t *p_val = (uint32_t *) p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_int32(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN int32_t * p_val) +opts_parse_int32(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - int32_t val = strtol(p_val_str, NULL, 0); - if (val != *p_val) { - log_config_value(p_key, "%d", val); - *p_val = val; - } + int32_t *p_val = (int32_t *) p_v; + int32_t val = strtol(p_val_str, NULL, 0); 
+ + if (val != *p_val) { + log_config_value(p_key, "%d", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_uint16(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint16_t * p_val) +opts_parse_uint16(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - *p_val = val; - } + uint16_t *p_val = (uint16_t *) p_v; + uint16_t val = (uint16_t) strtoul(p_val_str, NULL, 0); + + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_net16(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint16_t * p_val) +opts_parse_net16(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint32_t val; - val = strtoul(p_val_str, NULL, 0); - CL_ASSERT(val < 0x10000); - if (cl_hton32(val) != *p_val) { - log_config_value(p_key, "0x%04x", val); - *p_val = cl_hton16((uint16_t) val); - } + uint16_t *p_val = (uint16_t *) p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + CL_ASSERT(val < 0x10000); + if (cl_hton32(val) != *p_val) { + log_config_value(p_key, "0x%04x", val); + if (pfn) + pfn(p_subn, &val); + *p_val = cl_hton16((uint16_t) val); } } /********************************************************************** **********************************************************************/ static void -opts_unpack_uint8(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN uint8_t * p_val) 
+opts_parse_uint8(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key)) { - uint32_t val; - val = strtoul(p_val_str, NULL, 0); - CL_ASSERT(val < 0x100); - if (val != *p_val) { - log_config_value(p_key, "%u", val); - *p_val = (uint8_t) val; - } + uint8_t *p_val = (uint8_t *) p_v; + uint32_t val = strtoul(p_val_str, NULL, 0); + + CL_ASSERT(val < 0x100); + if (val != *p_val) { + log_config_value(p_key, "%u", val); + if (pfn) + pfn(p_subn, &val); + *p_val = (uint8_t) val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_boolean(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN boolean_t * p_val) +opts_parse_boolean(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key) && p_val_str) { - boolean_t val; - if (strcmp("TRUE", p_val_str)) - val = FALSE; - else - val = TRUE; - - if (val != *p_val) { - log_config_value(p_key, "%s", p_val_str); - *p_val = val; - } + boolean_t *p_val = (boolean_t *) p_v; + boolean_t val; + + if (!p_val_str) + return; + + if (strcmp("TRUE", p_val_str)) + val = FALSE; + else + val = TRUE; + + if (val != *p_val) { + log_config_value(p_key, "%s", p_val_str); + if (pfn) + pfn(p_subn, &val); + *p_val = val; } } /********************************************************************** **********************************************************************/ static void -opts_unpack_charp(IN char *p_req_key, - IN char *p_key, IN char *p_val_str, IN char **p_val) +opts_parse_charp(IN osm_subn_t *p_subn, + IN char *p_key, IN char *p_val_str, + IN void *p_v, IN setup_fn_t pfn) { - if (!strcmp(p_req_key, p_key) && p_val_str) { - const char *current_str = *p_val ? 
*p_val : null_str ; - if (strcmp(p_val_str, current_str)) { - log_config_value(p_key, "%s", p_val_str); - /* special case the "(null)" string */ - if (strcmp(null_str, p_val_str) == 0) { - *p_val = NULL; - } else { - /* - Ignore the possible memory leak here; - the pointer may be to a static default. - */ - *p_val = strdup(p_val_str); - } + char **p_val = (char **) p_v; + const char *current_str = *p_val ? *p_val : null_str ; + + if (!p_val_str) + return; + + if (strcmp(p_val_str, current_str)) { + log_config_value(p_key, "%s", p_val_str); + /* special case the "(null)" string */ + if (strcmp(null_str, p_val_str) == 0) { + if (pfn) + pfn(p_subn, NULL); + *p_val = NULL; + } else { + if (pfn) + pfn(p_subn, p_val_str); + /* + Ignore the possible memory leak here; + the pointer may be to a static default. + */ + *p_val = strdup(p_val_str); } } } @@ -631,25 +843,6 @@ static char *clean_val(char *val) /********************************************************************** **********************************************************************/ -static void -subn_parse_qos_options(IN const char *prefix, - IN char *p_key, - IN char *p_val_str, IN osm_qos_options_t * opt) -{ - char name[256]; - - snprintf(name, sizeof(name), "%s_max_vls", prefix); - opts_unpack_uint32(name, p_key, p_val_str, &opt->max_vls); - snprintf(name, sizeof(name), "%s_high_limit", prefix); - opts_unpack_int32(name, p_key, p_val_str, &opt->high_limit); - snprintf(name, sizeof(name), "%s_vlarb_high", prefix); - opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_high); - snprintf(name, sizeof(name), "%s_vlarb_low", prefix); - opts_unpack_charp(name, p_key, p_val_str, &opt->vlarb_low); - snprintf(name, sizeof(name), "%s_sl2vl", prefix); - opts_unpack_charp(name, p_key, p_val_str, &opt->sl2vl); -} - static int subn_dump_qos_options(FILE * file, const char *set_name, @@ -1000,6 +1193,8 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) char line[1024]; FILE *opts_file; char 
*p_key, *p_val; + const opt_rec_t *r; + char *p_field; opts_file = fopen(file_name, "r"); if (!opts_file) { @@ -1023,231 +1218,14 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) p_val = clean_val(p_val); - opts_unpack_net64("guid", p_key, p_val, &p_opts->guid); - - opts_unpack_net64("m_key", p_key, p_val, &p_opts->m_key); - - opts_unpack_net64("sm_key", p_key, p_val, &p_opts->sm_key); - - opts_unpack_net64("sa_key", p_key, p_val, &p_opts->sa_key); - - opts_unpack_net64("subnet_prefix", - p_key, p_val, &p_opts->subnet_prefix); - - opts_unpack_net16("m_key_lease_period", - p_key, p_val, &p_opts->m_key_lease_period); - - opts_unpack_uint32("sweep_interval", - p_key, p_val, &p_opts->sweep_interval); - - opts_unpack_uint32("max_wire_smps", - p_key, p_val, &p_opts->max_wire_smps); - - opts_unpack_charp("console", p_key, p_val, &p_opts->console); - - opts_unpack_uint16("console_port", - p_key, p_val, &p_opts->console_port); - - opts_unpack_uint32("transaction_timeout", - p_key, p_val, &p_opts->transaction_timeout); - - opts_unpack_uint32("max_msg_fifo_timeout", - p_key, p_val, &p_opts->max_msg_fifo_timeout); - - opts_unpack_uint8("sm_priority", - p_key, p_val, &p_opts->sm_priority); - - opts_unpack_uint8("lmc", p_key, p_val, &p_opts->lmc); - - opts_unpack_boolean("lmc_esp0", - p_key, p_val, &p_opts->lmc_esp0); - - opts_unpack_uint8("max_op_vls", - p_key, p_val, &p_opts->max_op_vls); - - opts_unpack_uint8("force_link_speed", - p_key, p_val, &p_opts->force_link_speed); - - opts_unpack_boolean("reassign_lids", - p_key, p_val, &p_opts->reassign_lids); - - opts_unpack_boolean("ignore_other_sm", - p_key, p_val, &p_opts->ignore_other_sm); - - opts_unpack_boolean("single_thread", - p_key, p_val, &p_opts->single_thread); - - opts_unpack_boolean("disable_multicast", - p_key, p_val, &p_opts->disable_multicast); - - opts_unpack_boolean("force_log_flush", - p_key, p_val, &p_opts->force_log_flush); - - opts_unpack_uint8("subnet_timeout", - p_key, p_val, 
&p_opts->subnet_timeout); - - opts_unpack_uint8("packet_life_time", - p_key, p_val, &p_opts->packet_life_time); - - opts_unpack_uint8("vl_stall_count", - p_key, p_val, &p_opts->vl_stall_count); - - opts_unpack_uint8("leaf_vl_stall_count", - p_key, p_val, &p_opts->leaf_vl_stall_count); - - opts_unpack_uint8("head_of_queue_lifetime", p_key, p_val, - &p_opts->head_of_queue_lifetime); - - opts_unpack_uint8("leaf_head_of_queue_lifetime", p_key, p_val, - &p_opts->leaf_head_of_queue_lifetime); - - opts_unpack_uint8("local_phy_errors_threshold", p_key, p_val, - &p_opts->local_phy_errors_threshold); - - opts_unpack_uint8("overrun_errors_threshold", p_key, p_val, - &p_opts->overrun_errors_threshold); - - opts_unpack_uint32("sminfo_polling_timeout", p_key, p_val, - &p_opts->sminfo_polling_timeout); - - opts_unpack_uint32("polling_retry_number", - p_key, p_val, &p_opts->polling_retry_number); - - opts_unpack_boolean("force_heavy_sweep", - p_key, p_val, &p_opts->force_heavy_sweep); - - opts_unpack_uint8("log_flags", - p_key, p_val, &p_opts->log_flags); - - opts_unpack_charp("port_prof_ignore_file", p_key, p_val, - &p_opts->port_prof_ignore_file); - - opts_unpack_boolean("port_profile_switch_nodes", p_key, p_val, - &p_opts->port_profile_switch_nodes); - - opts_unpack_boolean("sweep_on_trap", - p_key, p_val, &p_opts->sweep_on_trap); - - opts_unpack_charp("routing_engine", - p_key, p_val, &p_opts->routing_engine_names); - - opts_unpack_boolean("connect_roots", - p_key, p_val, &p_opts->connect_roots); - - opts_unpack_boolean("use_ucast_cache", - p_key, p_val, &p_opts->use_ucast_cache); - - opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); - - opts_unpack_uint32("log_max_size", p_key, p_val, - (void *) & p_opts->log_max_size); - p_opts->log_max_size *= 1024 * 1024; /* convert to MB */ - - opts_unpack_charp("partition_config_file", - p_key, p_val, &p_opts->partition_config_file); - - opts_unpack_boolean("no_partition_enforcement", p_key, p_val, - 
&p_opts->no_partition_enforcement); - - opts_unpack_boolean("qos", p_key, p_val, &p_opts->qos); - - opts_unpack_charp("qos_policy_file", - p_key, p_val, &p_opts->qos_policy_file); - - opts_unpack_boolean("accum_log_file", - p_key, p_val, &p_opts->accum_log_file); - - opts_unpack_charp("dump_files_dir", - p_key, p_val, &p_opts->dump_files_dir); - - opts_unpack_charp("lid_matrix_dump_file", - p_key, p_val, &p_opts->lid_matrix_dump_file); - - opts_unpack_charp("lfts_file", - p_key, p_val, &p_opts->lfts_file); - - opts_unpack_charp("root_guid_file", - p_key, p_val, &p_opts->root_guid_file); - - opts_unpack_charp("cn_guid_file", - p_key, p_val, &p_opts->cn_guid_file); - - opts_unpack_charp("ids_guid_file", - p_key, p_val, &p_opts->ids_guid_file); - - opts_unpack_charp("guid_routing_order_file", p_key, p_val, - &p_opts->guid_routing_order_file); - - opts_unpack_charp("sa_db_file", - p_key, p_val, &p_opts->sa_db_file); - - opts_unpack_boolean("do_mesh_analysis", - p_key, p_val, &p_opts->do_mesh_analysis); - - opts_unpack_boolean("exit_on_fatal", - p_key, p_val, &p_opts->exit_on_fatal); - - opts_unpack_boolean("honor_guid2lid_file", - p_key, p_val, &p_opts->honor_guid2lid_file); - - opts_unpack_boolean("daemon", p_key, p_val, &p_opts->daemon); - - opts_unpack_boolean("sm_inactive", - p_key, p_val, &p_opts->sm_inactive); - - opts_unpack_boolean("babbling_port_policy", - p_key, p_val, - &p_opts->babbling_port_policy); - -#ifdef ENABLE_OSM_PERF_MGR - opts_unpack_boolean("perfmgr", p_key, p_val, &p_opts->perfmgr); - - opts_unpack_boolean("perfmgr_redir", - p_key, p_val, &p_opts->perfmgr_redir); - - opts_unpack_uint16("perfmgr_sweep_time_s", - p_key, p_val, &p_opts->perfmgr_sweep_time_s); - - opts_unpack_uint32("perfmgr_max_outstanding_queries", - p_key, p_val, - &p_opts->perfmgr_max_outstanding_queries); - - opts_unpack_charp("event_db_dump_file", - p_key, p_val, &p_opts->event_db_dump_file); -#endif /* ENABLE_OSM_PERF_MGR */ - - opts_unpack_charp("event_plugin_name", - p_key, 
p_val, &p_opts->event_plugin_name); - - opts_unpack_charp("node_name_map_name", - p_key, p_val, &p_opts->node_name_map_name); - - subn_parse_qos_options("qos", - p_key, p_val, &p_opts->qos_options); - - subn_parse_qos_options("qos_ca", - p_key, p_val, &p_opts->qos_ca_options); - - subn_parse_qos_options("qos_sw0", - p_key, p_val, &p_opts->qos_sw0_options); - - subn_parse_qos_options("qos_swe", - p_key, p_val, &p_opts->qos_swe_options); - - subn_parse_qos_options("qos_rtr", - p_key, p_val, &p_opts->qos_rtr_options); - - opts_unpack_boolean("enable_quirks", - p_key, p_val, &p_opts->enable_quirks); - - opts_unpack_boolean("no_clients_rereg", - p_key, p_val, &p_opts->no_clients_rereg); - - opts_unpack_charp("prefix_routes_file", - p_key, p_val, &p_opts->prefix_routes_file); + for (r = opt_tbl; r->name; r++) { + if (strcmp(r->name, p_key)) + continue; - opts_unpack_boolean("consolidate_ipv6_snm_req", p_key, p_val, - &p_opts->consolidate_ipv6_snm_req); + p_field = (char *)p_opts + r->opt_offset; + /* don't call setup function first time */ + r->parse_fn(NULL, p_key, p_val, p_field, NULL); + } } fclose(opts_file); @@ -1258,61 +1236,58 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) int osm_subn_rescan_conf_files(IN osm_subn_t * const p_subn) { + osm_subn_opt_t *p_opts = &p_subn->opt; + const opt_rec_t *r; FILE *opts_file; char line[1024]; - char *p_key, *p_val, *p_last; + char *p_key, *p_val; + char *p_field; - if (!p_subn->opt.config_file) + if (!p_opts->config_file) return 0; - opts_file = fopen(p_subn->opt.config_file, "r"); + opts_file = fopen(p_opts->config_file, "r"); if (!opts_file) { if (errno == ENOENT) return 1; OSM_LOG(&p_subn->p_osm->log, OSM_LOG_ERROR, "cannot open file \'%s\': %s\n", - p_subn->opt.config_file, strerror(errno)); + p_opts->config_file, strerror(errno)); return -1; } - subn_free_qos_options(&p_subn->opt.qos_options); - subn_free_qos_options(&p_subn->opt.qos_ca_options); - 
subn_free_qos_options(&p_subn->opt.qos_sw0_options); - subn_free_qos_options(&p_subn->opt.qos_swe_options); - subn_free_qos_options(&p_subn->opt.qos_rtr_options); + subn_free_qos_options(&p_opts->qos_options); + subn_free_qos_options(&p_opts->qos_ca_options); + subn_free_qos_options(&p_opts->qos_sw0_options); + subn_free_qos_options(&p_opts->qos_swe_options); + subn_free_qos_options(&p_opts->qos_rtr_options); - subn_init_qos_options(&p_subn->opt.qos_options); - subn_init_qos_options(&p_subn->opt.qos_ca_options); - subn_init_qos_options(&p_subn->opt.qos_sw0_options); - subn_init_qos_options(&p_subn->opt.qos_swe_options); - subn_init_qos_options(&p_subn->opt.qos_rtr_options); + subn_init_qos_options(&p_opts->qos_options); + subn_init_qos_options(&p_opts->qos_ca_options); + subn_init_qos_options(&p_opts->qos_sw0_options); + subn_init_qos_options(&p_opts->qos_swe_options); + subn_init_qos_options(&p_opts->qos_rtr_options); while (fgets(line, 1023, opts_file) != NULL) { /* get the first token */ - p_key = strtok_r(line, " \t\n", &p_last); - if (p_key) { - p_val = strtok_r(NULL, " \t\n", &p_last); - - subn_parse_qos_options("qos", p_key, p_val, - &p_subn->opt.qos_options); - - subn_parse_qos_options("qos_ca", p_key, p_val, - &p_subn->opt.qos_ca_options); - - subn_parse_qos_options("qos_sw0", p_key, p_val, - &p_subn->opt.qos_sw0_options); + p_key = strtok_r(line, " \t\n", &p_val); + if (!p_key) + continue; - subn_parse_qos_options("qos_swe", p_key, p_val, - &p_subn->opt.qos_swe_options); + p_val = clean_val(p_val); - subn_parse_qos_options("qos_rtr", p_key, p_val, - &p_subn->opt.qos_rtr_options); + for (r = opt_tbl; r->name; r++) { + if (!r->can_update || strcmp(r->name, p_key)) + continue; + p_field = (char *)p_opts + r->opt_offset; + r->parse_fn(p_subn, p_key, + p_val, p_field, r->setup_fn); } } fclose(opts_file); - osm_subn_verify_config(&p_subn->opt); + osm_subn_verify_config(p_opts); osm_parse_prefix_routes_file(p_subn); -- 1.5.5 From sashak at voltaire.com Mon Jan 
26 06:38:29 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jan 2009 16:38:29 +0200 Subject: [ofa-general] ***SPAM*** Unable to get IPoIB working In-Reply-To: References: Message-ID: <20090126143822.GG5814@sashak.voltaire.com> Hi Chuck, On 09:24 Mon 26 Jan , Chuck Hartley wrote: > > I just brought up two new machines with ConnectX adapters and have > been unable to get them to work using IPoIB. They are the only two > hosts on the network and are connected via a Mellanox MTS3600 switch. > All are running OFED 1.4 on Fedora 9, and the latest firmware (2.6.0) > on the HCAs. I am unable to ping from one adapter to the other: > These are our first ConnectX HCAs. We have a number of hosts using > InfiniHost HCAs, and IPoIB "just worked" without further configuration > after OFED installation. > > # ping 172.16.0.70 > PING 172.16.0.70 (172.16.0.70) 56(84) bytes of data. > >From 172.16.0.71 icmp_seq=2 Destination Host Unreachable > > # arp -n 172.16.0.70 > Address HWtype HWaddress Flags Mask Iface > 172.16.0.70 (incomplete) ib0 > > # ifconfig ib0 > ib0 Link encap:InfiniBand HWaddr > 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 > inet addr:172.16.0.71 Bcast:172.16.255.255 Mask:255.255.0.0 > UP BROADCAST MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) > > The ifconfig output looks ok, but this output from ip seems to contradict it(?): > > # ip addr show ib0 > 4: ib0: mtu 65520 qdisc pfifo_fast > state DOWN qlen 256 > link/infiniband > 80:00:04:04:fe:80:00:00:00:00:00:00:00:30:48:c6:4c:18:00:01 brd > 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff > inet 172.16.0.71/16 brd 172.16.255.255 scope global ib0 > > # saquery > NodeRecord dump: > lid.....................0x5 > reserved................0x0 > base_version............0x1 > class_version...........0x1 > 
node_type...............Channel Adapter > num_ports...............0x2 > sys_guid................0x0002c9030003360f > node_guid...............0x0002c9030003360c > port_guid...............0x0002c9030003360d > partition_cap...........0x80 > device_id...............0x673C > revision................0xA0 > port_num................0x1 > vendor_id...............0x2C9 > NodeDescription.........linux71 HCA-1 > NodeRecord dump: > lid.....................0x2 > reserved................0x0 > base_version............0x1 > class_version...........0x1 > node_type...............Switch > num_ports...............0x24 > sys_guid................0x0002c9020040536b > node_guid...............0x0002c90200405368 > port_guid...............0x0002c90200405368 > partition_cap...........0x8 > device_id...............0xBD36 > revision................0xA0 > port_num................0x1 > vendor_id...............0x2C9 > NodeDescription.........Infiniscale-IV Mellanox Technologies > NodeRecord dump: > lid.....................0x4 > reserved................0x0 > base_version............0x1 > class_version...........0x1 > node_type...............Channel Adapter > num_ports...............0x2 > sys_guid................0x0002c90300032de3 > node_guid...............0x0002c90300032de0 > port_guid...............0x0002c90300032de1 > partition_cap...........0x80 > device_id...............0x673C > revision................0xA0 > port_num................0x1 > vendor_id...............0x2C9 > NodeDescription.........linux70 HCA-1 > > > Any ideas on what is going on here? Thanks for an help you can provide. Which SM are you using? Any errors in dmesg? Look at ibnetdiscover output that all links are in good (speed/width) state. 
Sasha

From Robert at saq.co.uk Mon Jan 26 06:37:22 2009
From: Robert at saq.co.uk (Robert Dunkley)
Date: Mon, 26 Jan 2009 14:37:22 -0000
Subject: [ofa-general] ***SPAM*** Unable to get IPoIB working
References:
Message-ID:

Looks like the InfiniBand side of things is fine (although there are many more knowledgeable people than me on this list).

It might be the MTU value that is causing the issue. I know that, at least in the case of CentOS, it does not like values that near to the theoretical IP maximum, so try setting it to 1500 just for testing.

Also, since you have such a high MTU value, I assume you are running IPoIB in "Connected" mode, so maybe you should disable multicast for the adaptor (Connected mode does not support multicast AFAIK).

The last possibility I can think of is some sort of default first ID being assigned to the older InfiniBand card in the system, or IPoIB possibly trying to use the other card. If it's not needed, try removing or disabling the older card.

Otherwise, maybe someone else here can help you.

Rob

-----Original Message-----
From: Chuck Hartley [mailto:hartlch14 at gmail.com]
Sent: 26 January 2009 14:30
To: Robert Dunkley
Subject: Re: [ofa-general] ***SPAM*** Unable to get IPoIB working

On Mon, Jan 26, 2009 at 9:24 AM, Robert Dunkley wrote:
> Please post the results of ibstat.
>
> Rob

ibstat output:

# ibstat
CA 'mlx4_0'
	CA type: MT26428
	Number of ports: 2
	Firmware version: 2.6.0
	Hardware version: a0
	Node GUID: 0x0002c9030003360c
	System image GUID: 0x0002c9030003360f
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 5
		LMC: 0
		SM lid: 4
		Capability mask: 0x02510868
		Port GUID: 0x0002c9030003360d
	Port 2:
		State: Down
		Physical state: Polling
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x02510868
		Port GUID: 0x0002c9030003360e
CA 'mthca0'
	CA type: MT25204
	Number of ports: 1
	Firmware version: 1.2.0
	Hardware version: a0
	Node GUID: 0x003048c64c180000
	System image GUID: 0x003048c64c180003
	Port 1:
		State: Down
		Physical state: Polling
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x02510a68
		Port GUID: 0x003048c64c180001

The SAQ Group
Registered Office: 18 Chapel Street, Petersfield, Hampshire GU32 3DZ
SAQ is the trading name of SEMTEC Limited. Registered in England & Wales
Company Number: 06481952
http://www.saqnet.co.uk AS29219
SAQ Group Delivers high quality, honestly priced communication and I.T. services to UK Business.
Broadband : Domains : Email : Hosting : CoLo : Servers : Racks : Transit : Backups : Managed Networks : Remote Support.
ISPA Member
Find us in http://www.thebestof.co.uk/petersfield

From sashak at voltaire.com Mon Jan 26 06:54:57 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 26 Jan 2009 16:54:57 +0200
Subject: [ofa-general] [PATCH] infiniband-diags/saquery: add lid parameter to NodeRecord query
Message-ID: <20090126145457.GH5814@sashak.voltaire.com>

Add optional LID parameter to NodeRecord SA Query.
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/man/saquery.8 | 2 +- infiniband-diags/src/saquery.c | 97 +++++++++++++++++++++++++-------------- 2 files changed, 63 insertions(+), 36 deletions(-) diff --git a/infiniband-diags/man/saquery.8 b/infiniband-diags/man/saquery.8 index 82a5fed..ccca4da 100644 --- a/infiniband-diags/man/saquery.8 +++ b/infiniband-diags/man/saquery.8 @@ -104,7 +104,7 @@ for node name map file format. Only used with the \fB\-O\fR and \fB\-U\fR optio .TP Supported query names (and aliases): ClassPortInfo (CPI) - NodeRecord (NR) + NodeRecord (NR) [lid] PortInfoRecord (PIR) [[lid]/[port]] SL2VLTableRecord (SL2VL) [[lid]/[in_port]/[out_port]] PKeyTableRecord (PKTR) [[lid]/[port]/[block]] diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index cb7a731..95cb8d3 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -129,15 +129,44 @@ static void print_node_desc(ib_node_record_t * node_record) } } +static void dump_node_record(void *data) +{ + ib_node_record_t *nr = data; + ib_node_info_t *ni = &nr->node_info; + + printf("NodeRecord dump:\n" + "\t\tlid.....................0x%X\n" + "\t\treserved................0x%X\n" + "\t\tbase_version............0x%X\n" + "\t\tclass_version...........0x%X\n" + "\t\tnode_type...............%s\n" + "\t\tnum_ports...............0x%X\n" + "\t\tsys_guid................0x%016" PRIx64 "\n" + "\t\tnode_guid...............0x%016" PRIx64 "\n" + "\t\tport_guid...............0x%016" PRIx64 "\n" + "\t\tpartition_cap...........0x%X\n" + "\t\tdevice_id...............0x%X\n" + "\t\trevision................0x%X\n" + "\t\tport_num................0x%X\n" + "\t\tvendor_id...............0x%X\n" + "\t\tNodeDescription.........%s\n", + cl_ntoh16(nr->lid), cl_ntoh16(nr->resv), + ni->base_version, ni->class_version, + ib_get_node_type_str(ni->node_type), ni->num_ports, + cl_ntoh64(ni->sys_guid), cl_ntoh64(ni->node_guid), + cl_ntoh64(ni->port_guid), cl_ntoh16(ni->partition_cap), 
+ cl_ntoh16(ni->device_id), cl_ntoh32(ni->revision), + ib_node_info_get_local_port_num(ni), + cl_ntoh32(ib_node_info_get_vendor_id(ni)), + clean_nodedesc((char *)nr->node_desc.description)); +} + static void print_node_record(ib_node_record_t * node_record) { - ib_node_info_t *p_ni = NULL; - ib_node_desc_t *p_nd = NULL; + ib_node_info_t *p_ni = &node_record->node_info; + ib_node_desc_t *p_nd = &node_record->node_desc; char *name; - p_ni = &(node_record->node_info); - p_nd = &(node_record->node_desc); - switch (node_print_desc) { case LID_ONLY: case UNIQUE_LID_ONLY: @@ -159,31 +188,7 @@ static void print_node_record(ib_node_record_t * node_record) break; } - printf("NodeRecord dump:\n" - "\t\tlid.....................0x%X\n" - "\t\treserved................0x%X\n" - "\t\tbase_version............0x%X\n" - "\t\tclass_version...........0x%X\n" - "\t\tnode_type...............%s\n" - "\t\tnum_ports...............0x%X\n" - "\t\tsys_guid................0x%016" PRIx64 "\n" - "\t\tnode_guid...............0x%016" PRIx64 "\n" - "\t\tport_guid...............0x%016" PRIx64 "\n" - "\t\tpartition_cap...........0x%X\n" - "\t\tdevice_id...............0x%X\n" - "\t\trevision................0x%X\n" - "\t\tport_num................0x%X\n" - "\t\tvendor_id...............0x%X\n" - "\t\tNodeDescription.........%s\n", - cl_ntoh16(node_record->lid), cl_ntoh16(node_record->resv), - p_ni->base_version, p_ni->class_version, - ib_get_node_type_str(p_ni->node_type), p_ni->num_ports, - cl_ntoh64(p_ni->sys_guid), cl_ntoh64(p_ni->node_guid), - cl_ntoh64(p_ni->port_guid), cl_ntoh16(p_ni->partition_cap), - cl_ntoh16(p_ni->device_id), cl_ntoh32(p_ni->revision), - ib_node_info_get_local_port_num(p_ni), - cl_ntoh32(ib_node_info_get_vendor_id(p_ni)), - clean_nodedesc((char *)node_record->node_desc.description)); + dump_node_record(node_record); } static void dump_path_record(void *data) @@ -1071,7 +1076,31 @@ static int query_class_port_info(const struct query_cmd *q, static int query_node_records(const 
struct query_cmd *q, osm_bind_handle_t h, int argc, char *argv[]) { - return print_node_records(h); + ib_node_record_t nr; + ib_net64_t comp_mask = 0; + int lid = 0; + ib_api_status_t status; + + if (argc > 0) + parse_lid_and_ports(h, argv[0], &lid, NULL, NULL); + + memset(&nr, 0, sizeof(nr)); + + if (lid > 0) { + nr.lid = cl_hton16(lid); + comp_mask |= IB_NR_COMPMASK_LID; + } + + status = get_any_records(h, IB_MAD_ATTR_NODE_RECORD, 0, + comp_mask, &nr, + ib_get_attr_offset(sizeof(nr)), 0); + if (status != IB_SUCCESS) + return status; + + dump_results(&result, dump_node_record); + return_mad(); + + return 0; } static int query_portinfo_records(const struct query_cmd *q, @@ -1099,7 +1128,6 @@ static int query_portinfo_records(const struct query_cmd *q, status = get_any_records(h, IB_MAD_ATTR_PORTINFO_RECORD, 0, comp_mask, &pir, ib_get_attr_offset(sizeof(pir)), 0); - if (status != IB_SUCCESS) return status; @@ -1454,7 +1482,7 @@ static const struct query_cmd query_cmds[] = { {"ClassPortInfo", "CPI", IB_MAD_ATTR_CLASS_PORT_INFO, NULL, query_class_port_info}, {"NodeRecord", "NR", IB_MAD_ATTR_NODE_RECORD, - NULL, query_node_records}, + "[lid]", query_node_records}, {"PortInfoRecord", "PIR", IB_MAD_ATTR_PORTINFO_RECORD, "[[lid]/[port]]", query_portinfo_records}, {"SL2VLTableRecord", "SL2VL", IB_MAD_ATTR_SLVL_RECORD, @@ -1757,9 +1785,8 @@ int main(int argc, char **argv) status = get_print_path_rec_gid(h, (ib_gid_t *) & src_addr.s6_addr, (ib_gid_t *) & dst_addr.s6_addr); - } else { + } else status = query_path_records(q, h, 0, NULL); - } break; case SAQUERY_CMD_CLASS_PORT_INFO: status = get_print_class_port_info(h); -- 1.6.0.4.766.g6fc4a From jackm at dev.mellanox.co.il Mon Jan 26 07:41:08 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 26 Jan 2009 17:41:08 +0200 Subject: [ofa-general] IPoIB kernel Oops -- possible race condition identified.
Message-ID: <200901261741.08824.jackm@dev.mellanox.co.il> The following Oops occurred several times on an X86 host when unloading the driver: (console command sequence: /etc/init.d/openibd start opensm & pkill -2 opensm /etc/init.d/openibd stop ) ******************************************************************** IP: [] :ib_ipoib:ipoib_mcast_join_task+0x193/0x217 *pde = 00000000 Oops: 0000 [#1] SMP ... Pid: 22483, comm: ipoib Not tainted (2.6.27.5 #1) EIP: 0060:[] EFLAGS: 00010286 CPU: 1 EIP is at ipoib_mcast_join_task+0x193/0x217 [ib_ipoib] EAX: 00000000 EBX: c2060480 ECX: 0005c700 EDX: ffffffff ESI: c20605dc EDI: c2060154 EBP: c2060480 ESP: f72aff64 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process ipoib (pid: 22483, ti=f72af000 task=f59fcdc0 task.ti=f72af000) Stack: c2060000 00000004 00000005 00000005 00000001 02500848 00001000 00000000 00000000 00010008 03000001 02001200 00000504 f509bbc0 c2060508 f8e678b6 00000000 c04307a8 f509bbc0 c0430e7c f509bbcc c0430f2f 00000000 f59fcdc0 Call Trace: [] ipoib_mcast_join_task+0x0/0x217 [ib_ipoib] [] run_workqueue+0x6a/0xdf [] worker_thread+0x0/0xbd [] worker_thread+0xb3/0xbd [] autoremove_wake_function+0x0/0x2d [] kthread+0x38/0x5d [] kthread+0x0/0x5d [] kernel_thread_helper+0x7/0x10 ======================= EIP: [] ipoib_mcast_join_task+0x193/0x217 [ib_ipoib] SS:ESP 0068:f72aff64 ********************************************************************** ipoib_mcast_join_task +0x193 is at (in file ipoib_multicast.c): priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); I think the problem is the following: priv->broadcast is NULLed out in procedure ipoib_mcast_dev_flush(), under the protection of a spinlock. However, in ipoib_mcast_join_task(), there is no spinlock protection in the access to priv->broadcast in the crash line given above. Note that there seems to be a race condition here. 
If the flush occurs after the following test at the start ipoib_mcast_join_task(): if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) return; then there is no protection at all later for priv->broadcast being NULLed elsewhere. - Jack From yosefe at Voltaire.COM Mon Jan 26 09:00:10 2009 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Mon, 26 Jan 2009 19:00:10 +0200 Subject: [ofa-general] Re: IPoIB kernel Oops -- possible race condition identified. In-Reply-To: <200901261741.08824.jackm@dev.mellanox.co.il> References: <200901261741.08824.jackm@dev.mellanox.co.il> Message-ID: <497DEC1A.2030104@Voltaire.COM> There's a patch of mine in OFED that's probably exposing a bug in ipoib. The bug is that priv->broadcast can be NULL-ified and join_task does not protect the check with the spinlock. The patch may expose the bug because it uses rtnl_lock(). However, in 2.6.28 kernel there's another version of this patch which does not take rtnl_lock, so the problem still exists but is probably much harder to reproduce. Please see https://kerneltrap.org/mailarchive/openfabrics-general/2009/1/13/4705114/thread What OFED version are you using? Jack Morgenstein wrote: > The following Oops occurred several times on an X86 host when unloading the driver: > (console command sequence: > /etc/init.d/openibd start > opensm & > pkill -2 opensm > /etc/init.d/openibd stop > ) > ******************************************************************** > IP: [] :ib_ipoib:ipoib_mcast_join_task+0x193/0x217 > *pde = 00000000 > Oops: 0000 [#1] SMP > ... 
> > Pid: 22483, comm: ipoib Not tainted (2.6.27.5 #1) > EIP: 0060:[] EFLAGS: 00010286 CPU: 1 > EIP is at ipoib_mcast_join_task+0x193/0x217 [ib_ipoib] > EAX: 00000000 EBX: c2060480 ECX: 0005c700 EDX: ffffffff > ESI: c20605dc EDI: c2060154 EBP: c2060480 ESP: f72aff64 > DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 > Process ipoib (pid: 22483, ti=f72af000 task=f59fcdc0 task.ti=f72af000) > Stack: c2060000 00000004 00000005 00000005 00000001 02500848 00001000 00000000 > 00000000 00010008 03000001 02001200 00000504 f509bbc0 c2060508 f8e678b6 > 00000000 c04307a8 f509bbc0 c0430e7c f509bbcc c0430f2f 00000000 f59fcdc0 > Call Trace: > [] ipoib_mcast_join_task+0x0/0x217 [ib_ipoib] > [] run_workqueue+0x6a/0xdf > [] worker_thread+0x0/0xbd > [] worker_thread+0xb3/0xbd > [] autoremove_wake_function+0x0/0x2d > [] kthread+0x38/0x5d > [] kthread+0x0/0x5d > [] kernel_thread_helper+0x7/0x10 > ======================= > EIP: [] ipoib_mcast_join_task+0x193/0x217 [ib_ipoib] SS:ESP 0068:f72aff64 > ********************************************************************** > ipoib_mcast_join_task +0x193 is at (in file ipoib_multicast.c): > priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > > I think the problem is the following: > priv->broadcast is NULLed out in procedure ipoib_mcast_dev_flush(), under the protection > of a spinlock. > > However, in ipoib_mcast_join_task(), there is no spinlock protection in the access to > priv->broadcast in the crash line given above. > > Note that there seems to be a race condition here. > If the flush occurs after the following test at the start ipoib_mcast_join_task(): > if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) > return; > then there is no protection at all later for priv->broadcast being NULLed elsewhere. 
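
The locking pattern described in the quoted report (take the lock that the flush path holds, re-check priv->broadcast, and only then dereference it) can be sketched in user space roughly as follows. This is only an illustration and not the actual IPoIB fix: struct dev_priv, struct mcast_group, dev_flush() and dev_update_mtu() are made-up names, and a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the IPoIB structures; not the kernel types. */
struct mcast_group {
	int mtu;
};

struct dev_priv {
	pthread_mutex_t lock;          /* plays the role of priv->lock */
	struct mcast_group *broadcast; /* may be cleared by a concurrent flush */
	int mcast_mtu;
};

/* Flush path: clear the pointer while holding the lock. */
void dev_flush(struct dev_priv *priv)
{
	assert(priv != NULL);
	pthread_mutex_lock(&priv->lock);
	priv->broadcast = NULL;
	pthread_mutex_unlock(&priv->lock);
}

/*
 * Join path: take the same lock, re-check the pointer, and only then
 * dereference it. Returns 0 on success, -1 if the group is already gone.
 */
int dev_update_mtu(struct dev_priv *priv)
{
	int ret = -1;

	assert(priv != NULL);
	pthread_mutex_lock(&priv->lock);
	if (priv->broadcast != NULL) {
		priv->mcast_mtu = priv->broadcast->mtu;
		ret = 0;
	}
	pthread_mutex_unlock(&priv->lock);
	return ret;
}
```

With this shape, a flush that runs between an early "is the task still running" test and the MTU update makes dev_update_mtu() return -1 instead of dereferencing a NULL pointer.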
> > - Jack -- --Yossi From chu11 at llnl.gov Mon Jan 26 09:22:09 2009 From: chu11 at llnl.gov (Al Chu) Date: Mon, 26 Jan 2009 09:22:09 -0800 Subject: [ofa-general] Re: [PATCH] infiniband-diags: command line option processing framework In-Reply-To: <20090125115136.GA20419@sashak.voltaire.com> References: <20090125115136.GA20419@sashak.voltaire.com> Message-ID: <1232990529.7773.47.camel@auk31.llnl.gov> Hey Sasha, Just wondering, what tools are you targeting for these unified command line options? Since (doing a quick glance) it will break some options for some tools. for ibstat /usr/sbin/ibstat -l # list all IB devices I don't think there will be a lot of issues. Any changes will just have to be announced/documented in the release notes so people can update scripts. Al On Sun, 2009-01-25 at 13:51 +0200, Sasha Khapyorsky wrote: > The main motivation of this is to unify infiniband-diags command line > options and tools usage. Also it simplifies programming and can remove > a lot of duplications. The usage message is also unified over all tools > and looks like: > > Usage: ibaddr [options] [] > > Options: > --gid_show, -g show gid address only > --lid_show, -l show lid range only > --Lid_show, -L show lid range (in decimal) only > --Ca, -C Ca name to use > --Port, -P Ca port number to use > --Direct, -D use Direct address argument > --Guid, -G use GUID address argument > --timeout, -t timeout in ms > --sm_port, -s SM port lid > --errors, -e show send and receive errors > --verbose, -v increase verbosity level > --debug, -d raise debug level > --usage, -u usage message > --help, -h help message > --version, -V show version > > Examples: > ibaddr # local port's address > ibaddr 32 # show lid range and gid of lid 32 > ibaddr -G 0x8f1040023 # same but using guid address > ibaddr -l 32 # show lid range only > ibaddr -L 32 # show decimal lid range only > ibaddr -g 32 # show gid address only > > Custom (per tool) option processing is also supported. 
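
The patch quoted below derives the getopt() short-option string from the long-option table. A standalone sketch of that idea, assuming the usual <getopt.h> struct option layout, might look like this; build_short_opts() is a hypothetical name and is not the make_str_opts() from the patch.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/*
 * Build a getopt() short-option string (for example "C:vt:") from a
 * long-option table terminated by an all-zero entry. Each option
 * contributes its letter (o->val) followed by ':' when it takes a
 * required argument, or "::" when the argument is optional.
 * Returns the string length, or -1 if buf is too small.
 */
int build_short_opts(const struct option *o, char *buf, size_t size)
{
	size_t n = 0;

	assert(o != NULL && buf != NULL);
	if (size == 0)
		return -1;

	for (; o->name != NULL; o++) {
		int colons = o->has_arg; /* 0, 1 or 2 in <getopt.h> */

		/* need room for the letter, the colons and the final NUL */
		if (n + 1 + (size_t)colons + 1 > size)
			return -1;
		buf[n++] = (char)o->val;
		while (colons-- > 0)
			buf[n++] = ':';
	}
	buf[n] = '\0';
	return (int)n;
}
```

The resulting string can be passed as the optstring argument of getopt_long() together with the same table, so the short and long forms of an option cannot drift apart.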
> > Signed-off-by: Sasha Khapyorsky > --- > infiniband-diags/include/ibdiag_common.h | 28 +++- > infiniband-diags/src/ibdiag_common.c | 248 ++++++++++++++++++++++++++++++ > 2 files changed, 274 insertions(+), 2 deletions(-) > > diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h > index 0518579..4304826 100644 > --- a/infiniband-diags/include/ibdiag_common.h > +++ b/infiniband-diags/include/ibdiag_common.h > @@ -35,18 +35,42 @@ > #ifndef _IBDIAG_COMMON_H_ > #define _IBDIAG_COMMON_H_ > > +#include > + > extern int ibdebug; > +extern int ibverbose; > +extern char *ibd_ca; > +extern int ibd_ca_port; > +extern int ibd_dest_type; > +extern ib_portid_t *ibd_sm_id; > +extern int ibd_timeout; > > /*========================================================*/ > /* External interface */ > /*========================================================*/ > > #undef DEBUG > -#define DEBUG if (ibdebug || verbose) IBWARN > -#define VERBOSE if (ibdebug || verbose > 1) IBWARN > +#define DEBUG if (ibdebug || ibverbose) IBWARN > +#define VERBOSE if (ibdebug || ibverbose > 1) IBWARN > #define IBERROR(fmt, args...) 
iberror(__FUNCTION__, fmt, ## args) > > extern void iberror(const char *fn, char *msg, ...); > extern const char *get_build_version(void); > > +struct ibdiag_opt { > + const char *name; > + char letter; > + unsigned has_arg; > + const char *arg_tmpl; > + const char *description; > +}; > + > +extern int ibdiag_process_opts(int argc, char * const argv[], void *context, > + const char *exclude_common_str, > + const struct ibdiag_opt custom_opts[], > + int (*custom_handler)(void *cxt, int val, char *optarg), > + const char *usage_args, > + const char *usage_examples[]); > +extern void ibdiag_show_usage(); > + > #endif /* _IBDIAG_COMMON_H_ */ > diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c > index 3a9d5c2..8dcec5e 100644 > --- a/infiniband-diags/src/ibdiag_common.c > +++ b/infiniband-diags/src/ibdiag_common.c > @@ -46,11 +46,259 @@ > #include > #include > #include > +#include > > +#include > +#include > #include > #include > > int ibdebug; > +int ibverbose; > +char *ibd_ca; > +int ibd_ca_port; > +int ibd_dest_type = IB_DEST_LID; > +ib_portid_t *ibd_sm_id; > +int ibd_timeout; > + > +static ib_portid_t sm_portid = {0}; > + > +static const char *prog_name; > +static const char *prog_args; > +static const char **prog_examples; > +static struct option *long_opts; > +static const struct ibdiag_opt *opts_map[256]; > + > +static void pretty_print(int start, int width, const char *str) > +{ > + int len = width - start; > + const char *p, *e; > + > + while (1) { > + while(isspace(*str)) > + str++; > + p = str; > + do { > + e = p + 1; > + p = strchr(e, ' '); > + } while (p && p - str < len); > + if (!p) { > + fprintf(stderr, "%s", str); > + break; > + } > + if (e - str == 1) > + e = p; > + fprintf(stderr, "%.*s\n%*s", e - str, str, start, ""); > + str = e; > + } > +} > + > +void ibdiag_show_usage() > +{ > + struct option *o = long_opts; > + int n; > + > + fprintf(stderr, "\nUsage: %s [options] %s\n\n", prog_name, > + prog_args ? 
prog_args : ""); > + > + if (long_opts[0].name) > + fprintf(stderr, "Options:\n"); > + for (o = long_opts; o->name; o++) { > + const struct ibdiag_opt *io = opts_map[o->val]; > + n = fprintf(stderr, " --%s", io->name); > + if (isprint(io->letter)) > + n += fprintf(stderr, ", -%c", io->letter); > + if (io->has_arg) > + n += fprintf(stderr, " %s", > + io->arg_tmpl ? io->arg_tmpl : ""); > + if (io->description && *io->description) { > + n += fprintf(stderr, "%*s ", 24 - n > 0 ? 24 - n : 0, ""); > + pretty_print(n, 74, io->description); > + } > + fprintf(stderr, "\n"); > + } > + > + if (prog_examples) { > + const char **p; > + fprintf(stderr, "\nExamples:\n"); > + for (p = prog_examples; *p && **p; p++) > + fprintf(stderr, " %s %s\n", prog_name, *p); > + } > + > + fprintf(stderr, "\n"); > + > + exit(2); > +} > + > +static int process_opt(int ch, char *optarg) > +{ > + int val; > + > + switch (ch) { > + case 'h': > + case 'u': > + ibdiag_show_usage(); > + break; > + case 'V': > + fprintf(stderr, "%s %s\n", prog_name, get_build_version()); > + exit(2); > + case 'e': > + madrpc_show_errors(1); > + break; > + case 'v': > + ibverbose++; > + break; > + case 'd': > + ibdebug++; > + madrpc_show_errors(1); > + umad_debug(ibdebug - 1); > + break; > + case 'C': > + ibd_ca = optarg; > + break; > + case 'P': > + ibd_ca_port = strtoul(optarg, 0, 0); > + break; > + case 'D': > + ibd_dest_type = IB_DEST_DRPATH; > + break; > + case 'L': > + ibd_dest_type = IB_DEST_LID; > + break; > + case 'G': > + ibd_dest_type = IB_DEST_GUID; > + break; > + case 't': > + val = strtoul(optarg, 0, 0); > + madrpc_set_timeout(val); > + ibd_timeout = val; > + break; > + case 's': > + if (ib_resolve_portid_str(&sm_portid, optarg, IB_DEST_LID, 0) < 0) > + IBERROR("cannot resolve SM destination port %s", optarg); > + ibd_sm_id = &sm_portid; > + break; > + default: > + return -1; > + } > + > + return 0; > +} > + > +static const struct ibdiag_opt common_opts[] = { > + { "Ca", 'C', 1, "", "Ca name to use"}, > + 
{ "Port", 'P', 1, "", "Ca port number to use"}, > + { "Direct", 'D', 0, NULL, "use Direct address argument"}, > + { "Lid", 'L', 0, NULL, "use LID address argument"}, > + { "Guid", 'G', 0, NULL, "use GUID address argument"}, > + { "timeout", 't', 1, "", "timeout in ms"}, > + { "sm_port", 's', 1, "", "SM port lid" }, > + { "errors", 'e', 0, NULL, "show send and receive errors" }, > + { "verbose", 'v', 0, NULL, "increase verbosity level" }, > + { "debug", 'd', 0, NULL, "raise debug level" }, > + { "usage", 'u', 0, NULL, "usage message" }, > + { "help", 'h', 0, NULL, "help message" }, > + { "version", 'V', 0, NULL, "show version" }, > + {} > +}; > + > +static void make_opt(struct option *l, const struct ibdiag_opt *o, > + const struct ibdiag_opt *map[]) > +{ > + l->name = o->name; > + l->has_arg = o->has_arg; > + l->flag = NULL; > + l->val = o->letter; > + if (!map[l->val]) > + map[l->val] = o; > +} > + > +static struct option *make_long_opts(const char *exclude_str, > + const struct ibdiag_opt *custom_opts, > + const struct ibdiag_opt *map[]) > +{ > + struct option *long_opts, *l; > + const struct ibdiag_opt *o; > + unsigned n = 0; > + > + if (custom_opts) > + for (o = custom_opts; o->name; o++) > + n++; > + > + long_opts = malloc((sizeof(common_opts)/sizeof(common_opts[0]) + n) * > + sizeof(*long_opts)); > + if (!long_opts) > + return NULL; > + > + l = long_opts; > + > + if (custom_opts) > + for (o = custom_opts; o->name; o++) > + make_opt(l++, o, map); > + > + for (o = common_opts; o->name; o++) { > + if (exclude_str && strchr(exclude_str, o->letter)) > + continue; > + make_opt(l++, o, map); > + } > + > + memset(l, 0, sizeof(*l)); > + > + return long_opts; > +} > + > +static void make_str_opts(const struct option *o, char *p, unsigned size) > +{ > + int i, n = 0; > + > + for (n = 0; o->name && n + 2 + o->has_arg < size; o++) { > + p[n++] = o->val; > + for (i = 0; i < o->has_arg; i++) > + p[n++] = ':'; > + } > + p[n] = '\0'; > +} > + > +int ibdiag_process_opts(int 
argc, char * const argv[], void *cxt, > + const char *exclude_common_str, > + const struct ibdiag_opt custom_opts[], > + int (*custom_handler)(void *cxt, int val, char *optarg), > + const char *usage_args, const char *usage_examples[]) > +{ > + char str_opts[1024]; > + const struct ibdiag_opt *o; > + > + memset(opts_map, 0, sizeof(opts_map)); > + > + prog_name = argv[0]; > + prog_args = usage_args; > + prog_examples = usage_examples; > + > + long_opts = make_long_opts(exclude_common_str, custom_opts, opts_map); > + if (!long_opts) > + return -1; > + > + make_str_opts(long_opts, str_opts, sizeof(str_opts)); > + > + while (1) { > + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); > + if ( ch == -1 ) > + break; > + o = opts_map[ch]; > + if (!o) > + ibdiag_show_usage(); > + if (custom_handler) { > + if (custom_handler(cxt, ch, optarg) && > + process_opt(ch, optarg)) > + ibdiag_show_usage(); > + } else if (process_opt(ch, optarg)) > + ibdiag_show_usage(); > + } > + > + free(long_opts); > + > + return 0; > +} > > extern char *argv0; > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From weiny2 at llnl.gov Mon Jan 26 09:34:49 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 26 Jan 2009 09:34:49 -0800 Subject: [ofa-general] Re: [PATCH] infiniband-diags/smpquery: usage improvement In-Reply-To: <20090126091419.GA5814@sashak.voltaire.com> References: <20090125115136.GA20419@sashak.voltaire.com> <20090125115249.GB20419@sashak.voltaire.com> <20090126091419.GA5814@sashak.voltaire.com> Message-ID: <20090126093449.3caaef53.weiny2@llnl.gov> This will break some of the internal scripts we have here which use "nodeinfo". I don't have an objection to all this option changing, but we need to make it clear that many options have changed throughout all the diags. 
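
For reference, the matching rule in the quoted smpquery patch (first table entry whose name or alias starts with the user's string, compared case-insensitively) can be sketched as below. The names and aliases are copied from the patch; struct op_name and match_op_index() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp() */

/* Operation names and aliases as listed in the quoted patch. */
struct op_name {
	const char *name;
	const char *alias;
};

static const struct op_name op_names[] = {
	{ "NodeInfo", "NI" },
	{ "NodeDesc", "ND" },
	{ "PortInfo", "PI" },
	{ "SwitchInfo", "SI" },
	{ "PKeyTable", "PKeys" },
	{ "SL2VLTable", "SL2VL" },
	{ "VLArbitration", "VLArb" },
	{ "GUIDInfo", "GI" },
	{ NULL, NULL }
};

/*
 * Return the index of the first entry whose name or alias begins with
 * `input` (case-insensitive), or -1 if nothing matches.
 */
int match_op_index(const char *input)
{
	size_t len;
	int i;

	assert(input != NULL);
	len = strlen(input);
	for (i = 0; op_names[i].name != NULL; i++)
		if (!strncasecmp(op_names[i].name, input, len) ||
		    !strncasecmp(op_names[i].alias, input, len))
			return i;
	return -1;
}
```

Under this reading of the patch, "nodeinfo", "pkeys", "sl2vl" and "vlarb" still resolve, while the old "guids" spelling would not, since it is a prefix of neither "GUIDInfo" nor "GI".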
Ira On Mon, 26 Jan 2009 11:14:19 +0200 Sasha Khapyorsky wrote: > > This makes usage of smpquery operations more user friendly - similar to > saquery each operation now has a shorter alias, string matching is case > insensitive and abbreviations are allowed for both operation name and > alias. And it is how this looks: > > Usage: smpquery [options] [op params] > > Supported operations (and aliases, case insensitive): > NodeInfo (NI) > NodeDesc (ND) > PortInfo (PI) [] > SwitchInfo (SI) > PKeyTable (PKeys) [] > SL2VLTable (SL2VL) [] > VLArbitration (VLArb) [] > GUIDInfo (GI) > > Signed-off-by: Sasha Khapyorsky > --- > infiniband-diags/src/smpquery.c | 32 +++++++++++++++++--------------- > 1 files changed, 17 insertions(+), 15 deletions(-) > > diff --git a/infiniband-diags/src/smpquery.c b/infiniband-diags/src/smpquery.c > index 7dcf888..44280e1 100644 > --- a/infiniband-diags/src/smpquery.c > +++ b/infiniband-diags/src/smpquery.c > @@ -54,7 +54,7 @@ > typedef char *(op_fn_t)(ib_portid_t *dest, char **argv, int argc); > > typedef struct match_rec { > - char *name; > + const char *name, *alias; > op_fn_t *fn; > unsigned opt_portnum; > } match_rec_t; > @@ -63,14 +63,14 @@ static op_fn_t node_desc, node_info, port_info, switch_info, pkey_table, > sl2vl_table, vlarb_table, guid_info; > > static const match_rec_t match_tbl[] = { > - { "nodeinfo", node_info }, > - { "nodedesc", node_desc }, > - { "portinfo", port_info, 1 }, > - { "switchinfo", switch_info }, > - { "pkeys", pkey_table, 1 }, > - { "sl2vl", sl2vl_table, 1 }, > - { "vlarb", vlarb_table, 1 }, > - { "guids", guid_info }, > + { "NodeInfo", "NI", node_info }, > + { "NodeDesc", "ND", node_desc }, > + { "PortInfo", "PI", port_info, 1 }, > + { "SwitchInfo", "SI", switch_info }, > + { "PKeyTable", "PKeys", pkey_table, 1 }, > + { "SL2VLTable", "SL2VL", sl2vl_table, 1 }, > + { "VLArbitration", "VLArb", vlarb_table, 1 }, > + { "GUIDInfo", "GI", guid_info }, > {0} > }; > > @@ -373,14 +373,15 @@ guid_info(ib_portid_t *dest, 
char **argv, int argc) > return 0; > } > > -static op_fn_t * > -match_op(char *name) > +static op_fn_t *match_op(char *name) > { > const match_rec_t *r; > + unsigned len = strlen(name); > for (r = match_tbl; r->name; r++) > - if (!strcmp(r->name, name)) > + if (!strncasecmp(r->name, name, len) || > + (r->alias && !strncasecmp(r->alias, name, len))) > return r->fn; > - return 0; > + return NULL; > } > > static int process_opt(void *context, int ch, char *optarg) > @@ -422,10 +423,11 @@ int main(int argc, char **argv) > }; > > n = sprintf(usage_args, " [op params]\n" > - "\nSupported ops:\n"); > + "\nSupported ops (and aliases, case insensitive):\n"); > for (r = match_tbl ; r->name ; r++) { > n += snprintf(usage_args + n, sizeof(usage_args) - n, > - " %s %s\n", r->name, > + " %s (%s) %s\n", r->name, > + r->alias ? r->alias : "", > r->opt_portnum ? " []" : ""); > if (n >= sizeof(usage_args)) > exit(-1); > -- > 1.6.0.4.766.g6fc4a > From swise at opengridcomputing.com Mon Jan 26 09:43:14 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 Jan 2009 11:43:14 -0600 Subject: [ofa-general] potential device removal deadlock Message-ID: <497DF632.7060609@opengridcomputing.com> Hey Roland/Sean, I'm looking at the rdma_[u]cm modules and how they generate DEVICE_REMOVAL events to user applications, and I see a potential deadlock. ib_unregister_device() calls the ib_client remove() functions in the reverse order from which the ib_clients were registered. And if you look at ib_uverbs_remove_one(), you'll see it will block until all references from user apps are released. So if ib_uverbs remove() gets called _before_ the rdma_cm remove() function, then the unregister process will deadlock since applications don't get notification of the device removal. Am I missing something, or is this a bug? I would think ib_uverbs should actually blow away the kernel parts of the user's handles allowing the device to be removed. 
Then the user app will discover things went south on the next down call into the uverbs code -or- by the DEVICE_REMOVAL rdma-cm event. I'm thinking about all this in the context of EEH handling for cxgb3. Thoughts? Thanks, Steve. From sashak at voltaire.com Mon Jan 26 09:51:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jan 2009 19:51:33 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/smpquery: usage improvement In-Reply-To: <20090126093449.3caaef53.weiny2@llnl.gov> References: <20090125115136.GA20419@sashak.voltaire.com> <20090125115249.GB20419@sashak.voltaire.com> <20090126091419.GA5814@sashak.voltaire.com> <20090126093449.3caaef53.weiny2@llnl.gov> Message-ID: <20090126175126.GJ5814@sashak.voltaire.com> On 09:34 Mon 26 Jan , Ira Weiny wrote: > This will break some of the internal scripts we have here which use "nodeinfo". I don't think so - smpquery commands are case insensitive and 'smpquery nodeinfo' will work exactly as 'smpquery NodeInfo', etc. > I don't have an objection to all this option changing, but we need to make it > clear that many options have changed throughout all the diags. What is changed? One of my goal was to preserve backward compatibility, the only thing was really changed for many tools is usage message (--help) format. Sasha From sashak at voltaire.com Mon Jan 26 09:57:53 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jan 2009 19:57:53 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags: command line option processing framework In-Reply-To: <1232990529.7773.47.camel@auk31.llnl.gov> References: <20090125115136.GA20419@sashak.voltaire.com> <1232990529.7773.47.camel@auk31.llnl.gov> Message-ID: <20090126175753.GK5814@sashak.voltaire.com> Hi Al, On 09:22 Mon 26 Jan , Al Chu wrote: > > Just wondering, what tools are you targeting for these unified command > line options? All tools in infiniband-diags (see the next posted patch). 
> Since (doing a quick glance) it will break some options > for some tools. > > for ibstat > > /usr/sbin/ibstat -l # list all IB devices '-l' is not a common option and for ibstat it will work as usual. Even when some tool redefines one of common options it can handle this in custom option processor and "to ban" its processing. > > I don't think there will be a lot of issues. Any changes will just have > to be announced/documented in the release notes so people can update > scripts. My idea was to not change an actual usage at all. And as far as I can see backward compatibility is preserved. Sasha From rdreier at cisco.com Mon Jan 26 09:59:26 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 Jan 2009 09:59:26 -0800 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: <20090125094506.GA19444@mtls03> (Eli Cohen's message of "Sun, 25 Jan 2009 11:45:06 +0200") References: <20090125094506.GA19444@mtls03> Message-ID: > add "address" field to struct ib_umem so low level drivers will have this > information which may be needed in order to correctly calculate the number of > huge pages. seems like a really strange thing to do: + umem->address = addr; this value addr is coming from the low-level driver, so I'm not clear why we need to stick it into umem? What am I missing? - R. From rdreier at cisco.com Mon Jan 26 10:07:55 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 Jan 2009 10:07:55 -0800 Subject: [ofa-general] Re: [PATCH v2] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <20090125094546.GA19466@mtls03> (Eli Cohen's message of "Sun, 25 Jan 2009 11:45:46 +0200") References: <20090125094546.GA19466@mtls03> Message-ID: > + n = PAGE_ALIGN(umem->length + (umem->address & ~HPAGE_MASK)) >> HPAGE_SHIFT; This is still wrong I think. What if the user, say, registers 1MB that is aligned exactly at a 2MB huge page start? Then wouldn't this expression set n to 0? 
I think that PAGE_ALIGN() needs to be a "HPAGE_ALIGN()", although of course that doesn't actually exist, so we would need to do "ALIGN(..., HPAGE_SIZE)" instead... but given that we're just going to do ">> HPAGE_SHIFT" afterwards, maybe "DIV_ROUND_UP(..., HPAGE_SIZE)" is a better way to write it. It maybe is needed to make sure that the virtual address requested is aligned appropriately for using bigger pages (although this should always be the case in the current code I guess, since the userspace virtual address always matches the HCA MR virtual address). - R. From rdreier at cisco.com Mon Jan 26 10:13:09 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 Jan 2009 10:13:09 -0800 Subject: [ofa-general] Re: potential device removal deadlock In-Reply-To: <497DF632.7060609@opengridcomputing.com> (Steve Wise's message of "Mon, 26 Jan 2009 11:43:14 -0600") References: <497DF632.7060609@opengridcomputing.com> Message-ID: > I'm looking at the rdma_[u]cm modules and how they generate > DEVICE_REMOVAL events to user applications, and I see a potential > deadlock. ib_unregister_device() calls the ib_client remove() > functions in the reverse order from which the ib_clients were > registered. And if you look at ib_uverbs_remove_one(), you'll see it > will block until all references from user apps are released. So if > ib_uverbs remove() gets called _before_ the rdma_cm remove() function, > then the unregister process will deadlock since applications don't get > notification of the device removal. > > Am I missing something, or is this a bug? Yes, looks that way. Making sure that rdma_cm is loaded after ib_uverbs works around it. > I would think ib_uverbs should actually blow away the kernel parts of > the user's handles allowing the device to be removed. Then the user > app will discover things went south on the next down call into the > uverbs code -or- by the DEVICE_REMOVAL rdma-cm event. 
Yes, but that's not that easy (e.g. need to shoot down mappings of PCI memory into all userspace processes, etc)... we punted on it when adding device removal support to uverbs. - R. From eli at dev.mellanox.co.il Mon Jan 26 10:28:40 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 26 Jan 2009 20:28:40 +0200 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: References: <20090125094506.GA19444@mtls03> Message-ID: <20090126182840.GA22015@mtls03> On Mon, Jan 26, 2009 at 09:59:26AM -0800, Roland Dreier wrote: > > seems like a really strange thing to do: > > + umem->address = addr; > > this value addr is coming from the low-level driver, so I'm not clear > why we need to stick it into umem? What am I missing? > It has to be saved either at the low level driver's mr object, e.g. struct mlx4_ib_mr, or at a common place like struct ib_umem. Do you prefer that it will be saved in struct mlx4_ib_mr? From rdreier at cisco.com Mon Jan 26 10:26:25 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 26 Jan 2009 10:26:25 -0800 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: <20090126182840.GA22015@mtls03> (Eli Cohen's message of "Mon, 26 Jan 2009 20:28:40 +0200") References: <20090125094506.GA19444@mtls03> <20090126182840.GA22015@mtls03> Message-ID: > It has to be saved either at the low level driver's mr object, > e.g. struct mlx4_ib_mr, or at a common place like struct ib_umem. Do > you prefer that it will be saved in struct mlx4_ib_mr? I don't see why it has to be saved anywhere? The only place you use umem->address is in handle_hugetlb_usermr(), and you could just as easily pass in start directly as a parameter (since mlx4_ib_reg_user_mr() has that value in a parameter anyway). - R.
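The two review points in this thread (pass the start address in as a plain parameter, and round the huge page count up rather than using PAGE_ALIGN) can be checked with a small userspace sketch. This is only an illustration, not the mlx4_ib code: the macros below are stand-ins for the kernel ones, the 4 KiB page and 2 MiB huge page sizes are assumptions, and the two function names are hypothetical.

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel macros. The sizes are assumptions
 * (4 KiB base pages, 2 MiB huge pages), not taken from the patch. */
#define PAGE_SIZE   (UINT64_C(1) << 12)
#define HPAGE_SHIFT 21
#define HPAGE_SIZE  (UINT64_C(1) << HPAGE_SHIFT)
#define HPAGE_MASK  (~(HPAGE_SIZE - 1))
#define PAGE_ALIGN(x)      (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: huge page count using the expression from the
 * posted patch, with the start address passed in directly. */
static uint64_t hpages_page_align(uint64_t start, uint64_t length)
{
	return PAGE_ALIGN(length + (start & ~HPAGE_MASK)) >> HPAGE_SHIFT;
}

/* Hypothetical helper: the same count using the DIV_ROUND_UP form
 * suggested in review, so a partially used huge page is still counted. */
static uint64_t hpages_round_up(uint64_t start, uint64_t length)
{
	return DIV_ROUND_UP(length + (start & ~HPAGE_MASK), HPAGE_SIZE);
}
```

Under these assumptions, a 1 MiB registration that starts exactly on a 2 MiB boundary gives 0 from hpages_page_align() and 1 from hpages_round_up(), which is the case raised in the review of the mlx4_ib patch.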
From sean.hefty at intel.com Mon Jan 26 10:30:01 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 26 Jan 2009 10:30:01 -0800 Subject: [ofa-general] RE: potential device removal deadlock In-Reply-To: <497DF632.7060609@opengridcomputing.com> References: <497DF632.7060609@opengridcomputing.com> Message-ID: <0C3689DD2D9D414C800680CB7FD8C09B@amr.corp.intel.com> >I'm looking at the rdma_[u]cm modules and how they generate >DEVICE_REMOVAL events to user applications, and I see a potential >deadlock. ib_unregister_device() calls the ib_client remove() functions >in the reverse order from which the ib_clients were registered. And if >you look at ib_uverbs_remove_one(), you'll see it will block until all >references from user apps are released. So if ib_uverbs remove() gets >called _before_ the rdma_cm remove() function, then the unregister >process will deadlock since applications don't get notification of the >device removal. You want the remove device functions called in the reverse order of registration. >I would think ib_uverbs should actually blow away the kernel parts of >the user's handles allowing the device to be removed. Then the user app >will discover things went south on the next down call into the uverbs >code -or- by the DEVICE_REMOVAL rdma-cm event. The ib_ucm and rdma_ucm should also blow away any kernel parts of user handles. - Sean From swise at opengridcomputing.com Mon Jan 26 10:58:57 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 Jan 2009 12:58:57 -0600 Subject: [ofa-general] Re: potential device removal deadlock In-Reply-To: <0C3689DD2D9D414C800680CB7FD8C09B@amr.corp.intel.com> References: <497DF632.7060609@opengridcomputing.com> <0C3689DD2D9D414C800680CB7FD8C09B@amr.corp.intel.com> Message-ID: <497E07F1.4040800@opengridcomputing.com> Sean Hefty wrote: >> I'm looking at the rdma_[u]cm modules and how they generate >> DEVICE_REMOVAL events to user applications, and I see a potential >> deadlock. 
ib_unregister_device() calls the ib_client remove() functions >> in the reverse order from which the ib_clients were registered. And if >> you look at ib_uverbs_remove_one(), you'll see it will block until all >> references from user apps are released. So if ib_uverbs remove() gets >> called _before_ the rdma_cm remove() function, then the unregister >> process will deadlock since applications don't get notification of the >> device removal. >> > > You want the remove device functions called in the reverse order of > registration. > > Not quite. You want rdma_cm to be notified of the device removal _before_ ib_uverbs... From swise at opengridcomputing.com Mon Jan 26 11:08:35 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 26 Jan 2009 13:08:35 -0600 Subject: [ofa-general] Re: potential device removal deadlock In-Reply-To: References: <497DF632.7060609@opengridcomputing.com> Message-ID: <497E0A33.2050308@opengridcomputing.com> Roland Dreier wrote: > > I'm looking at the rdma_[u]cm modules and how they generate > > DEVICE_REMOVAL events to user applications, and I see a potential > > deadlock. ib_unregister_device() calls the ib_client remove() > > functions in the reverse order from which the ib_clients were > > registered. And if you look at ib_uverbs_remove_one(), you'll see it > > will block until all references from user apps are released. So if > > ib_uverbs remove() gets called _before_ the rdma_cm remove() function, > > then the unregister process will deadlock since applications don't get > > notification of the device removal. > > > > Am I missing something, or is this a bug? > > Yes, looks that way. Making sure that rdma_cm is loaded after ib_uverbs > works around it. > How could we fix this in the kernel? Perhaps ib_uverbs should post an async error analgous to RDMA_CM_EVENT_DEVICE_REMOVAL? Maybe IB_EVENT_DEVICE_FATAL? In the case of EEH support of iw_cxgb3, I guess the driver could post this event. 
That would at least kick all the user apps... > > I would think ib_uverbs should actually blow away the kernel parts of > > the user's handles allowing the device to be removed. Then the user > > app will discover things went south on the next down call into the > > uverbs code -or- by the DEVICE_REMOVAL rdma-cm event. > > Yes, but that's not that easy (eg need to shoot down mappings of PCI > memory into all userspace processes, etc)... we punted on it when adding > device removal support to uverbs. > > This makes EEH support pretty painful. Stevo From tziporet at mellanox.co.il Mon Jan 26 12:33:03 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 26 Jan 2009 22:33:03 +0200 Subject: [ofa-general] OFED (EWG) meeting minutes forJan 26, 2009 In-Reply-To: <497D8210.5040203@mellanox.co.il> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0FE807@mtlexch01.mtl.com> These are the OFED (EWG) meeting minutes for Jan 26 on OFED future plans Meeting Summary: ============== 1. We plan 1.4.1 point release. Tentative schedule - end of March 2. OFED 1.5 will be based over kernel 2.6.30 3. OFED 1.5 schedule will be end of July/beginning of Aug 4. We are moving to the new OFA server starting with git and build this week - Jeff B. & Vlad Details: ====== 1. 1.4.1 release There is a consensus that we should have 1.4.1 release to support new OSES. There was an agreement to include Open MPI 1.3. release Based on this main changes from 1.4 to 1.4.1 should be: 1. New OSes: RH 5.3 and SLES 11 2. Open MPI 1.3 (if 1.3.1 will be out in the time frame we will take it) 3. RDS with iWARP support 4. NFS/RDMA backports - at least to RH 5.2/3 and SLES 10 SP2 5. 
Critical bug fixes

Tentative Schedule:
RC1 at end of Feb
GA release at end of March

AIs:
Tziporet: Find out when SLES 11 is planned for GA to see if 1.4.1 can include it
All: Check with their QA on the possibility of doing QA in the suggested time frame

Status of changes:
RH 5.3 - backports should be ready tomorrow
RDS over iWARP - Steve is working with Andy from Oracle. They are in the bug fixing phase. Should be ready to run Oracle 11 in 2 weeks
NFS/RDMA - backport work is going slowly. Already got to 2.6.25. Hope they can have RH backports running in a few weeks, with SLES 10 SP2 following.
Open MPI 1.3 - GA is ready, need to fix something in the install for OFED 1.4 - Jeff please send a patch for this
SLES 11 - we will prepare backports now based on the RC2 code - should be ready next week.

2. OFED 1.5
- Kernel base: Since we are doing 1.4.1, 1.5 is going to be pushed back in schedule, thus we should use kernel 2.6.30 as the kernel base.
- Schedule: release is planned for end of July / beginning of August
- Tziporet will publish the detailed plan after we have a final date for the 1.4.1 release
- Features: Bill Boas wanted to see RDMAoEth in the release. The features list will be discussed in the next meeting

3. Move to the new OFA server
- We will start by moving git and the build system. Jeff B will work with Vlad on this. Should be done this week and completed next week.
- Mail server, mailing lists and bugzilla can be moved later in the background

4. Bill Boas reported he will start joining all EWG meetings as part of his new role as OFA director. Bill presented his view on important subjects that should be covered in Sonoma, and stated that OFA should take care of more subjects like virtualization, Low latency Ethernet, FCoE and more.
Tziporet From hartlch14 at gmail.com Mon Jan 26 12:44:32 2009 From: hartlch14 at gmail.com (Chuck Hartley) Date: Mon, 26 Jan 2009 15:44:32 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Unable to get IPoIB working In-Reply-To: References: Message-ID: Forgot to do reply all... Well, changing the MTU and using datagram mode did not fix things. I suspect that the problem lies in the difference between what ifconfig shows and what the ip command shows. The fact that ip says "NO-CARRIER" and "state DOWN" seems to indicate that the interface is not completely functional in the IPoIB world. I don't understand why ip thinks there is no carrier, yet the status lights on the HCAs and switch are ok and everyone seems to be communicating ok on the InfiniBand side of things. I did some further experimentation and hooked up the other InfiniHost HCA to our existing network. I see the same symptoms there, i.e. "ip show device" shows no carrier and the interface state is down. Does this indicate that there is some problem with IPoIB on Fedora9? We are running kernel 2.6.26.6-79.fc9.x86_64. Is there some IPoIB debug I can turn on somehow? From boris at mellanox.com Mon Jan 26 12:48:33 2009 From: boris at mellanox.com (Boris Shpolyansky) Date: Mon, 26 Jan 2009 12:48:33 -0800 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Unable to get IPoIB working References: Message-ID: <1E3DCD1C63492545881FACB6063A57C1035D9347@mtiexch01> Please, make sure to have active subnet manager on your InfiniBand fabric - it is essential for IPoIB operation. Boris Shpolyansky Sr. Member of Technical Staff, Applications Mellanox Technologies Inc. 
350 Oakmead Parkway Sunnyvale, CA 94085 Tel.: (408) 916 0014 Fax: (408) 585 0314 Cell: (408) 834 9365 www.mellanox.com From hal.rosenstock at gmail.com Mon Jan 26 12:49:12 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 26 Jan 2009 15:49:12 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Unable to get IPoIB working In-Reply-To: References: Message-ID: On Mon, Jan 26, 2009 at 3:44 PM, Chuck Hartley wrote: > Forgot to do reply all... > > Well, changing the MTU and using datagram mode did not fix things.
> > I suspect that the problem lies in the difference between what > ifconfig shows and what the ip command shows. The fact that ip says > "NO-CARRIER" and "state DOWN" seems to indicate that the interface is > not completely functional in the IPoIB world. > I don't understand why ip thinks there is no carrier, yet the status > lights on the HCAs and > switch are ok and everyone seems to be communicating ok on the > InfiniBand side of things. > > I did some further experimentation and hooked up the other InfiniHost > HCA to our existing network. I see the same symptoms there, i.e. "ip > show device" shows no carrier and the interface state is down. Do your CA ports show active for port state ? Is an SM running in your subnet ? -- Hal > Does this indicate that there is some problem with IPoIB on Fedora9? > We are running kernel 2.6.26.6-79.fc9.x86_64. > > Is there some IPoIB debug I can turn on somehow? > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hartlch14 at gmail.com Mon Jan 26 12:53:13 2009 From: hartlch14 at gmail.com (Chuck Hartley) Date: Mon, 26 Jan 2009 15:53:13 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ***SPAM*** Unable to get IPoIB working In-Reply-To: References: Message-ID: On Mon, Jan 26, 2009 at 3:49 PM, Hal Rosenstock wrote: > > Do your CA ports show active for port state ? > > Is an SM running in your subnet ? > > -- Hal > Yes, I am running the OFED SM. 
And the CA ports show active: # ibstat CA 'mlx4_0' CA type: MT26428 Number of ports: 2 Firmware version: 2.6.0 Hardware version: a0 Node GUID: 0x0002c90300032de0 System image GUID: 0x0002c90300032de3 Port 1: State: Active Physical state: LinkUp Rate: 40 Base lid: 4 LMC: 0 SM lid: 4 Capability mask: 0x0251086a Port GUID: 0x0002c90300032de1 Port 2: State: Down Physical state: Polling Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0002c90300032de2 From hal.rosenstock at gmail.com Mon Jan 26 12:55:45 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 26 Jan 2009 15:55:45 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ***SPAM*** Unable to get IPoIB working In-Reply-To: References: Message-ID: On Mon, Jan 26, 2009 at 3:53 PM, Chuck Hartley wrote: > On Mon, Jan 26, 2009 at 3:49 PM, Hal Rosenstock > wrote: > >> >> Do your CA ports show active for port state ? >> >> Is an SM running in your subnet ? >> >> -- Hal >> > > Yes, I am running the OFED SM. And the CA ports show active: Any error messages in the OpenSM log ? > # ibstat > CA 'mlx4_0' > CA type: MT26428 > Number of ports: 2 > Firmware version: 2.6.0 > Hardware version: a0 > Node GUID: 0x0002c90300032de0 > System image GUID: 0x0002c90300032de3 > Port 1: > State: Active > Physical state: LinkUp > Rate: 40 > Base lid: 4 > LMC: 0 > SM lid: 4 > Capability mask: 0x0251086a > Port GUID: 0x0002c90300032de1 > Port 2: > State: Down > Physical state: Polling > Rate: 10 > Base lid: 0 > LMC: 0 > SM lid: 0 > Capability mask: 0x02510868 > Port GUID: 0x0002c90300032de2 > From adit.262 at gmail.com Mon Jan 26 13:08:35 2009 From: adit.262 at gmail.com (Adit Ranadive) Date: Mon, 26 Jan 2009 16:08:35 -0500 Subject: [ofa-general] ***SPAM*** Byte_Cnt field in the MTHCA_CQE structure Message-ID: Hello, I have been looking at doing some low level work with the OFED library 1.1 in terms of figuring out how many bytes have been sent by an IB application. 
I came across the byte_cnt field inside the MTHCA_CQE structure and was wondering if this indicates the number of bytes that are sent/received? From the cq.c file in the libmthca/src folder it looks like the value is copied from the CQE structure to the WQE struct in the case of RDMA reads but not RDMA writes. Is there a reason for this? Has this been implemented in newer versions of OFED? I'm currently working with OFED-1.1. When using the performance tests inside OFED the value of the byte_cnt field is set to the buffer size in case of RDMA read but to a very large value >> buffer size in case of RDMA Write. I'm using a Mellanox HCA MT25208 Infinihost III Ex Nic with Firmware 4.8.200. Any thoughts on this would be greatly appreciated. Thanks, Adit Adit Ranadive PhD CS Student Georgia Institute of Technology, Atlanta, GA From hartlch14 at gmail.com Mon Jan 26 13:10:07 2009 From: hartlch14 at gmail.com (Chuck Hartley) Date: Mon, 26 Jan 2009 16:10:07 -0500 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ***SPAM*** Unable to get IPoIB working In-Reply-To: References: Message-ID: On Mon, Jan 26, 2009 at 3:55 PM, Hal Rosenstock wrote: > Any error messages in the OpenSM log ? > No, I restarted the SM on both hosts and it came up cleanly in both cases. From sashak at voltaire.com Mon Jan 26 13:26:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 26 Jan 2009 23:26:21 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/ibsysstat: use RMPP for client/server communication Message-ID: <20090126212621.GL5814@sashak.voltaire.com> This patch adds support for bigger than (256 - vendor2 data offset) data sending by ibsysstat server using RMPP. It fixes bug#1237 - where server output was truncated due to MAD size limitation.
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/ibsysstat.c | 128 +++++++++++++++++++++++++++++-------- 1 files changed, 100 insertions(+), 28 deletions(-) diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c index 5a39455..c20a6f0 100644 --- a/infiniband-diags/src/ibsysstat.c +++ b/infiniband-diags/src/ibsysstat.c @@ -62,11 +62,57 @@ typedef struct cpu_info { static cpu_info cpus[MAX_CPUS]; static int host_ncpu; -static void -mk_reply(int attr, void *data, int sz) +static int server_respond(void *umad, int size) +{ + ib_rpc_t rpc = {0}; + ib_rmpp_hdr_t rmpp = {0}; + ib_portid_t rport; + uint8_t *mad = umad_get_mad(umad); + ib_mad_addr_t *mad_addr; + + if (!(mad_addr = umad_get_mad_addr(umad))) + return -1; + + memset(&rport, 0, sizeof(rport)); + + rport.lid = ntohs(mad_addr->lid); + rport.qp = ntohl(mad_addr->qpn); + rport.qkey = ntohl(mad_addr->qkey); + rport.sl = mad_addr->sl; + if (!rport.qkey && rport.qp == 1) + rport.qkey = IB_DEFAULT_QP1_QKEY; + + rpc.mgtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); + rpc.method = IB_MAD_METHOD_GET | IB_MAD_RESPONSE; + rpc.attr.id = mad_get_field(mad, 0, IB_MAD_ATTRID_F); + rpc.attr.mod = mad_get_field(mad, 0, IB_MAD_ATTRMOD_F); + rpc.oui = mad_get_field(mad, 0, IB_VEND2_OUI_F); + rpc.trid = mad_get_field64(mad, 0, IB_MAD_TRID_F); + + rmpp.flags = IB_RMPP_FLAG_ACTIVE; + + DEBUG("responding %d bytes to %s, attr 0x%x mod 0x%x qkey %x", + size, portid2str(&rport), rpc.attr.id, rpc.attr.mod, rport.qkey); + + if (mad_build_pkt(umad, &rpc, &rport, &rmpp, 0) < 0) + return -1; + + if (ibdebug > 1) + xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE); + + if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad, + size, rpc.timeout, 0) < 0) { + DEBUG("send failed; %m"); + return -1; + } + + return 0; +} + +static int mk_reply(int attr, void *data, int sz) { char *s = data; - int n, i; + int n, i, ret = 0; switch (attr) { case IB_PING_ATTR: @@ -75,43 +121,54 @@ mk_reply(int attr, 
void *data, int sz) if (gethostname(s, sz) < 0) snprintf(s, sz, "?hostname?"); s[sz-1] = 0; - if ((n = strlen(s)) >= sz) + if ((n = strlen(s)) >= sz - 1) { + ret = sz; break; + } s[n] = '.'; s += n+1; sz -= n+1; + ret += n + 1; if (getdomainname(s, sz) < 0) snprintf(s, sz, "?domainname?"); - if (strlen(s) == 0) + if ((n = strlen(s)) == 0) s[-1] = 0; /* no domain */ + else + ret += n; break; case IB_CPUINFO_ATTR: + s[0] = '\0'; for (i = 0; i < host_ncpu && sz > 0; i++) { n = snprintf(s, sz, "cpu %d: model %s MHZ %s\n", i, cpus[i].model, cpus[i].mhz); if (n >= sz) { IBWARN("cpuinfo truncated"); + ret = sz; break; } sz -= n; s += n; + ret += n; } + ret++; break; default: DEBUG("unknown attr %d", attr); } + return ret; } -static char * -ibsystat_serv(void) +static uint8_t buf[2048]; + +static char *ibsystat_serv(void) { void *umad; void *mad; - int attr, mod; + int attr, mod, size; DEBUG("starting to serve..."); - while ((umad = mad_receive(0, -1))) { + while ((umad = mad_receive(buf, -1))) { mad = umad_get_mad(umad); @@ -120,12 +177,11 @@ ibsystat_serv(void) DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod); - mk_reply(attr, (char *)mad + IB_VENDOR_RANGE2_DATA_OFFS, IB_VENDOR_RANGE2_DATA_SIZE); + size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS, + sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS); - if (mad_respond(umad, 0, 0) < 0) + if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0) DEBUG("respond failed"); - - mad_free(umad); } DEBUG("server out"); @@ -144,24 +200,40 @@ match_attr(char *str) return -1; } -static char * -ibsystat(ib_portid_t *portid, int attr) +static char *ibsystat(ib_portid_t *portid, int attr) { - char data[IB_VENDOR_RANGE2_DATA_SIZE] = {0}; - ib_vendor_call_t call; + ib_rpc_t rpc = { 0 }; + int fd, agent, timeout, len; + void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS; DEBUG("Sysstat ping.."); - call.method = IB_MAD_METHOD_GET; - call.mgmt_class = IB_VENDOR_OPENIB_SYSSTAT_CLASS; - call.attrid = attr; - 
call.mod = 0; - call.oui = IB_OPENIB_OUI; - call.timeout = 0; - memset(&call.rmpp, 0, sizeof call.rmpp); + rpc.mgtclass = IB_VENDOR_OPENIB_SYSSTAT_CLASS; + rpc.method = IB_MAD_METHOD_GET; + rpc.attr.id = attr; + rpc.attr.mod = 0; + rpc.oui = IB_OPENIB_OUI; + rpc.timeout = 0; + rpc.datasz = IB_VENDOR_RANGE2_DATA_SIZE; + rpc.dataoffs = IB_VENDOR_RANGE2_DATA_OFFS; + + portid->qp = 1; + if (!portid->qkey) + portid->qkey = IB_DEFAULT_QP1_QKEY; + + if ((len = mad_build_pkt(buf, &rpc, portid, NULL, NULL)) < 0) + IBPANIC("cannot build packet."); + + fd = madrpc_portid(); + agent = mad_class_agent(rpc.mgtclass); + timeout = ibd_timeout ? ibd_timeout : MAD_DEF_TIMEOUT_MS; + + if (umad_send(fd, agent, buf, len, timeout, 0) < 0) + IBPANIC("umad_send failed."); - if (!ib_vendor_call(data, portid, &call)) - return "vendor call failed"; + len = sizeof(buf) - umad_size(); + if (umad_recv(fd, buf, &len, timeout) < 0) + IBPANIC("umad_recv failed."); DEBUG("Got sysstat pong.."); if (attr != IB_PING_ATTR) @@ -256,7 +328,7 @@ int main(int argc, char **argv) madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); if (server) { - if (mad_register_server(sysstat_class, 0, 0, oui) < 0) + if (mad_register_server(sysstat_class, 1, 0, oui) < 0) IBERROR("can't serve class %d", sysstat_class); host_ncpu = build_cpuinfo(); @@ -266,7 +338,7 @@ int main(int argc, char **argv) exit(0); } - if (mad_register_client(sysstat_class, 0) < 0) + if (mad_register_client(sysstat_class, 1) < 0) IBERROR("can't register to sysstat class %d", sysstat_class); if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) -- 1.6.0.4.766.g6fc4a From michael.heinz at qlogic.com Mon Jan 26 14:13:42 2009 From: michael.heinz at qlogic.com (Mike Heinz) Date: Mon, 26 Jan 2009 16:13:42 -0600 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** I feel like I'm in a Monty Python Skit ***SPAM*** In-Reply-To: References: Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74E5D14E@MNEXMB1.qlogic.org> Does anyone know whose 
spam-detector is flagging so many OFED messages as spam? From hal.rosenstock at gmail.com Mon Jan 26 14:14:54 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 26 Jan 2009 17:14:54 -0500 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsysstat: use RMPP for client/server communication In-Reply-To: <20090126212621.GL5814@sashak.voltaire.com> References: <20090126212621.GL5814@sashak.voltaire.com> Message-ID: On Mon, Jan 26, 2009 at 4:26 PM, Sasha Khapyorsky wrote: > > This patch adds support for bigger than (256 - vendor2 data offset) data > sending by ibsysstat server using RMPP. It fixes bug#1237 - where server > output was truncated due to MAD size limitation. Seems like the class version should be bumped for this change. What's the behavior of old client with new server and new client with old server ? -- Hal > Signed-off-by: Sasha Khapyorsky > --- > infiniband-diags/src/ibsysstat.c | 128 +++++++++++++++++++++++++++++-------- > 1 files changed, 100 insertions(+), 28 deletions(-) > > diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c > index 5a39455..c20a6f0 100644 > --- a/infiniband-diags/src/ibsysstat.c > +++ b/infiniband-diags/src/ibsysstat.c > @@ -62,11 +62,57 @@ typedef struct cpu_info { > static cpu_info cpus[MAX_CPUS]; > static int host_ncpu; > > -static void > -mk_reply(int attr, void *data, int sz) > +static int server_respond(void *umad, int size) > +{ > + ib_rpc_t rpc = {0}; > + ib_rmpp_hdr_t rmpp = {0}; > + ib_portid_t rport; > + uint8_t *mad = umad_get_mad(umad); > + ib_mad_addr_t *mad_addr; > + > + if (!(mad_addr = umad_get_mad_addr(umad))) > + return -1; > + > + memset(&rport, 0, sizeof(rport)); > + > + rport.lid = ntohs(mad_addr->lid); > + rport.qp = ntohl(mad_addr->qpn); > + rport.qkey = ntohl(mad_addr->qkey); > + rport.sl = mad_addr->sl; > + if (!rport.qkey && rport.qp == 1) > + rport.qkey = IB_DEFAULT_QP1_QKEY; > + > + rpc.mgtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); > + rpc.method = 
IB_MAD_METHOD_GET | IB_MAD_RESPONSE; > + rpc.attr.id = mad_get_field(mad, 0, IB_MAD_ATTRID_F); > + rpc.attr.mod = mad_get_field(mad, 0, IB_MAD_ATTRMOD_F); > + rpc.oui = mad_get_field(mad, 0, IB_VEND2_OUI_F); > + rpc.trid = mad_get_field64(mad, 0, IB_MAD_TRID_F); > + > + rmpp.flags = IB_RMPP_FLAG_ACTIVE; > + > + DEBUG("responding %d bytes to %s, attr 0x%x mod 0x%x qkey %x", > + size, portid2str(&rport), rpc.attr.id, rpc.attr.mod, rport.qkey); > + > + if (mad_build_pkt(umad, &rpc, &rport, &rmpp, 0) < 0) > + return -1; > + > + if (ibdebug > 1) > + xdump(stderr, "mad respond pkt\n", mad, IB_MAD_SIZE); > + > + if (umad_send(madrpc_portid(), mad_class_agent(rpc.mgtclass), umad, > + size, rpc.timeout, 0) < 0) { > + DEBUG("send failed; %m"); > + return -1; > + } > + > + return 0; > +} > + > +static int mk_reply(int attr, void *data, int sz) > { > char *s = data; > - int n, i; > + int n, i, ret = 0; > > switch (attr) { > case IB_PING_ATTR: > @@ -75,43 +121,54 @@ mk_reply(int attr, void *data, int sz) > if (gethostname(s, sz) < 0) > snprintf(s, sz, "?hostname?"); > s[sz-1] = 0; > - if ((n = strlen(s)) >= sz) > + if ((n = strlen(s)) >= sz - 1) { > + ret = sz; > break; > + } > s[n] = '.'; > s += n+1; > sz -= n+1; > + ret += n + 1; > if (getdomainname(s, sz) < 0) > snprintf(s, sz, "?domainname?"); > - if (strlen(s) == 0) > + if ((n = strlen(s)) == 0) > s[-1] = 0; /* no domain */ > + else > + ret += n; > break; > case IB_CPUINFO_ATTR: > + s[0] = '\0'; > for (i = 0; i < host_ncpu && sz > 0; i++) { > n = snprintf(s, sz, "cpu %d: model %s MHZ %s\n", > i, cpus[i].model, cpus[i].mhz); > if (n >= sz) { > IBWARN("cpuinfo truncated"); > + ret = sz; > break; > } > sz -= n; > s += n; > + ret += n; > } > + ret++; > break; > default: > DEBUG("unknown attr %d", attr); > } > + return ret; > } > > -static char * > -ibsystat_serv(void) > +static uint8_t buf[2048]; > + > +static char *ibsystat_serv(void) > { > void *umad; > void *mad; > - int attr, mod; > + int attr, mod, size; > > 
DEBUG("starting to serve..."); > > - while ((umad = mad_receive(0, -1))) { > + while ((umad = mad_receive(buf, -1))) { > > mad = umad_get_mad(umad); > > @@ -120,12 +177,11 @@ ibsystat_serv(void) > > DEBUG("got packet: attr 0x%x mod 0x%x", attr, mod); > > - mk_reply(attr, (char *)mad + IB_VENDOR_RANGE2_DATA_OFFS, IB_VENDOR_RANGE2_DATA_SIZE); > + size = mk_reply(attr, mad + IB_VENDOR_RANGE2_DATA_OFFS, > + sizeof(buf) - umad_size() - IB_VENDOR_RANGE2_DATA_OFFS); > > - if (mad_respond(umad, 0, 0) < 0) > + if (server_respond(umad, IB_VENDOR_RANGE2_DATA_OFFS + size) < 0) > DEBUG("respond failed"); > - > - mad_free(umad); > } > > DEBUG("server out"); > @@ -144,24 +200,40 @@ match_attr(char *str) > return -1; > } > > -static char * > -ibsystat(ib_portid_t *portid, int attr) > +static char *ibsystat(ib_portid_t *portid, int attr) > { > - char data[IB_VENDOR_RANGE2_DATA_SIZE] = {0}; > - ib_vendor_call_t call; > + ib_rpc_t rpc = { 0 }; > + int fd, agent, timeout, len; > + void *data = umad_get_mad(buf) + IB_VENDOR_RANGE2_DATA_OFFS; > > DEBUG("Sysstat ping.."); > > - call.method = IB_MAD_METHOD_GET; > - call.mgmt_class = IB_VENDOR_OPENIB_SYSSTAT_CLASS; > - call.attrid = attr; > - call.mod = 0; > - call.oui = IB_OPENIB_OUI; > - call.timeout = 0; > - memset(&call.rmpp, 0, sizeof call.rmpp); > + rpc.mgtclass = IB_VENDOR_OPENIB_SYSSTAT_CLASS; > + rpc.method = IB_MAD_METHOD_GET; > + rpc.attr.id = attr; > + rpc.attr.mod = 0; > + rpc.oui = IB_OPENIB_OUI; > + rpc.timeout = 0; > + rpc.datasz = IB_VENDOR_RANGE2_DATA_SIZE; > + rpc.dataoffs = IB_VENDOR_RANGE2_DATA_OFFS; > + > + portid->qp = 1; > + if (!portid->qkey) > + portid->qkey = IB_DEFAULT_QP1_QKEY; > + > + if ((len = mad_build_pkt(buf, &rpc, portid, NULL, NULL)) < 0) > + IBPANIC("cannot build packet."); > + > + fd = madrpc_portid(); > + agent = mad_class_agent(rpc.mgtclass); > + timeout = ibd_timeout ? 
ibd_timeout : MAD_DEF_TIMEOUT_MS; > + > + if (umad_send(fd, agent, buf, len, timeout, 0) < 0) > + IBPANIC("umad_send failed."); > > - if (!ib_vendor_call(data, portid, &call)) > - return "vendor call failed"; > + len = sizeof(buf) - umad_size(); > + if (umad_recv(fd, buf, &len, timeout) < 0) > + IBPANIC("umad_recv failed."); > > DEBUG("Got sysstat pong.."); > if (attr != IB_PING_ATTR) > @@ -256,7 +328,7 @@ int main(int argc, char **argv) > madrpc_init(ibd_ca, ibd_ca_port, mgmt_classes, 3); > > if (server) { > - if (mad_register_server(sysstat_class, 0, 0, oui) < 0) > + if (mad_register_server(sysstat_class, 1, 0, oui) < 0) > IBERROR("can't serve class %d", sysstat_class); > > host_ncpu = build_cpuinfo(); > @@ -266,7 +338,7 @@ int main(int argc, char **argv) > exit(0); > } > > - if (mad_register_client(sysstat_class, 0) < 0) > + if (mad_register_client(sysstat_class, 1) < 0) > IBERROR("can't register to sysstat class %d", sysstat_class); > > if (ib_resolve_portid_str(&portid, argv[0], ibd_dest_type, ibd_sm_id) < 0) > -- > 1.6.0.4.766.g6fc4a > > From weiny2 at llnl.gov Mon Jan 26 15:33:47 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 26 Jan 2009 15:33:47 -0800 Subject: [ofa-general] Re: [PATCH 1/3 - no ibcommon] Create a new library libibnetdisc In-Reply-To: <20090121144923.GD3479@sashak.voltaire.com> References: <20090109154749.4b19c8bf.weiny2@llnl.gov> <20090121144923.GD3479@sashak.voltaire.com> Message-ID: <20090126153347.703ae0c3.weiny2@llnl.gov> Sasha, I have integrated your comments except for the responses below... 
On Wed, 21 Jan 2009 16:49:23 +0200 Sasha Khapyorsky wrote: > On 15:47 Fri 09 Jan , Ira Weiny wrote: > > > > Signed-off-by: weiny2 at llnl.gov > > > + > > diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h > > new file mode 100644 > > index 0000000..89f238f > > --- /dev/null > > +++ b/infiniband-diags/libibnetdisc/src/internal.h > > @@ -0,0 +1,82 @@ > > +/* > > + * Copyright (c) 2008 Lawrence Livermore National Laboratory > > + * > > + * This software is available to you under a choice of one of two > > + * licenses. You may choose to be licensed under the terms of the GNU > > + * General Public License (GPL) Version 2, available from the file > > + * COPYING in the main directory of this source tree, or the > > + * OpenIB.org BSD license below: > > + * > > + * Redistribution and use in source and binary forms, with or > > + * without modification, are permitted provided that the following > > + * conditions are met: > > + * > > + * - Redistributions of source code must retain the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer. > > + * > > + * - Redistributions in binary form must reproduce the above > > + * copyright notice, this list of conditions and the following > > + * disclaimer in the documentation and/or other materials > > + * provided with the distribution. > > + * > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > > + * SOFTWARE. 
> > + * > > + */ > > + > > +/** ========================================================================= > > + * Define the internal data structures. > > + */ > > + > > +#ifndef _INTERNAL_H_ > > +#define _INTERNAL_H_ > > + > > +#include > > + > > +struct ibnd_node { > > + /* This member MUST BE FIRST */ > > + ibnd_node_t node; > > + > > + /* internal use only */ > > + unsigned char ch_found; > > + struct ibnd_node *htnext; /* hash table list */ > > + struct ibnd_node *dnext; /* nodesdist next */ > > + struct ibnd_node *type_next; /* next based on type */ > > +}; > > +#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node) > > + > > +struct ibnd_port { > > + /* This member MUST BE FIRST */ > > + ibnd_port_t port; > > + > > + /* internal use only */ > > + struct ibnd_port *htnext; > > +}; > > +#define CONV_PORT_INTERNAL(port) ((struct ibnd_port *)port) > > + > > +struct ibnd_fabric { > > + /* This member MUST BE FIRST */ > > + ibnd_fabric_t fabric; > > + > > + /* internal use only */ > > + void *ibmad_port; > > + struct ibnd_node *nodestbl[HTSZ]; > > + struct ibnd_port *portstbl[HTSZ]; > > + struct ibnd_node *nodesdist[MAXHOPS+1]; > > + ibnd_chassis_t *first_chassis; > > + ibnd_chassis_t *current_chassis; > > + ibnd_chassis_t *last_chassis; > > + struct ibnd_node *switches; > > + struct ibnd_node *ch_adapters; > > + struct ibnd_node *routers; > > +}; > > +#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) > > Why should we hide this internal data so hard? > > Maybe we can use a single structures, to mark those fields (in comment) > as "for internal use" - somebody may want to use it. At first I did mark "for internal use", however, as I worked on the library more I realized just how important this data is to the library. If the user should misuse the data then the library will fail. I felt it was better to keep this house keeping internal. For now I plan on keeping it internal and we can discuss more later. 
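The "public struct as first member" layout discussed above can be sketched in isolation. This is only an illustration of the technique, not libibnetdisc code: pub_node_t, struct int_node, node_create(), node_mark_found(), node_is_found() and node_destroy() are hypothetical names made up for the example. It relies on the C guarantee that a pointer to a struct and a pointer to its first member can be converted to each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Public view, as a library header might expose it (hypothetical names). */
typedef struct pub_node {
	uint64_t guid;
	int type;
} pub_node_t;

/*
 * Internal view. The public part MUST be the first member so that a
 * pointer to the public struct can be cast back to the internal one.
 */
struct int_node {
	pub_node_t node;

	/* internal use only; callers never see these fields */
	struct int_node *htnext;
	unsigned char ch_found;
};

#define CONV_NODE_INTERNAL(n) ((struct int_node *)(n))

/* Allocate the internal struct but hand out only the public part. */
pub_node_t *node_create(uint64_t guid)
{
	struct int_node *in;

	/* The cast macro is only valid while this holds. */
	assert(offsetof(struct int_node, node) == 0);

	in = calloc(1, sizeof(*in));
	if (!in)
		return NULL;
	in->node.guid = guid;
	return &in->node;
}

/* Library code recovers its bookkeeping from the public pointer. */
void node_mark_found(pub_node_t *n)
{
	CONV_NODE_INTERNAL(n)->ch_found = 1;
}

int node_is_found(pub_node_t *n)
{
	return CONV_NODE_INTERNAL(n)->ch_found;
}

void node_destroy(pub_node_t *n)
{
	free(CONV_NODE_INTERNAL(n));
}
```

A caller only ever holds a pub_node_t pointer, so it cannot touch htnext or ch_found without deliberately casting. The cost is that every public object has to be allocated by the library itself, since a caller-allocated pub_node_t would not have the internal fields behind it.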
> > > + > > +#endif /* _INTERNAL_H_ */ > > diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map > > new file mode 100644 > > index 0000000..5e8c315 > > --- /dev/null > > +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map > > @@ -0,0 +1,27 @@ > > +IBNETDISC_1.0 { > > + global: > > + ibnd_debug; > > + ibnd_show_progress; > > + ibnd_discover_fabric; > > + ibnd_cache_fabric; > > + ibnd_read_fabric; > > + ibnd_destroy_fabric; > > + ibnd_find_node_guid; > > + ibnd_update_node; > > Where is ibnd_update_node() useful (in public API)? I have not used it in the diags yet but I have plans to use it in a daemon which will update nodes without rescanning the entire fabric. I can remove it if you really want but I feel this is the first step to using the library somewhere other than in "single run" tools. Ira From randy.dunlap at oracle.com Mon Jan 26 16:34:11 2009 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Mon, 26 Jan 2009 16:34:11 -0800 Subject: [ofa-general] [PATCH resend/lost] maintainers: moderated mailing list Message-ID: <20090126163411.7aeaff7f.randy.dunlap@oracle.com> Date: Thu, 23 Oct 2008 18:38:05 -0700 From: Randy Dunlap To: lkml Cc: general at lists.openfabrics.org, rolandd at cisco.com, akpm Subject: [PATCH] maintainers: moderated mailing list From: Randy Dunlap I got the "list is moderated" message, so add it here.
Signed-off-by: Randy Dunlap --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-next-20081023.orig/MAINTAINERS +++ linux-next-20081023/MAINTAINERS @@ -2144,7 +2144,7 @@ P: Sean Hefty M: sean.hefty at intel.com P: Hal Rosenstock M: hal.rosenstock at gmail.com -L: general at lists.openfabrics.org +L: general at lists.openfabrics.org (moderated for non-subscribers) W: http://www.openib.org/ T: git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git S: Supported -- From rdunlap at xenotime.net Mon Jan 26 16:33:03 2009 From: rdunlap at xenotime.net (Randy Dunlap) Date: Mon, 26 Jan 2009 16:33:03 -0800 Subject: [ofa-general] is this list still moderated for non-subscribers? Message-ID: <497E563F.7000900@xenotime.net> Regards, --- ~Randy From andy.grover at oracle.com Mon Jan 26 18:17:37 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:37 -0800 Subject: [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) Message-ID: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Hi Roland, This patchset adds support for RDS as an Infiniband ULP. RDS is an Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably, and is used currently in Oracle RAC and Exadata products. It's lived in OFED for 2+ years and I think it's time to get it upstream -- most likely into your -next tree for .30, but if it snuck into .29 via the "new code merge-window exception" then even better. I've run checkpatch & sparse to clean up as many issues as possible so what remains are really the design peculiarities (aka warts) that arise from being a protocol designed by one company for a single critical application. I think upstreaming this code is the first step towards working out those issues, and making the end result available to a wider audience. Also available for review at: git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6 for-roland Thoughts? shortlog follows. 
Thanks -- Regards -- Andy Andy Grover (21): RDS: Socket interface RDS: Main header file RDS: Congestion-handling code RDS: Transport code RDS: Info and stats RDS: Connection handling RDS: loopback RDS: sysctls RDS: Message parsing RDS: send.c RDS: recv.c RDS: RDMA support RDS/IB: Infiniband transport RDS/IB: Ring-handling code. RDS/IB: Implement RDMA ops using FMRs RDS/IB: Implement IB-specific datagram send. RDS/IB: Receive datagrams via IB RDS/IB: Stats and sysctls RDS: Documentation RDS: Kconfig and Makefile RDS: Add AF and PF #defines for RDS sockets Documentation/networking/rds.txt | 356 +++++++++++ drivers/infiniband/Kconfig | 2 + drivers/infiniband/Makefile | 1 + drivers/infiniband/ulp/rds/Kconfig | 13 + drivers/infiniband/ulp/rds/Makefile | 13 + drivers/infiniband/ulp/rds/af_rds.c | 677 +++++++++++++++++++++ drivers/infiniband/ulp/rds/bind.c | 202 +++++++ drivers/infiniband/ulp/rds/cong.c | 424 +++++++++++++ drivers/infiniband/ulp/rds/connection.c | 501 +++++++++++++++ drivers/infiniband/ulp/rds/ib.c | 312 ++++++++++ drivers/infiniband/ulp/rds/ib.h | 358 +++++++++++ drivers/infiniband/ulp/rds/ib_cm.c | 882 +++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/ib_rdma.c | 641 ++++++++++++++++++++ drivers/infiniband/ulp/rds/ib_recv.c | 894 +++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/ib_ring.c | 168 +++++ drivers/infiniband/ulp/rds/ib_send.c | 852 ++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/ib_stats.c | 95 +++ drivers/infiniband/ulp/rds/ib_sysctl.c | 137 +++++ drivers/infiniband/ulp/rds/info.c | 243 ++++++++ drivers/infiniband/ulp/rds/info.h | 43 ++ drivers/infiniband/ulp/rds/loop.c | 189 ++++++ drivers/infiniband/ulp/rds/loop.h | 9 + drivers/infiniband/ulp/rds/message.c | 414 +++++++++++++ drivers/infiniband/ulp/rds/page.c | 222 +++++++ drivers/infiniband/ulp/rds/rdma.c | 682 +++++++++++++++++++++ drivers/infiniband/ulp/rds/rdma.h | 84 +++ drivers/infiniband/ulp/rds/rds.h | 763 +++++++++++++++++++++++ 
drivers/infiniband/ulp/rds/rds_rdma.h | 245 ++++++++ drivers/infiniband/ulp/rds/recv.c | 550 +++++++++++++++++ drivers/infiniband/ulp/rds/send.c | 1006 +++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/stats.c | 150 +++++ drivers/infiniband/ulp/rds/sysctl.c | 164 +++++ drivers/infiniband/ulp/rds/threads.c | 273 +++++++++ drivers/infiniband/ulp/rds/transport.c | 134 ++++ include/linux/socket.h | 4 +- 35 files changed, 11702 insertions(+), 1 deletions(-) create mode 100644 Documentation/networking/rds.txt create mode 100644 drivers/infiniband/ulp/rds/Kconfig create mode 100644 drivers/infiniband/ulp/rds/Makefile create mode 100644 drivers/infiniband/ulp/rds/af_rds.c create mode 100644 drivers/infiniband/ulp/rds/bind.c create mode 100644 drivers/infiniband/ulp/rds/cong.c create mode 100644 drivers/infiniband/ulp/rds/connection.c create mode 100644 drivers/infiniband/ulp/rds/ib.c create mode 100644 drivers/infiniband/ulp/rds/ib.h create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c create mode 100644 drivers/infiniband/ulp/rds/ib_send.c create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c create mode 100644 drivers/infiniband/ulp/rds/info.c create mode 100644 drivers/infiniband/ulp/rds/info.h create mode 100644 drivers/infiniband/ulp/rds/loop.c create mode 100644 drivers/infiniband/ulp/rds/loop.h create mode 100644 drivers/infiniband/ulp/rds/message.c create mode 100644 drivers/infiniband/ulp/rds/page.c create mode 100644 drivers/infiniband/ulp/rds/rdma.c create mode 100644 drivers/infiniband/ulp/rds/rdma.h create mode 100644 drivers/infiniband/ulp/rds/rds.h create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h create mode 100644 drivers/infiniband/ulp/rds/recv.c create mode 100644 drivers/infiniband/ulp/rds/send.c create mode 
100644 drivers/infiniband/ulp/rds/stats.c create mode 100644 drivers/infiniband/ulp/rds/sysctl.c create mode 100644 drivers/infiniband/ulp/rds/threads.c create mode 100644 drivers/infiniband/ulp/rds/transport.c end From andy.grover at oracle.com Mon Jan 26 18:17:38 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:38 -0800 Subject: [ofa-general] [PATCH 01/21] RDS: Socket interface In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-2-git-send-email-andy.grover@oracle.com> Implement the RDS (Reliable Datagram Sockets) interface. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/af_rds.c | 677 +++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/bind.c | 202 +++++++++++ 2 files changed, 879 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/af_rds.c create mode 100644 drivers/infiniband/ulp/rds/bind.c diff --git a/drivers/infiniband/ulp/rds/af_rds.c b/drivers/infiniband/ulp/rds/af_rds.c new file mode 100644 index 0000000..7158438 --- /dev/null +++ b/drivers/infiniband/ulp/rds/af_rds.c @@ -0,0 +1,677 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" + +static int enable_rdma = 1; + +module_param(enable_rdma, int, 0444); +MODULE_PARM_DESC(enable_rdma, " Enable RDMA operations support"); + +/* this is just used for stats gathering :/ */ +static DEFINE_SPINLOCK(rds_sock_lock); +static unsigned long rds_sock_count; +static LIST_HEAD(rds_sock_list); +DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq); + +/* + * This is called as the final descriptor referencing this socket is closed. + * We have to unbind the socket so that another socket can be bound to the + * address it was using. + * + * We have to be careful about racing with the incoming path. sock_orphan() + * sets SOCK_DEAD and we use that as an indicator to the rx path that new + * messages shouldn't be queued. + */ +static int rds_release(struct socket *sock) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs; + unsigned long flags; + + if (sk == NULL) + goto out; + + rs = rds_sk_to_rs(sk); + + sock_orphan(sk); + /* Note - rds_clear_recv_queue grabs rs_recv_lock, so + * that ensures the recv path has completed messing + * with the socket. 
*/ + rds_clear_recv_queue(rs); + rds_cong_remove_socket(rs); + rds_remove_bound(rs); + rds_send_drop_to(rs, NULL); + rds_rdma_drop_keys(rs); + rds_notify_queue_get(rs, NULL); + + spin_lock_irqsave(&rds_sock_lock, flags); + list_del_init(&rs->rs_item); + rds_sock_count--; + spin_unlock_irqrestore(&rds_sock_lock, flags); + + sock->sk = NULL; + sock_put(sk); +out: + return 0; +} + +/* + * Careful not to race with rds_release -> sock_orphan which clears sk_sleep. + * _bh() isn't OK here, we're called from interrupt handlers. It's probably OK + * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but + * this seems more conservative. + * NB - normally, one would use sk_callback_lock for this, but we can + * get here from interrupts, whereas the network code grabs sk_callback_lock + * with _lock_bh only - so relying on sk_callback_lock introduces livelocks. + */ +void rds_wake_sk_sleep(struct rds_sock *rs) +{ + unsigned long flags; + + read_lock_irqsave(&rs->rs_recv_lock, flags); + __rds_wake_sk_sleep(rds_rs_to_sk(rs)); + read_unlock_irqrestore(&rs->rs_recv_lock, flags); +} + +static int rds_getname(struct socket *sock, struct sockaddr *uaddr, + int *uaddr_len, int peer) +{ + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + struct rds_sock *rs = rds_sk_to_rs(sock->sk); + + memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); + + /* racey, don't care */ + if (peer) { + if (!rs->rs_conn_addr) + return -ENOTCONN; + + sin->sin_port = rs->rs_conn_port; + sin->sin_addr.s_addr = rs->rs_conn_addr; + } else { + sin->sin_port = rs->rs_bound_port; + sin->sin_addr.s_addr = rs->rs_bound_addr; + } + + sin->sin_family = AF_INET; + + *uaddr_len = sizeof(*sin); + return 0; +} + +/* + * RDS' poll is without a doubt the least intuitive part of the interface, + * as POLLIN and POLLOUT do not behave entirely as you would expect from + * a network protocol. + * + * POLLIN is asserted if + * - there is data on the receive queue. 
+ * - to signal that a previously congested destination may have become + * uncongested + * - A notification has been queued to the socket (this can be a congestion + * update, or a RDMA completion). + * + * POLLOUT is asserted if there is room on the send queue. This does not mean + * however, that the next sendmsg() call will succeed. If the application tries + * to send to a congested destination, the system call may still fail (and + * return ENOBUFS). + */ +static unsigned int rds_poll(struct file *file, struct socket *sock, + poll_table *wait) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs = rds_sk_to_rs(sk); + unsigned int mask = 0; + unsigned long flags; + + poll_wait(file, sk->sk_sleep, wait); + + poll_wait(file, &rds_poll_waitq, wait); + + read_lock_irqsave(&rs->rs_recv_lock, flags); + if (!rs->rs_cong_monitor) { + /* When a congestion map was updated, we signal POLLIN for + * "historical" reasons. Applications can also poll for + * WRBAND instead. */ + if (rds_cong_updated_since(&rs->rs_cong_track)) + mask |= (POLLIN | POLLRDNORM | POLLWRBAND); + } else { + spin_lock(&rs->rs_lock); + if (rs->rs_cong_notify) + mask |= (POLLIN | POLLRDNORM); + spin_unlock(&rs->rs_lock); + } + if (!list_empty(&rs->rs_recv_queue) + || !list_empty(&rs->rs_notify_queue)) + mask |= (POLLIN | POLLRDNORM); + if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) + mask |= (POLLOUT | POLLWRNORM); + read_unlock_irqrestore(&rs->rs_recv_lock, flags); + + return mask; +} + +static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) +{ +#ifdef KERNEL_HAS_CORE_CALLING_DEV_IOCTL + return -ENOIOCTLCMD; +#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */ +#ifndef KERNEL_HAS_CORE_CALLING_DEV_IOCTL + return dev_ioctl(cmd, (void __user *)arg); +#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */ +} + +static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval, + int len) +{ + struct sockaddr_in sin; + int ret = 0; + + /* racing with another thread binding seems ok here */ + 
if (rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + if (len < sizeof(struct sockaddr_in)) { + ret = -EINVAL; + goto out; + } + + if (copy_from_user(&sin, optval, sizeof(sin))) { + ret = -EFAULT; + goto out; + } + + rds_send_drop_to(rs, &sin); +out: + return ret; +} + +static int rds_set_bool_option(unsigned char *optvar, char __user *optval, + int optlen) +{ + int value; + + if (optlen < sizeof(int)) + return -EINVAL; + if (get_user(value, (int __user *) optval)) + return -EFAULT; + *optvar = !!value; + return 0; +} + +static int rds_cong_monitor(struct rds_sock *rs, char __user *optval, + int optlen) +{ + int ret; + + ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen); + if (ret == 0) { + if (rs->rs_cong_monitor) { + rds_cong_add_socket(rs); + } else { + rds_cong_remove_socket(rs); + rs->rs_cong_mask = 0; + rs->rs_cong_notify = 0; + } + } + return ret; +} + +static int rds_setsockopt(struct socket *sock, int level, int optname, + char __user *optval, int optlen) +{ + struct rds_sock *rs = rds_sk_to_rs(sock->sk); + int ret; + + if (level != SOL_RDS) { + ret = -ENOPROTOOPT; + goto out; + } + + switch (optname) { + case RDS_CANCEL_SENT_TO: + ret = rds_cancel_sent_to(rs, optval, optlen); + break; + case RDS_GET_MR: + if (enable_rdma) + ret = rds_get_mr(rs, optval, optlen); + else + ret = -EOPNOTSUPP; + break; + case RDS_FREE_MR: + if (enable_rdma) + ret = rds_free_mr(rs, optval, optlen); + else + ret = -EOPNOTSUPP; + break; + case RDS_RECVERR: + ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen); + break; + case RDS_CONG_MONITOR: + ret = rds_cong_monitor(rs, optval, optlen); + break; + default: + ret = -ENOPROTOOPT; + } +out: + return ret; +} + +static int rds_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen) +{ + struct rds_sock *rs = rds_sk_to_rs(sock->sk); + int ret = -ENOPROTOOPT, len; + + if (level != SOL_RDS) + goto out; + + if (get_user(len, 
optlen)) { + ret = -EFAULT; + goto out; + } + + switch (optname) { + case RDS_INFO_FIRST ... RDS_INFO_LAST: + ret = rds_info_getsockopt(sock, optname, optval, + optlen); + break; + + case RDS_RECVERR: + if (len < sizeof(int)) + ret = -EINVAL; + else + if (put_user(rs->rs_recverr, (int __user *) optval) + || put_user(sizeof(int), optlen)) + ret = -EFAULT; + else + ret = 0; + break; + default: + break; + } + +out: + return ret; + +} + +static int rds_connect(struct socket *sock, struct sockaddr *uaddr, + int addr_len, int flags) +{ + struct sock *sk = sock->sk; + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + struct rds_sock *rs = rds_sk_to_rs(sk); + int ret = 0; + + lock_sock(sk); + + if (addr_len != sizeof(struct sockaddr_in)) { + ret = -EINVAL; + goto out; + } + + if (sin->sin_family != AF_INET) { + ret = -EAFNOSUPPORT; + goto out; + } + + if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) { + ret = -EDESTADDRREQ; + goto out; + } + + rs->rs_conn_addr = sin->sin_addr.s_addr; + rs->rs_conn_port = sin->sin_port; + +out: + release_sock(sk); + return ret; +} + +#ifdef KERNEL_HAS_PROTO_REGISTER +static struct proto rds_proto = { + .name = "RDS", + .owner = THIS_MODULE, + .obj_size = sizeof(struct rds_sock), +}; +#endif /* KERNEL_HAS_PROTO_REGISTER */ + +static struct proto_ops rds_proto_ops = { + .family = AF_RDS, + .owner = THIS_MODULE, + .release = rds_release, + .bind = rds_bind, + .connect = rds_connect, + .socketpair = sock_no_socketpair, + .accept = sock_no_accept, + .getname = rds_getname, + .poll = rds_poll, + .ioctl = rds_ioctl, + .listen = sock_no_listen, + .shutdown = sock_no_shutdown, + .setsockopt = rds_setsockopt, + .getsockopt = rds_getsockopt, + .sendmsg = rds_sendmsg, + .recvmsg = rds_recvmsg, + .mmap = sock_no_mmap, + .sendpage = sock_no_sendpage, +}; + +#ifndef KERNEL_HAS_PROTO_REGISTER +static struct sock *sk_alloc_compat(int pf, gfp_t gfp, struct proto *prot, + int zero_it) +{ + struct rds_sock *rs; + + sk = sk_alloc(pf, gfp, prot, zero_it); + 
if (sk == NULL) + return NULL; + + rs = kcalloc(1, sizeof(struct rds_sock), GFP_ATOMIC); + if (rs == NULL) { + sk_free(sk); + return NULL; + } + + /* sock_def_destruct frees this for us */ + sk->sk_protinfo = rs; + rs->rs_sk = sk; + + return sk; +} + +#undef sk_alloc +#define sk_alloc sk_alloc_compat +#endif /* KERNEL_HAS_PROTO_REGISTER */ + +static int __rds_create(struct socket *sock, struct sock *sk, int protocol) +{ + unsigned long flags; + struct rds_sock *rs; + + sock_init_data(sock, sk); +#ifndef KERNEL_HAS_PROTO_REGISTER + /* Can this be moved to sk_alloc_compat? */ + sk_set_owner(sk, THIS_MODULE); +#endif /* KERNEL_HAS_PROTO_REGISTER */ + sock->ops = &rds_proto_ops; + sk->sk_protocol = protocol; + + rs = rds_sk_to_rs(sk); + spin_lock_init(&rs->rs_lock); + rwlock_init(&rs->rs_recv_lock); + INIT_LIST_HEAD(&rs->rs_send_queue); + INIT_LIST_HEAD(&rs->rs_recv_queue); + INIT_LIST_HEAD(&rs->rs_notify_queue); + INIT_LIST_HEAD(&rs->rs_cong_list); + spin_lock_init(&rs->rs_rdma_lock); + rs->rs_rdma_keys = RB_ROOT; + + spin_lock_irqsave(&rds_sock_lock, flags); + list_add_tail(&rs->rs_item, &rds_sock_list); + rds_sock_count++; + spin_unlock_irqrestore(&rds_sock_lock, flags); + + return 0; +} + +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) +static int rds_create(struct socket *sock, int protocol) +{ + struct sock *sk; + + if (sock->type != SOCK_SEQPACKET || protocol) + return -ESOCKTNOSUPPORT; + + sk = sk_alloc(AF_RDS, GFP_ATOMIC, &rds_proto, 1); + if (sk == NULL) + return -ENOMEM; + + return __rds_create(sock, sk, protocol); +} +#else +static int rds_create(struct net *net, struct socket *sock, int protocol) +{ + struct sock *sk; + + if (sock->type != SOCK_SEQPACKET || protocol) + return -ESOCKTNOSUPPORT; + + sk = sk_alloc(net, AF_RDS, GFP_ATOMIC, &rds_proto); + if (!sk) + return -ENOMEM; + + return __rds_create(sock, sk, protocol); +} +#endif + +void rds_sock_addref(struct rds_sock *rs) +{ + sock_hold(rds_rs_to_sk(rs)); +} + +void rds_sock_put(struct rds_sock *rs) 
+{ + sock_put(rds_rs_to_sk(rs)); +} + +static struct net_proto_family rds_family_ops = { + .family = AF_RDS, + .create = rds_create, + .owner = THIS_MODULE, +}; + +static void rds_sock_inc_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + struct rds_sock *rs; + struct sock *sk; + struct rds_incoming *inc; + unsigned long flags; + unsigned int total = 0; + + len /= sizeof(struct rds_info_message); + + spin_lock_irqsave(&rds_sock_lock, flags); + + list_for_each_entry(rs, &rds_sock_list, rs_item) { + sk = rds_rs_to_sk(rs); + read_lock(&rs->rs_recv_lock); + + /* XXX too lazy to maintain counts.. */ + list_for_each_entry(inc, &rs->rs_recv_queue, i_item) { + total++; + if (total <= len) + rds_inc_info_copy(inc, iter, inc->i_saddr, + rs->rs_bound_addr, 1); + } + + read_unlock(&rs->rs_recv_lock); + } + + spin_unlock_irqrestore(&rds_sock_lock, flags); + + lens->nr = total; + lens->each = sizeof(struct rds_info_message); +} + +static void rds_sock_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + struct rds_info_socket sinfo; + struct rds_sock *rs; + unsigned long flags; + + len /= sizeof(struct rds_info_socket); + + spin_lock_irqsave(&rds_sock_lock, flags); + + if (len < rds_sock_count) + goto out; + + list_for_each_entry(rs, &rds_sock_list, rs_item) { + sinfo.sndbuf = rds_sk_sndbuf(rs); + sinfo.rcvbuf = rds_sk_rcvbuf(rs); + sinfo.bound_addr = rs->rs_bound_addr; + sinfo.connected_addr = rs->rs_conn_addr; + sinfo.bound_port = rs->rs_bound_port; + sinfo.connected_port = rs->rs_conn_port; + sinfo.inum = sock_i_ino(rds_rs_to_sk(rs)); + + rds_info_copy(iter, &sinfo, sizeof(sinfo)); + } + +out: + lens->nr = rds_sock_count; + lens->each = sizeof(struct rds_info_socket); + + spin_unlock_irqrestore(&rds_sock_lock, flags); +} + +/* + * The order is important here. 
+ * + * rds_trans_stop_listening() is called before conn_exit so new connections + * don't hit while existing ones are being torn down. + * + * rds_conn_exit() is before rds_trans_exit() as rds_conn_exit() calls into the + * transports to free connections and incoming fragments as they're torn down. + */ +static void __exit rds_exit(void) +{ + rds_ib_exit(); + sock_unregister(rds_family_ops.family); +#ifdef KERNEL_HAS_PROTO_REGISTER + proto_unregister(&rds_proto); +#endif /* KERNEL_HAS_PROTO_REGISTER */ + rds_trans_stop_listening(); + rds_conn_exit(); + rds_cong_exit(); + rds_sysctl_exit(); + rds_threads_exit(); + rds_stats_exit(); + rds_page_exit(); + rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info); + rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); +} +module_exit(rds_exit); + +static int __init rds_init(void) +{ + int ret; + +#ifndef KERNEL_HAS_NOT_DEFINED + /* the strange ifdef above has scripts/makepatch.sh strip this out */ +#if PF_RDS == 21 + printk(KERN_ERR "!!! This build of RDS is using PF 21 which is not " + "reserved\n"); + printk(KERN_ERR "!!! This is only suitable for testing, DO NOT " + "RELEASE THIS.\n"); + printk(KERN_ERR "!!! 
No, seriously.\n"); +#endif +#endif /* KERNEL_HAS_NOT_DEFINED */ + + ret = rds_conn_init(); + if (ret) + goto out; + ret = rds_threads_init(); + if (ret) + goto out_conn; + ret = rds_sysctl_init(); + if (ret) + goto out_threads; + ret = rds_stats_init(); + if (ret) + goto out_sysctl; +#ifdef KERNEL_HAS_PROTO_REGISTER + ret = proto_register(&rds_proto, 1); + if (ret) + goto out_stats; +#endif /* KERNEL_HAS_PROTO_REGISTER */ + ret = sock_register(&rds_family_ops); + if (ret) + goto out_proto; + + rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info); + rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); + + ret = rds_ib_init(); + if (ret) + goto out_sock; + goto out; + +out_sock: + sock_unregister(rds_family_ops.family); +out_proto: +#ifdef KERNEL_HAS_PROTO_REGISTER + proto_unregister(&rds_proto); +out_stats: +#endif /* KERNEL_HAS_PROTO_REGISTER */ + rds_stats_exit(); +out_sysctl: + rds_sysctl_exit(); +out_threads: + rds_threads_exit(); +out_conn: + rds_conn_exit(); + rds_cong_exit(); + rds_page_exit(); +out: + return ret; +} +module_init(rds_init); + +#define DRV_VERSION "4.0" +#define DRV_RELDATE "July 28, 2008" + +MODULE_AUTHOR("Zach Brown"); +MODULE_AUTHOR("Olaf Kirch"); +MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets" + " v" DRV_VERSION " (" DRV_RELDATE ")"); +MODULE_VERSION(DRV_VERSION); +MODULE_LICENSE("Dual BSD/GPL"); +MODULE_ALIAS_NETPROTO(PF_RDS); diff --git a/drivers/infiniband/ulp/rds/bind.c b/drivers/infiniband/ulp/rds/bind.c new file mode 100644 index 0000000..0f625d0 --- /dev/null +++ b/drivers/infiniband/ulp/rds/bind.c @@ -0,0 +1,202 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include "rds.h" + +/* + * XXX this probably still needs more work.. no INADDR_ANY, and rbtrees aren't + * particularly zippy. + * + * This is now called for every incoming frame so we arguably care much more + * about it than we used to. 
+ */ +static DEFINE_SPINLOCK(rds_bind_lock); +static struct rb_root rds_bind_tree = RB_ROOT; + +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port, + struct rds_sock *insert) +{ + struct rb_node **p = &rds_bind_tree.rb_node; + struct rb_node *parent = NULL; + struct rds_sock *rs; + u64 cmp; + u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port); + + while (*p) { + parent = *p; + rs = rb_entry(parent, struct rds_sock, rs_bound_node); + + cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) | + be16_to_cpu(rs->rs_bound_port); + + if (needle < cmp) + p = &(*p)->rb_left; + else if (needle > cmp) + p = &(*p)->rb_right; + else + return rs; + } + + if (insert) { + rb_link_node(&insert->rs_bound_node, parent, p); + rb_insert_color(&insert->rs_bound_node, &rds_bind_tree); + } + return NULL; +} + +/* + * Return the rds_sock bound at the given local address. + * + * The rx path can race with rds_release. We notice if rds_release() has + * marked this socket and don't return a rs ref to the rx path. 
+ */ +struct rds_sock *rds_find_bound(__be32 addr, __be16 port) +{ + struct rds_sock *rs; + unsigned long flags; + + spin_lock_irqsave(&rds_bind_lock, flags); + rs = rds_bind_tree_walk(addr, port, NULL); + if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) + rds_sock_addref(rs); + else + rs = NULL; + spin_unlock_irqrestore(&rds_bind_lock, flags); + + rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr), + ntohs(port)); + return rs; +} + +/* returns -ve errno or +ve port */ +static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port) +{ + unsigned long flags; + int ret = -EADDRINUSE; + u16 rover, last; + + if (*port != 0) { + rover = be16_to_cpu(*port); + last = rover; + } else { + rover = max_t(u16, net_random(), 2); + last = rover - 1; + } + + spin_lock_irqsave(&rds_bind_lock, flags); + + do { + if (rover == 0) + rover++; + if (rds_bind_tree_walk(addr, cpu_to_be16(rover), rs) == NULL) { + *port = cpu_to_be16(rover); + ret = 0; + break; + } + } while (rover++ != last); + + if (ret == 0) { + rs->rs_bound_addr = addr; + rs->rs_bound_port = *port; + rds_sock_addref(rs); + + rdstrace(RDS_BIND, RDS_LOW, + "rs %p binding to %u.%u.%u.%u:%d\n", + rs, NIPQUAD(addr), (int)ntohs(*port)); + + } + + spin_unlock_irqrestore(&rds_bind_lock, flags); + + return ret; +} + +void rds_remove_bound(struct rds_sock *rs) +{ + unsigned long flags; + + spin_lock_irqsave(&rds_bind_lock, flags); + + if (rs->rs_bound_addr) { + rdstrace(RDS_BIND, RDS_LOW, + "rs %p unbinding from %u.%u.%u.%u:%d\n", + rs, NIPQUAD(rs->rs_bound_addr), + (int)ntohs(rs->rs_bound_port)); + + rb_erase(&rs->rs_bound_node, &rds_bind_tree); + rds_sock_put(rs); + rs->rs_bound_addr = 0; + } + + spin_unlock_irqrestore(&rds_bind_lock, flags); +} + +int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) +{ + struct sock *sk = sock->sk; + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; + struct rds_sock *rs = rds_sk_to_rs(sk); + struct rds_transport *trans; + int ret = 0; + + 
lock_sock(sk); + + if (addr_len != sizeof(struct sockaddr_in) || + sin->sin_family != AF_INET || + rs->rs_bound_addr || + sin->sin_addr.s_addr == htonl(INADDR_ANY)) { + ret = -EINVAL; + goto out; + } + + ret = rds_add_bound(rs, sin->sin_addr.s_addr, &sin->sin_port); + if (ret) + goto out; + + trans = rds_trans_get_preferred(sin->sin_addr.s_addr); + if (trans == NULL) { + ret = -EADDRNOTAVAIL; + rds_remove_bound(rs); + goto out; + } + + rs->rs_transport = trans; + ret = 0; + +out: + release_sock(sk); + return ret; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:39 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:39 -0800 Subject: [ofa-general] [PATCH 02/21] RDS: Main header file In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-3-git-send-email-andy.grover@oracle.com> RDS's main data structure definitions and exported functions. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/rds.h | 763 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 763 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/rds.h diff --git a/drivers/infiniband/ulp/rds/rds.h b/drivers/infiniband/ulp/rds/rds.h new file mode 100644 index 0000000..133a237 --- /dev/null +++ b/drivers/infiniband/ulp/rds/rds.h @@ -0,0 +1,763 @@ +#ifndef _RDS_H +#define _RDS_H + +#include +#include +#include + +#include +#include "rds_rdma.h" + +/* + * RDS Network protocol version + */ +#define RDS_PROTOCOL_3_0 0x0300 +#define RDS_PROTOCOL_3_1 0x0301 +#define RDS_PROTOCOL_VERSION RDS_PROTOCOL_3_1 +#define RDS_PROTOCOL_MAJOR(v) ((v) >> 8) +#define RDS_PROTOCOL_MINOR(v) ((v) & 255) +#define RDS_PROTOCOL(maj, min) (((maj) << 8) | (min)) + +/* + * XXX randomly chosen, but at least seems to be unused: + * # 18464-18768 Unassigned + * We should do better. 
We want a reserved port to discourage unpriv'ed + * userspace from listening. + */ +#define RDS_PORT 18634 + +#ifndef AF_RDS +#define AF_RDS 28 /* Reliable Datagram Socket */ +#endif + +#ifndef PF_RDS +#define PF_RDS AF_RDS +#endif + +#ifndef SOL_RDS +#define SOL_RDS 272 +#endif + +#define KERNEL_HAS_PROTO_REGISTER 1 +#define KERNEL_HAS_INET_SK_RETURNING_INET_SOCK 1 +#define KERNEL_HAS_CORE_CALLING_DEV_IOCTL 1 + +#ifdef ATOMIC64_INIT +#define KERNEL_HAS_ATOMIC64 +#endif + +/* x86-64 doesn't include kmap_types.h from anywhere */ +#include +#include + +#include "info.h" + +#ifdef DEBUG +#define rdsdebug(fmt, args...) pr_debug("%s(): " fmt, __func__ , ##args) +#else +/* sigh, pr_debug() causes unused variable warnings */ +static inline void __attribute__ ((format (printf, 1, 2))) +rdsdebug(char *fmt, ...) +{ +} +#endif + +/* + * RDS trace facilities + */ +enum { + RDS_BIND = 0, + RDS_CONG, + RDS_CONNECTION, + RDS_RDMA, + RDS_PAGE, + RDS_SEND, + RDS_RECV, + RDS_THREADS, + RDS_INFO, + RDS_MESSAGE, + RDS_IB, + RDS_IB_CM, + RDS_IB_RDMA, + RDS_IB_RING, + RDS_IB_RECV, + RDS_IB_SEND, + RDS_TCP, + RDS_TCP_CONNECT, + RDS_TCP_LISTEN, + RDS_TCP_RECV, + RDS_TCP_SEND +}; + +enum { + RDS_ALWAYS = 0, + RDS_MINIMAL, + RDS_LOW, + RDS_MEDIUM, + RDS_HIGH, + RDS_VERBOSE +}; + + +#define rdstrace(fac, lvl, fmt, args...) do { \ + if (test_bit(fac, &rds_sysctl_trace_flags) && \ + lvl <= rds_sysctl_trace_level) \ + printk("%s(): " fmt, __func__, ##args); \ +} while (0) + +/* XXX is there one of these somewhere? 
*/ +#define ceil(x, y) \ + ({ unsigned long __x = (x), __y = (y); (__x + __y - 1) / __y; }) + +#define RDS_FRAG_SHIFT 12 +#define RDS_FRAG_SIZE ((unsigned int)(1 << RDS_FRAG_SHIFT)) + +#define RDS_CONG_MAP_BYTES (65536 / 8) +#define RDS_CONG_MAP_LONGS (RDS_CONG_MAP_BYTES / sizeof(unsigned long)) +#define RDS_CONG_MAP_PAGES (PAGE_ALIGN(RDS_CONG_MAP_BYTES) / PAGE_SIZE) +#define RDS_CONG_MAP_PAGE_BITS (PAGE_SIZE * 8) + +struct rds_cong_map { + struct rb_node m_rb_node; + __be32 m_addr; + wait_queue_head_t m_waitq; + struct list_head m_conn_list; + unsigned long m_page_addrs[RDS_CONG_MAP_PAGES]; +}; + + +/* + * This is how we will track the connection state: + * A connection is always in one of the following + * states. Updates to the state are atomic and imply + * a memory barrier. + */ +enum { + RDS_CONN_DOWN = 0, + RDS_CONN_CONNECTING, + RDS_CONN_DISCONNECTING, + RDS_CONN_UP, + RDS_CONN_ERROR, +}; + +/* Bits for c_flags */ +#define RDS_LL_SEND_FULL 0 +#define RDS_RECONNECT_PENDING 1 + +struct rds_connection { + struct hlist_node c_hash_node; + __be32 c_laddr; + __be32 c_faddr; + unsigned int c_loopback:1; + struct rds_connection *c_passive; + + struct rds_cong_map *c_lcong; + struct rds_cong_map *c_fcong; + + struct mutex c_send_lock; /* protect send ring */ + struct rds_message *c_xmit_rm; + unsigned long c_xmit_sg; + unsigned int c_xmit_hdr_off; + unsigned int c_xmit_data_off; + unsigned int c_xmit_rdma_sent; + + spinlock_t c_lock; /* protect msg queues */ + u64 c_next_tx_seq; + struct list_head c_send_queue; + struct list_head c_retrans; + + u64 c_next_rx_seq; + + struct rds_transport *c_trans; + void *c_transport_data; + + atomic_t c_state; + unsigned long c_flags; + unsigned long c_reconnect_jiffies; + struct delayed_work c_send_w; + struct delayed_work c_recv_w; + struct delayed_work c_conn_w; + struct work_struct c_down_w; + struct mutex c_cm_lock; /* protect conn state & cm */ + + struct list_head c_map_item; + unsigned long c_map_queued; + unsigned long 
c_map_offset; + unsigned long c_map_bytes; + + unsigned int c_unacked_packets; + unsigned int c_unacked_bytes; + + /* Protocol version */ + unsigned int c_version; +}; + +#define RDS_FLAG_CONG_BITMAP 0x01 +#define RDS_FLAG_ACK_REQUIRED 0x02 +#define RDS_FLAG_RETRANSMITTED 0x04 +#define RDS_MAX_ADV_CREDIT 255 + +/* + * Maximum space available for extension headers. + */ +#define RDS_HEADER_EXT_SPACE 16 + +struct rds_header { + __be64 h_sequence; + __be64 h_ack; + __be32 h_len; + __be16 h_sport; + __be16 h_dport; + u8 h_flags; + u8 h_credit; + u8 h_padding[4]; + __sum16 h_csum; + + u8 h_exthdr[RDS_HEADER_EXT_SPACE]; +}; + +/* + * Reserved - indicates end of extensions + */ +#define RDS_EXTHDR_NONE 0 + +/* + * This extension header is included in the very + * first message that is sent on a new connection, + * and identifies the protocol level. This will help + * rolling updates if a future change requires breaking + * the protocol. + * NB: This is no longer true for IB, where we do a version + * negotiation during the connection setup phase (protocol + * version information is included in the RDMA CM private data). + */ +#define RDS_EXTHDR_VERSION 1 +struct rds_ext_header_version { + __be32 h_version; +}; + +/* + * This extension header is included in the RDS message + * chasing an RDMA operation. + */ +#define RDS_EXTHDR_RDMA 2 +struct rds_ext_header_rdma { + __be32 h_rdma_rkey; +}; + +/* + * This extension header tells the peer about the + * destination of the requested RDMA + * operation. 
+ */ +#define RDS_EXTHDR_RDMA_DEST 3 +struct rds_ext_header_rdma_dest { + __be32 h_rdma_rkey; + __be32 h_rdma_offset; +}; + +#define __RDS_EXTHDR_MAX 16 /* for now */ + +struct rds_incoming { + atomic_t i_refcount; + struct list_head i_item; + struct rds_connection *i_conn; + struct rds_header i_hdr; + unsigned long i_rx_jiffies; + __be32 i_saddr; + + rds_rdma_cookie_t i_rdma_cookie; +}; + +/* + * m_sock_item and m_conn_item are on lists that are serialized under + * conn->c_lock. m_sock_item has additional meaning in that once it is empty + * the message will not be put back on the retransmit list after being sent. + * messages that are canceled while being sent rely on this. + * + * m_inc is used by loopback so that it can pass an incoming message straight + * back up into the rx path. It embeds a wire header which is also used by + * the send path, which is kind of awkward. + * + * m_sock_item indicates the message's presence on a socket's send or receive + * queue. m_rs will point to that socket. + * + * m_daddr is used by cancellation to prune messages to a given destination. + * + * The RDS_MSG_ON_SOCK and RDS_MSG_ON_CONN flags are used to avoid lock + * nesting. As paths iterate over messages on a sock, or conn, they must + * also lock the conn, or sock, to remove the message from those lists too. + * Testing the flag to determine if the message is still on the lists lets + * us avoid testing the list_head directly. That means each path can use + * the message's list_head to keep it on a local list while juggling locks + * without confusing the other path. + * + * m_ack_seq is an optional field set by transports who need a different + * sequence number range to invalidate. They can use this in a callback + * that they pass to rds_send_drop_acked() to see if each message has been + * acked. The HAS_ACK_SEQ flag can be used to detect messages which haven't + * had ack_seq set yet. 
+ */ +#define RDS_MSG_ON_SOCK 1 +#define RDS_MSG_ON_CONN 2 +#define RDS_MSG_HAS_ACK_SEQ 3 +#define RDS_MSG_ACK_REQUIRED 4 +#define RDS_MSG_RETRANSMITTED 5 +#define RDS_MSG_MAPPED 6 +#define RDS_MSG_PAGEVEC 7 + +struct rds_message { + atomic_t m_refcount; + struct list_head m_sock_item; + struct list_head m_conn_item; + struct rds_incoming m_inc; + u64 m_ack_seq; + __be32 m_daddr; + unsigned long m_flags; + + /* Never access m_rs without holding m_rs_lock. + * Lock nesting is + * rm->m_rs_lock + * -> rs->rs_lock + */ + spinlock_t m_rs_lock; + struct rds_sock *m_rs; + struct rds_rdma_op *m_rdma_op; + rds_rdma_cookie_t m_rdma_cookie; + struct rds_mr *m_rdma_mr; + unsigned int m_nents; + unsigned int m_count; + struct scatterlist m_sg[0]; +}; + +/* + * The RDS notifier is used (optionally) to tell the application about + * completed RDMA operations. Rather than keeping the whole rds message + * around on the queue, we allocate a small notifier that is put on the + * socket's notifier_list. Notifications are delivered to the application + * through control messages. + */ +struct rds_notifier { + struct list_head n_list; + uint64_t n_user_token; + int n_status; +}; + +/** + * struct rds_transport - transport specific behavioural hooks + * + * @xmit: .xmit is called by rds_send_xmit() to tell the transport to send + * part of a message. The caller serializes on the send_sem so this + * doesn't need to be reentrant for a given conn. The header must be + * sent before the data payload. .xmit must be prepared to send a + * message with no data payload. .xmit should return the number of + * bytes that were sent down the connection, including header bytes. + * Returning 0 tells the caller that it doesn't need to perform any + * additional work now. This is usually the case when the transport has + * filled the sending queue for its connection and will handle + * triggering the rds thread to continue the send when space becomes + * available. 
Returning -EAGAIN tells the caller to retry the send + * immediately. Returning -ENOMEM tells the caller to retry the send at + * some point in the future. + * + * @conn_shutdown: conn_shutdown stops traffic on the given connection. Once + * it returns the connection cannot call rds_recv_incoming(). + * This will only be called once after conn_connect returns + * non-zero success. The caller serializes this with + * the send and connecting paths (xmit_* and conn_*). The + * transport is responsible for other serialization, including + * rds_recv_incoming(). This is called in process context but + * should try hard not to block. + * + * @xmit_cong_map: This asks the transport to send the local bitmap down the + * given connection. XXX get a better story about the bitmap + * flag and header. + */ + +struct rds_transport { + struct list_head t_item; + struct module *t_owner; + char *t_name; + unsigned int t_prefer_loopback:1; + + int (*laddr_check)(__be32 addr); + int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp); + void (*conn_free)(void *data); + int (*conn_connect)(struct rds_connection *conn); + void (*conn_shutdown)(struct rds_connection *conn); + void (*xmit_prepare)(struct rds_connection *conn); + void (*xmit_complete)(struct rds_connection *conn); + int (*xmit)(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off); + int (*xmit_cong_map)(struct rds_connection *conn, + struct rds_cong_map *map, unsigned long offset); + int (*xmit_rdma)(struct rds_connection *conn, struct rds_rdma_op *op); + int (*recv)(struct rds_connection *conn); + int (*inc_copy_to_user)(struct rds_incoming *inc, struct iovec *iov, + size_t size); + void (*inc_purge)(struct rds_incoming *inc); + void (*inc_free)(struct rds_incoming *inc); + void (*listen_stop)(void); + unsigned int (*stats_info_copy)(struct rds_info_iterator *iter, + unsigned int avail); + void (*exit)(void); + void *(*get_mr)(struct scatterlist *sg, 
unsigned long nr_sg, + struct rds_sock *rs, u32 *key_ret); + void (*sync_mr)(void *trans_private, int direction); + void (*free_mr)(void *trans_private, int invalidate); + void (*flush_mrs)(void); +}; + +struct rds_sock { +#ifdef KERNEL_HAS_PROTO_REGISTER + struct sock rs_sk; +#endif +#ifndef KERNEL_HAS_PROTO_REGISTER + struct sock *rs_sk; +#endif + + u64 rs_user_addr; + u64 rs_user_bytes; + + /* + * bound_addr used for both incoming and outgoing, no INADDR_ANY + * support. + */ + struct rb_node rs_bound_node; + __be32 rs_bound_addr; + __be32 rs_conn_addr; + __be16 rs_bound_port; + __be16 rs_conn_port; + + /* + * This is only used to communicate the transport between bind and + * initiating connections. All other trans use is referenced through + * the connection. + */ + struct rds_transport *rs_transport; + + /* + * rds_sendmsg caches the conn it used the last time around. + * This helps avoid costly lookups. + */ + struct rds_connection *rs_conn; + + /* flag indicating we were congested or not */ + int rs_congested; + + /* rs_lock protects all these adjacent members before the newline */ + spinlock_t rs_lock; + struct list_head rs_send_queue; + u32 rs_snd_bytes; + int rs_rcv_bytes; + struct list_head rs_notify_queue; /* currently used for failed RDMAs */ + + /* Congestion wake_up. If rs_cong_monitor is set, we use cong_mask + * to decide whether the application should be woken up. + * If not set, we use rs_cong_track to find out whether a cong map + * update arrived. + */ + uint64_t rs_cong_mask; + uint64_t rs_cong_notify; + struct list_head rs_cong_list; + unsigned long rs_cong_track; + + /* + * rs_recv_lock protects the receive queue, and is + * used to serialize with rds_release. 
+ */ + rwlock_t rs_recv_lock; + struct list_head rs_recv_queue; + + /* just for stats reporting */ + struct list_head rs_item; + + /* these have their own lock */ + spinlock_t rs_rdma_lock; + struct rb_root rs_rdma_keys; + + /* Socket options - in case there will be more */ + unsigned char rs_recverr, + rs_cong_monitor; +}; + +#ifdef KERNEL_HAS_PROTO_REGISTER +static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk) +{ + return container_of(sk, struct rds_sock, rs_sk); +} +static inline struct sock *rds_rs_to_sk(struct rds_sock *rs) +{ + return &rs->rs_sk; +} +#endif /* KERNEL_HAS_PROTO_REGISTER */ +#ifndef KERNEL_HAS_PROTO_REGISTER +static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk) +{ + return (struct rds_sock *)sk->sk_protinfo; +} +static inline struct sock *rds_rs_to_sk(const struct rds_sock *rs) +{ + return rs->rs_sk; +} +#endif /* KERNEL_HAS_PROTO_REGISTER */ + +/* + * The stack assigns sk_sndbuf and sk_rcvbuf to twice the specified value + * to account for overhead. We don't account for overhead, we just apply + * the number of payload bytes to the specified value. 
+ */ +static inline int rds_sk_sndbuf(struct rds_sock *rs) +{ + return rds_rs_to_sk(rs)->sk_sndbuf / 2; +} +static inline int rds_sk_rcvbuf(struct rds_sock *rs) +{ + return rds_rs_to_sk(rs)->sk_rcvbuf / 2; +} + +struct rds_statistics { + uint64_t s_conn_reset; + uint64_t s_recv_drop_bad_checksum; + uint64_t s_recv_drop_old_seq; + uint64_t s_recv_drop_no_sock; + uint64_t s_recv_drop_dead_sock; + uint64_t s_recv_deliver_raced; + uint64_t s_recv_delivered; + uint64_t s_recv_queued; + uint64_t s_recv_immediate_retry; + uint64_t s_recv_delayed_retry; + uint64_t s_recv_ack_required; + uint64_t s_recv_rdma_bytes; + uint64_t s_recv_ping; + uint64_t s_send_queue_empty; + uint64_t s_send_queue_full; + uint64_t s_send_sem_contention; + uint64_t s_send_sem_queue_raced; + uint64_t s_send_immediate_retry; + uint64_t s_send_delayed_retry; + uint64_t s_send_drop_acked; + uint64_t s_send_ack_required; + uint64_t s_send_queued; + uint64_t s_send_rdma; + uint64_t s_send_rdma_bytes; + uint64_t s_send_pong; + uint64_t s_page_remainder_hit; + uint64_t s_page_remainder_miss; + uint64_t s_copy_to_user; + uint64_t s_copy_from_user; + uint64_t s_cong_update_queued; + uint64_t s_cong_update_received; + uint64_t s_cong_send_error; + uint64_t s_cong_send_blocked; +}; + +/* af_rds.c */ +void rds_sock_addref(struct rds_sock *rs); +void rds_sock_put(struct rds_sock *rs); +void rds_wake_sk_sleep(struct rds_sock *rs); +static inline void __rds_wake_sk_sleep(struct sock *sk) +{ + wait_queue_head_t *waitq = sk->sk_sleep; + + if (!sock_flag(sk, SOCK_DEAD) && waitq) + wake_up(waitq); +} +extern wait_queue_head_t rds_poll_waitq; + + +/* bind.c */ +int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len); +void rds_remove_bound(struct rds_sock *rs); +struct rds_sock *rds_find_bound(__be32 addr, __be16 port); + +/* cong.c */ +int rds_cong_get_maps(struct rds_connection *conn); +void rds_cong_add_conn(struct rds_connection *conn); +void rds_cong_remove_conn(struct rds_connection *conn); 
+void rds_cong_set_bit(struct rds_cong_map *map, __be16 port); +void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port); +int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock, struct rds_sock *rs); +void rds_cong_queue_updates(struct rds_cong_map *map); +void rds_cong_map_updated(struct rds_cong_map *map, uint64_t); +int rds_cong_updated_since(unsigned long *recent); +void rds_cong_add_socket(struct rds_sock *); +void rds_cong_remove_socket(struct rds_sock *); +void rds_cong_exit(void); +struct rds_message *rds_cong_update_alloc(struct rds_connection *conn); + +/* conn.c */ +int __init rds_conn_init(void); +void rds_conn_exit(void); +struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp); +struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp); +void rds_conn_destroy(struct rds_connection *conn); +void rds_conn_reset(struct rds_connection *conn); +void rds_conn_drop(struct rds_connection *conn); +void rds_for_each_conn_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens, + int (*visitor)(struct rds_connection *, void *), + size_t item_len); +void __rds_conn_error(struct rds_connection *conn, const char *, ...) + __attribute__ ((format (printf, 2, 3))); +#define rds_conn_error(conn, fmt...) 
\ + __rds_conn_error(conn, KERN_WARNING "RDS: " fmt) + +static inline int +rds_conn_transition(struct rds_connection *conn, int old, int new) +{ + return atomic_cmpxchg(&conn->c_state, old, new) == old; +} + +static inline int +rds_conn_state(struct rds_connection *conn) +{ + return atomic_read(&conn->c_state); +} + +static inline int +rds_conn_up(struct rds_connection *conn) +{ + return atomic_read(&conn->c_state) == RDS_CONN_UP; +} + +/* message.c */ +struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp); +struct rds_message *rds_message_copy_from_user(struct iovec *first_iov, + size_t total_len); +struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len); +void rds_message_populate_header(struct rds_header *hdr, __be16 sport, + __be16 dport, u64 seq); +int rds_message_add_extension(struct rds_header *hdr, + unsigned int type, const void *data, unsigned int len); +int rds_message_next_extension(struct rds_header *hdr, + unsigned int *pos, void *buf, unsigned int *buflen); +int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version); +int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version); +int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset); +int rds_message_inc_copy_to_user(struct rds_incoming *inc, + struct iovec *first_iov, size_t size); +void rds_message_inc_purge(struct rds_incoming *inc); +void rds_message_inc_free(struct rds_incoming *inc); +void rds_message_addref(struct rds_message *rm); +void rds_message_put(struct rds_message *rm); +void rds_message_wait(struct rds_message *rm); +void rds_message_unmapped(struct rds_message *rm); + +static inline void rds_message_make_checksum(struct rds_header *hdr) +{ + hdr->h_csum = 0; + hdr->h_csum = ip_fast_csum((void *) hdr, sizeof(*hdr) >> 2); +} + +static inline int rds_message_verify_checksum(const struct rds_header *hdr) +{ + return !hdr->h_csum || ip_fast_csum((void *) 
hdr, sizeof(*hdr) >> 2) == 0; +} + + +/* page.c */ +int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes, + gfp_t gfp); +int rds_page_copy_user(struct page *page, unsigned long offset, + void __user *ptr, unsigned long bytes, + int to_user); +#define rds_page_copy_to_user(page, offset, ptr, bytes) \ + rds_page_copy_user(page, offset, ptr, bytes, 1) +#define rds_page_copy_from_user(page, offset, ptr, bytes) \ + rds_page_copy_user(page, offset, ptr, bytes, 0) +void rds_page_exit(void); + +/* recv.c */ +void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, + __be32 saddr); +void rds_inc_addref(struct rds_incoming *inc); +void rds_inc_put(struct rds_incoming *inc); +void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, + struct rds_incoming *inc, gfp_t gfp, enum km_type km); +int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t size, int msg_flags); +void rds_clear_recv_queue(struct rds_sock *rs); +int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msg); +void rds_inc_info_copy(struct rds_incoming *inc, + struct rds_info_iterator *iter, + __be32 saddr, __be32 daddr, int flip); + +/* send.c */ +int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t payload_len); +void rds_send_reset(struct rds_connection *conn); +int rds_send_xmit(struct rds_connection *conn); +struct sockaddr_in; +void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest); +typedef int (*is_acked_func)(struct rds_message *rm, uint64_t ack); +void rds_send_drop_acked(struct rds_connection *conn, u64 ack, + is_acked_func is_acked); +int rds_send_acked_before(struct rds_connection *conn, u64 seq); +void rds_send_remove_from_sock(struct list_head *messages, int status); +int rds_send_pong(struct rds_connection *conn, __be16 dport); +struct rds_message *rds_send_get_message(struct rds_connection *, + struct rds_rdma_op *); + +/* rdma.c */ +void 
rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force); + +/* stats.c */ +DECLARE_PER_CPU(struct rds_statistics, rds_stats); +#define rds_stats_inc_which(which, member) do { \ + per_cpu(which, get_cpu()).member++; \ + put_cpu(); \ +} while (0) +#define rds_stats_inc(member) rds_stats_inc_which(rds_stats, member) +#define rds_stats_add_which(which, member, count) do { \ + per_cpu(which, get_cpu()).member += count; \ + put_cpu(); \ +} while (0) +#define rds_stats_add(member, count) rds_stats_add_which(rds_stats, member, count) +int __init rds_stats_init(void); +void rds_stats_exit(void); +void rds_stats_info_copy(struct rds_info_iterator *iter, + uint64_t *values, char **names, size_t nr); + +/* sysctl.c */ +int __init rds_sysctl_init(void); +void rds_sysctl_exit(void); +extern unsigned long rds_sysctl_sndbuf_min; +extern unsigned long rds_sysctl_sndbuf_default; +extern unsigned long rds_sysctl_sndbuf_max; +extern unsigned long rds_sysctl_reconnect_min_jiffies; +extern unsigned long rds_sysctl_reconnect_max_jiffies; +extern unsigned int rds_sysctl_max_unacked_packets; +extern unsigned int rds_sysctl_max_unacked_bytes; +extern unsigned int rds_sysctl_ping_enable; +extern unsigned long rds_sysctl_trace_flags; +extern unsigned int rds_sysctl_trace_level; + +/* threads.c */ +int __init rds_threads_init(void); +void rds_threads_exit(void); +extern struct workqueue_struct *rds_wq; +void rds_connect_worker(struct work_struct *); +void rds_shutdown_worker(struct work_struct *); +void rds_send_worker(struct work_struct *); +void rds_recv_worker(struct work_struct *); +void rds_connect_complete(struct rds_connection *conn); + +/* transport.c */ +int rds_trans_register(struct rds_transport *trans); +void rds_trans_unregister(struct rds_transport *trans); +struct rds_transport *rds_trans_get_preferred(__be32 addr); +void rds_trans_stop_listening(void); +unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail); +int __init 
rds_trans_init(void); +void rds_trans_exit(void); + +/* ib.c */ +void rds_ib_exit(void); +int rds_ib_init(void); + +#endif -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:40 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:40 -0800 Subject: [ofa-general] [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-4-git-send-email-andy.grover@oracle.com> RDS handles per-socket congestion by updating peers with a complete congestion map (8KB). This code keeps track of these maps for itself and ones received from peers. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/cong.c | 424 +++++++++++++++++++++++++++++++++++++ 1 files changed, 424 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/cong.c diff --git a/drivers/infiniband/ulp/rds/cong.c b/drivers/infiniband/ulp/rds/cong.c new file mode 100644 index 0000000..b7c49d2 --- /dev/null +++ b/drivers/infiniband/ulp/rds/cong.c @@ -0,0 +1,424 @@ +/* + * Copyright (c) 2007 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include + +#include "rds.h" + +/* + * This file implements the receive side of the unconventional congestion + * management in RDS. + * + * Messages waiting in the receive queue on the receiving socket are accounted + * against the socket's SO_RCVBUF option value. Only the payload bytes in the + * message are accounted for. If the number of bytes queued equals or exceeds + * rcvbuf then the socket is congested. All sends attempted to this socket's + * address should block or return -EWOULDBLOCK. + * + * Applications are expected to be reasonably tuned such that this situation + * very rarely occurs. An application encountering this "back-pressure" is + * considered a bug. + * + * This is implemented by having each node maintain bitmaps which indicate + * which ports on bound addresses are congested. As the bitmap changes it is + * sent through all the connections which terminate in the local address of the + * bitmap which changed. + * + * The bitmaps are allocated as connections are brought up. This avoids + * allocation in the interrupt handling path which queues messages on sockets. + * The dense bitmaps let transports send the entire bitmap on any bitmap change + * reasonably efficiently. 
This is much easier to implement than some + * finer-grained communication of per-port congestion. The sender does a very + * inexpensive bit test to test if the port it's about to send to is congested + * or not. + */ + +/* + * Interaction with poll is a tad tricky. We want all processes stuck in + * poll to wake up and check whether a congested destination became uncongested. + * The really sad thing is we have no idea which destinations the application + * wants to send to - we don't even know which rds_connections are involved. + * So until we implement a more flexible rds poll interface, we have to make + * do with this: + * We maintain a global counter that is incremented each time a congestion map + * update is received. Each rds socket tracks this value, and if rds_poll + * finds that the saved generation number is smaller than the global generation + * number, it wakes up the process. + */ +static atomic_t rds_cong_generation = ATOMIC_INIT(0); + +/* + * Congestion monitoring + */ +static LIST_HEAD(rds_cong_monitor); +static DEFINE_RWLOCK(rds_cong_monitor_lock); + +/* + * Yes, a global lock. It's used so infrequently that it's worth keeping it + * global to simplify the locking. It's only used in the following + * circumstances: + * + * - on connection buildup to associate a conn with its maps + * - on map changes to inform conns of a new map to send + * + * It's sadly ordered under the socket callback lock and the connection lock. + * Receive paths can mark ports congested from interrupt context so the + * lock masks interrupts. 
+ */ +static DEFINE_SPINLOCK(rds_cong_lock); +static struct rb_root rds_cong_tree = RB_ROOT; + +static struct rds_cong_map *rds_cong_tree_walk(__be32 addr, + struct rds_cong_map *insert) +{ + struct rb_node **p = &rds_cong_tree.rb_node; + struct rb_node *parent = NULL; + struct rds_cong_map *map; + + while (*p) { + parent = *p; + map = rb_entry(parent, struct rds_cong_map, m_rb_node); + + if (addr < map->m_addr) + p = &(*p)->rb_left; + else if (addr > map->m_addr) + p = &(*p)->rb_right; + else + return map; + } + + if (insert) { + rb_link_node(&insert->m_rb_node, parent, p); + rb_insert_color(&insert->m_rb_node, &rds_cong_tree); + } + return NULL; +} + +/* + * There is only ever one bitmap for any address. Connections try and allocate + * these bitmaps in the process getting pointers to them. The bitmaps are only + * ever freed as the module is removed after all connections have been freed. + */ +static struct rds_cong_map *rds_cong_from_addr(__be32 addr) +{ + struct rds_cong_map *map; + struct rds_cong_map *ret = NULL; + unsigned long zp; + unsigned long i; + unsigned long flags; + + map = kzalloc(sizeof(struct rds_cong_map), GFP_KERNEL); + if (map == NULL) + return NULL; + + map->m_addr = addr; + init_waitqueue_head(&map->m_waitq); + INIT_LIST_HEAD(&map->m_conn_list); + + for (i = 0; i < RDS_CONG_MAP_PAGES; i++) { + zp = get_zeroed_page(GFP_KERNEL); + if (zp == 0) + goto out; + map->m_page_addrs[i] = zp; + } + + spin_lock_irqsave(&rds_cong_lock, flags); + ret = rds_cong_tree_walk(addr, map); + spin_unlock_irqrestore(&rds_cong_lock, flags); + + if (ret == NULL) { + ret = map; + map = NULL; + } + +out: + if (map) { + for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++) + free_page(map->m_page_addrs[i]); + kfree(map); + } + + rdsdebug("map %p for addr %x\n", ret, be32_to_cpu(addr)); + + return ret; +} + +/* + * Put the conn on its local map's list. This is called when the conn is + * really added to the hash. It's nested under the rds_conn_lock, sadly. 
+ */ +void rds_cong_add_conn(struct rds_connection *conn) +{ + unsigned long flags; + + rdsdebug("conn %p now on map %p\n", conn, conn->c_lcong); + spin_lock_irqsave(&rds_cong_lock, flags); + list_add_tail(&conn->c_map_item, &conn->c_lcong->m_conn_list); + spin_unlock_irqrestore(&rds_cong_lock, flags); +} + +void rds_cong_remove_conn(struct rds_connection *conn) +{ + unsigned long flags; + + rdsdebug("removing conn %p from map %p\n", conn, conn->c_lcong); + spin_lock_irqsave(&rds_cong_lock, flags); + list_del_init(&conn->c_map_item); + spin_unlock_irqrestore(&rds_cong_lock, flags); +} + +int rds_cong_get_maps(struct rds_connection *conn) +{ + conn->c_lcong = rds_cong_from_addr(conn->c_laddr); + conn->c_fcong = rds_cong_from_addr(conn->c_faddr); + + if (conn->c_lcong == NULL || conn->c_fcong == NULL) + return -ENOMEM; + + return 0; +} + +void rds_cong_queue_updates(struct rds_cong_map *map) +{ + struct rds_connection *conn; + unsigned long flags; + + spin_lock_irqsave(&rds_cong_lock, flags); + + list_for_each_entry(conn, &map->m_conn_list, c_map_item) { + if (!test_and_set_bit(0, &conn->c_map_queued)) { + rds_stats_inc(s_cong_update_queued); + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + } + } + + spin_unlock_irqrestore(&rds_cong_lock, flags); +} + +void rds_cong_map_updated(struct rds_cong_map *map, uint64_t portmask) +{ + rdsdebug("waking map %p\n", map); + rdstrace(RDS_CONG, RDS_LOW, "waking map %p for %u.%u.%u.%u\n", + map, NIPQUAD(map->m_addr)); + rds_stats_inc(s_cong_update_received); + atomic_inc(&rds_cong_generation); + if (waitqueue_active(&map->m_waitq)) + wake_up(&map->m_waitq); + if (waitqueue_active(&rds_poll_waitq)) + wake_up_all(&rds_poll_waitq); + + if (portmask && !list_empty(&rds_cong_monitor)) { + unsigned long flags; + struct rds_sock *rs; + + read_lock_irqsave(&rds_cong_monitor_lock, flags); + list_for_each_entry(rs, &rds_cong_monitor, rs_cong_list) { + spin_lock(&rs->rs_lock); + rs->rs_cong_notify |= (rs->rs_cong_mask & portmask); + 
rs->rs_cong_mask &= ~portmask; + spin_unlock(&rs->rs_lock); + if (rs->rs_cong_notify) + rds_wake_sk_sleep(rs); + } + read_unlock_irqrestore(&rds_cong_monitor_lock, flags); + } +} +EXPORT_SYMBOL_GPL(rds_cong_map_updated); + +int rds_cong_updated_since(unsigned long *recent) +{ + unsigned long gen = atomic_read(&rds_cong_generation); + + if (likely(*recent == gen)) + return 0; + *recent = gen; + return 1; +} + +/* + * These should be using generic_{test,__{clear,set}}_le_bit() but some old + * kernels don't have them. Sigh. + */ +#if defined(__BIG_ENDIAN) +# define LE_BIT_XOR ((BITS_PER_LONG-1) & ~0x7) +#else +# if !defined(__LITTLE_ENDIAN) +# error "asm/byteorder.h didn't define __BIG or __LITTLE _ENDIAN ?" +# endif +# define LE_BIT_XOR 0 +#endif + +/* + * We're called under the locking that protects the sockets receive buffer + * consumption. This makes it a lot easier for the caller to only call us + * when it knows that an existing set bit needs to be cleared, and vice versa. + * We can't block and we need to deal with concurrent sockets working against + * the same per-address map. 
+ */ +void rds_cong_set_bit(struct rds_cong_map *map, __be16 port) +{ + unsigned long i; + unsigned long off; + + rdsdebug("setting port %u on map %p\n", be16_to_cpu(port), map); + rdstrace(RDS_CONG, RDS_LOW, + "setting congestion for %u.%u.%u.%u:%u in map %p\n", + NIPQUAD(map->m_addr), ntohs(port), map); + + i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; + off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; + + set_bit(off ^ LE_BIT_XOR, (void *)map->m_page_addrs[i]); +} + +void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port) +{ + unsigned long i; + unsigned long off; + + rdsdebug("clearing port %u on map %p\n", be16_to_cpu(port), map); + rdstrace(RDS_CONG, RDS_LOW, + "clearing congestion for %u.%u.%u.%u:%u in map %p\n", + NIPQUAD(map->m_addr), ntohs(port), map); + + i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; + off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; + + clear_bit(off ^ LE_BIT_XOR, (void *)map->m_page_addrs[i]); +} + +static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port) +{ + unsigned long i; + unsigned long off; + + i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS; + off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS; + + return test_bit(off ^ LE_BIT_XOR, (void *)map->m_page_addrs[i]); +} + +#undef LE_BIT_XOR + +void rds_cong_add_socket(struct rds_sock *rs) +{ + unsigned long flags; + + write_lock_irqsave(&rds_cong_monitor_lock, flags); + if (list_empty(&rs->rs_cong_list)) + list_add(&rs->rs_cong_list, &rds_cong_monitor); + write_unlock_irqrestore(&rds_cong_monitor_lock, flags); +} + +void rds_cong_remove_socket(struct rds_sock *rs) +{ + unsigned long flags; + struct rds_cong_map *map; + + write_lock_irqsave(&rds_cong_monitor_lock, flags); + list_del_init(&rs->rs_cong_list); + write_unlock_irqrestore(&rds_cong_monitor_lock, flags); + + /* update congestion map for now-closed port */ + spin_lock_irqsave(&rds_cong_lock, flags); + map = rds_cong_tree_walk(rs->rs_bound_addr, NULL); + spin_unlock_irqrestore(&rds_cong_lock, flags); + + 
if (map && rds_cong_test_bit(map, rs->rs_bound_port)) + { + rds_cong_clear_bit(map, rs->rs_bound_port); + rds_cong_queue_updates(map); + } +} + +int rds_cong_wait(struct rds_cong_map *map, __be16 port, int nonblock, + struct rds_sock *rs) +{ + if (!rds_cong_test_bit(map, port)) + return 0; + if (nonblock) { + if (rs && rs->rs_cong_monitor) { + unsigned long flags; + + /* It would have been nice to have an atomic set_bit on + * a uint64_t. */ + spin_lock_irqsave(&rs->rs_lock, flags); + rs->rs_cong_mask |= RDS_CONG_MONITOR_MASK(ntohs(port)); + spin_unlock_irqrestore(&rs->rs_lock, flags); + + /* Test again - a congestion update may have arrived in + * the meantime. */ + if (!rds_cong_test_bit(map, port)) + return 0; + } + rds_stats_inc(s_cong_send_error); + return -ENOBUFS; + } + + rds_stats_inc(s_cong_send_blocked); + rdsdebug("waiting on map %p for port %u\n", map, be16_to_cpu(port)); + + return wait_event_interruptible(map->m_waitq, + !rds_cong_test_bit(map, port)); +} + +void rds_cong_exit(void) +{ + struct rb_node *node; + struct rds_cong_map *map; + unsigned long i; + + while ((node = rb_first(&rds_cong_tree))) { + map = rb_entry(node, struct rds_cong_map, m_rb_node); + rdsdebug("freeing map %p\n", map); + rb_erase(&map->m_rb_node, &rds_cong_tree); + for (i = 0; i < RDS_CONG_MAP_PAGES && map->m_page_addrs[i]; i++) + free_page(map->m_page_addrs[i]); + kfree(map); + } +} + +/* + * Allocate a RDS message containing a congestion update. 
+ */ +struct rds_message *rds_cong_update_alloc(struct rds_connection *conn) +{ + struct rds_cong_map *map = conn->c_lcong; + struct rds_message *rm; + + rm = rds_message_map_pages(map->m_page_addrs, RDS_CONG_MAP_BYTES); + if (!IS_ERR(rm)) + rm->m_inc.i_hdr.h_flags = RDS_FLAG_CONG_BITMAP; + + return rm; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:41 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:41 -0800 Subject: [ofa-general] [PATCH 04/21] RDS: Transport code In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-5-git-send-email-andy.grover@oracle.com> RDS supports multiple transports. While this initial submission only supports Infiniband transport, this abstraction allows others to be added. We're working on an iWARP transport, and also see UDP over DCB as another possibility. This code handles transport registration. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/transport.c | 134 ++++++++++++++++++++++++++++++++ 1 files changed, 134 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/transport.c diff --git a/drivers/infiniband/ulp/rds/transport.c b/drivers/infiniband/ulp/rds/transport.c new file mode 100644 index 0000000..e78f8b3 --- /dev/null +++ b/drivers/infiniband/ulp/rds/transport.c @@ -0,0 +1,134 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "loop.h" + +static LIST_HEAD(transports); +static DECLARE_RWSEM(trans_sem); + +int rds_trans_register(struct rds_transport *trans) +{ + BUG_ON(strlen(trans->t_name) + 1 > + sizeof(((struct rds_info_connection *)0)->transport)); + + down_write(&trans_sem); + + list_add_tail(&trans->t_item, &transports); + printk(KERN_INFO "Registered RDS/%s transport\n", trans->t_name); + + up_write(&trans_sem); + + return 0; +} +EXPORT_SYMBOL_GPL(rds_trans_register); + +void rds_trans_unregister(struct rds_transport *trans) +{ + down_write(&trans_sem); + + list_del_init(&trans->t_item); + printk(KERN_INFO "Unregistered RDS/%s transport\n", trans->t_name); + + up_write(&trans_sem); +} +EXPORT_SYMBOL_GPL(rds_trans_unregister); + +struct rds_transport *rds_trans_get_preferred(__be32 addr) +{ + struct rds_transport *trans; + struct rds_transport *ret = NULL; + + if (IN_LOOPBACK(ntohl(addr))) + return &rds_loop_transport; + + down_read(&trans_sem); + list_for_each_entry(trans, &transports, t_item) { + if (trans->laddr_check(addr) == 0) { + ret = trans; + break; + } + } + up_read(&trans_sem); + + return ret; +} + +/* TODO remove this, only called in rds_exit, by which point we know + all the transports will have been unloaded */ +void rds_trans_stop_listening(void) +{ + struct rds_transport *trans; + + down_read(&trans_sem); + + list_for_each_entry(trans, &transports, t_item) + trans->listen_stop(); + + up_read(&trans_sem); +} + +/* + * This returns the number of stats entries in the snapshot and only + * copies them using the iter if there is enough space for them. The + * caller passes in the global stats so that we can size and copy while + * holding the lock. 
+ */ +unsigned int rds_trans_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail) + +{ + struct rds_transport *trans; + unsigned int total = 0; + unsigned int part; + + rds_info_iter_unmap(iter); + down_read(&trans_sem); + + list_for_each_entry(trans, &transports, t_item) { + if (trans->stats_info_copy == NULL) + continue; + + part = trans->stats_info_copy(iter, avail); + avail -= min(avail, part); + total += part; + } + + up_read(&trans_sem); + + return total; +} + -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:42 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:42 -0800 Subject: [ofa-general] [PATCH 05/21] RDS: Info and stats In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-6-git-send-email-andy.grover@oracle.com> RDS currently generates a lot of stats that are accessible via the rds-info utility. This code implements the support for this. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/info.c | 243 ++++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/info.h | 43 +++++++ drivers/infiniband/ulp/rds/stats.c | 150 ++++++++++++++++++++++ 3 files changed, 436 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/info.c create mode 100644 drivers/infiniband/ulp/rds/info.h create mode 100644 drivers/infiniband/ulp/rds/stats.c diff --git a/drivers/infiniband/ulp/rds/info.c b/drivers/infiniband/ulp/rds/info.c new file mode 100644 index 0000000..ff3ba1c --- /dev/null +++ b/drivers/infiniband/ulp/rds/info.c @@ -0,0 +1,243 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" + +/* + * This file implements a getsockopt() call which copies a set of fixed + * sized structs into a user-specified buffer as a means of providing + * read-only information about RDS. + * + * For a given information source there are a given number of fixed sized + * structs at a given time. The structs are only copied if the user-specified + * buffer is big enough. The destination pages that make up the buffer + * are pinned for the duration of the copy. 
+ * + * This gives us the following benefits: + * + * - simple implementation, no copy "position" across multiple calls + * - consistent snapshot of an info source + * - atomic copy works well with whatever locking info source has + * - one portable tool to get rds info across implementations + * - long-lived tool can get info without allocating + * + * at the following costs: + * + * - info source copy must be pinned, may be "large" + */ + +struct rds_info_iterator { + struct page **pages; + void *addr; + unsigned long offset; +}; + +static DEFINE_SPINLOCK(rds_info_lock); +static rds_info_func rds_info_funcs[RDS_INFO_LAST - RDS_INFO_FIRST + 1]; + +void rds_info_register_func(int optname, rds_info_func func) +{ + int offset = optname - RDS_INFO_FIRST; + + BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST); + + spin_lock(&rds_info_lock); + BUG_ON(rds_info_funcs[offset] != NULL); + rds_info_funcs[offset] = func; + spin_unlock(&rds_info_lock); +} +EXPORT_SYMBOL_GPL(rds_info_register_func); + +void rds_info_deregister_func(int optname, rds_info_func func) +{ + int offset = optname - RDS_INFO_FIRST; + + BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST); + + spin_lock(&rds_info_lock); + BUG_ON(rds_info_funcs[offset] != func); + rds_info_funcs[offset] = NULL; + spin_unlock(&rds_info_lock); +} +EXPORT_SYMBOL_GPL(rds_info_deregister_func); + +/* + * Typically we hold an atomic kmap across multiple rds_info_copy() calls + * because the kmap is so expensive. This must be called before using blocking + * operations while holding the mapping and as the iterator is torn down. + */ +void rds_info_iter_unmap(struct rds_info_iterator *iter) +{ + if (iter->addr != NULL) { + kunmap_atomic(iter->addr, KM_USER0); + iter->addr = NULL; + } +} + +/* + * get_user_pages() called flush_dcache_page() on the pages for us. 
+ */ +void rds_info_copy(struct rds_info_iterator *iter, void *data, + unsigned long bytes) +{ + unsigned long this; + + while (bytes) { + if (iter->addr == NULL) + iter->addr = kmap_atomic(*iter->pages, KM_USER0); + + this = min(bytes, PAGE_SIZE - iter->offset); + + rdsdebug("page %p addr %p offset %lu this %lu data %p " + "bytes %lu\n", *iter->pages, iter->addr, + iter->offset, this, data, bytes); + + memcpy(iter->addr + iter->offset, data, this); + + data += this; + bytes -= this; + iter->offset += this; + + if (iter->offset == PAGE_SIZE) { + kunmap_atomic(iter->addr, KM_USER0); + iter->addr = NULL; + iter->offset = 0; + iter->pages++; + } + } +} + +/* + * @optval points to the userspace buffer that the information snapshot + * will be copied into. + * + * @optlen on input is the size of the buffer in userspace. @optlen + * on output is the size of the requested snapshot in bytes. + * + * This function returns -errno if there is a failure, particularly -ENOSPC + * if the given userspace buffer was not large enough to fit the snapshot. + * On success it returns the positive number of bytes of each array element + * in the snapshot. 
+ */ +int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval, + int __user *optlen) +{ + struct rds_info_iterator iter; + struct rds_info_lengths lens; + unsigned long nr_pages = 0; + unsigned long start; + unsigned long i; + rds_info_func func; + struct page **pages = NULL; + int ret; + int len; + int total; + + if (get_user(len, optlen)) { + ret = -EFAULT; + goto out; + } + + /* check for all kinds of wrapping and the like */ + start = (unsigned long)optval; + if (len < 0 || len + PAGE_SIZE - 1 < len || start + len < start) { + ret = -EINVAL; + goto out; + } + + /* a 0 len call is just trying to probe its length */ + if (len == 0) + goto call_func; + + nr_pages = (PAGE_ALIGN(start + len) - (start & PAGE_MASK)) + >> PAGE_SHIFT; + + pages = kmalloc(nr_pages * sizeof(struct page *), GFP_KERNEL); + if (pages == NULL) { + ret = -ENOMEM; + goto out; + } + down_read(&current->mm->mmap_sem); + ret = get_user_pages(current, current->mm, start, nr_pages, 1, 0, + pages, NULL); + up_read(&current->mm->mmap_sem); + if (ret != nr_pages) { + if (ret > 0) + nr_pages = ret; + else + nr_pages = 0; + ret = -EAGAIN; /* XXX ? 
*/ + goto out; + } + + rdsdebug("len %d nr_pages %lu\n", len, nr_pages); + +call_func: + func = rds_info_funcs[optname - RDS_INFO_FIRST]; + if (func == NULL) { + ret = -ENOPROTOOPT; + goto out; + } + + iter.pages = pages; + iter.addr = NULL; + iter.offset = start & (PAGE_SIZE - 1); + + func(sock, len, &iter, &lens); + BUG_ON(lens.each == 0); + + total = lens.nr * lens.each; + + rds_info_iter_unmap(&iter); + + if (total > len) { + len = total; + ret = -ENOSPC; + } else { + len = total; + ret = lens.each; + } + + if (put_user(len, optlen)) + ret = -EFAULT; + +out: + for (i = 0; pages != NULL && i < nr_pages; i++) + put_page(pages[i]); + kfree(pages); + + return ret; +} diff --git a/drivers/infiniband/ulp/rds/info.h b/drivers/infiniband/ulp/rds/info.h new file mode 100644 index 0000000..dd0c285 --- /dev/null +++ b/drivers/infiniband/ulp/rds/info.h @@ -0,0 +1,43 @@ +#ifndef _RDS_INFO_H +#define _RDS_INFO_H + +/* FIXME remove these */ +#define RDS_INFO_COUNTERS 10000 +#define RDS_INFO_CONNECTIONS 10001 +#define RDS_INFO_FLOWS 10002 +#define RDS_INFO_SEND_MESSAGES 10003 +#define RDS_INFO_RETRANS_MESSAGES 10004 +#define RDS_INFO_RECV_MESSAGES 10005 +#define RDS_INFO_SOCKETS 10006 +#define RDS_INFO_TCP_SOCKETS 10007 + +#define RDS_INFO_FIRST RDS_INFO_COUNTERS +#define RDS_INFO_LAST RDS_INFO_CONNECTION_STATS + +struct rds_info_lengths { + unsigned int nr; + unsigned int each; +}; + +struct rds_info_iterator; + +/* + * These functions must fill in the fields of @lens to reflect the size + * of the available info source. If the snapshot fits in @len then it + * should be copied using @iter. The caller will deduce if it was copied + * or not by comparing the lengths. 
+ */ +typedef void (*rds_info_func)(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens); + +void rds_info_register_func(int optname, rds_info_func func); +void rds_info_deregister_func(int optname, rds_info_func func); +int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval, + int __user *optlen); +void rds_info_copy(struct rds_info_iterator *iter, void *data, + unsigned long bytes); +void rds_info_iter_unmap(struct rds_info_iterator *iter); + + +#endif diff --git a/drivers/infiniband/ulp/rds/stats.c b/drivers/infiniband/ulp/rds/stats.c new file mode 100644 index 0000000..74f1f96 --- /dev/null +++ b/drivers/infiniband/ulp/rds/stats.c @@ -0,0 +1,150 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" + +DEFINE_PER_CPU(struct rds_statistics, rds_stats) ____cacheline_aligned; +EXPORT_PER_CPU_SYMBOL_GPL(rds_stats); + +/* :.,$s/unsigned long\>.*\= sizeof(ctr.name)); + strncpy(ctr.name, names[i], sizeof(ctr.name) - 1); + ctr.value = values[i]; + + rds_info_copy(iter, &ctr, sizeof(ctr)); + } +} +EXPORT_SYMBOL_GPL(rds_stats_info_copy); + +/* + * This gives global counters across all the transports. The strings + * are copied in so that the tool doesn't need knowledge of the specific + * stats that we're exporting. Some are pretty implementation dependent + * and may change over time. That doesn't stop them from being useful. + * + * This is the only function in the chain that knows about the byte granular + * length in userspace. It converts it to number of stat entries that the + * rest of the functions operate in. 
+ */ +static void rds_stats_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + struct rds_statistics stats = {0, }; + uint64_t *src; + uint64_t *sum; + size_t i; + int cpu; + unsigned int avail; + + avail = len / sizeof(struct rds_info_counter); + + if (avail < ARRAY_SIZE(rds_stat_names)) { + avail = 0; + goto trans; + } + + for_each_online_cpu(cpu) { + src = (uint64_t *)&(per_cpu(rds_stats, cpu)); + sum = (uint64_t *)&stats; + for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++) + *(sum++) += *(src++); + } + + rds_stats_info_copy(iter, (uint64_t *)&stats, rds_stat_names, + ARRAY_SIZE(rds_stat_names)); + avail -= ARRAY_SIZE(rds_stat_names); + +trans: + lens->each = sizeof(struct rds_info_counter); + lens->nr = rds_trans_stats_info_copy(iter, avail) + + ARRAY_SIZE(rds_stat_names); +} + +void rds_stats_exit(void) +{ + rds_info_deregister_func(RDS_INFO_COUNTERS, rds_stats_info); +} + +int __init rds_stats_init(void) +{ + rds_info_register_func(RDS_INFO_COUNTERS, rds_stats_info); + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:43 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:43 -0800 Subject: [ofa-general] [PATCH 06/21] RDS: Connection handling In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-7-git-send-email-andy.grover@oracle.com> While arguably the fact that the underlying transport needs a connection to convey RDS's datagrams reliably is not important to rds proper, the transports implemented so far (IB and TCP) have both been connection-oriented, and so the connection state machine-related code is in the common rds code. This patch also includes several work items, to handle connecting, sending, receiving, and shutdown. 
Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/connection.c | 501 +++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/threads.c | 273 +++++++++++++++++ 2 files changed, 774 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/connection.c create mode 100644 drivers/infiniband/ulp/rds/threads.c diff --git a/drivers/infiniband/ulp/rds/connection.c b/drivers/infiniband/ulp/rds/connection.c new file mode 100644 index 0000000..6174629 --- /dev/null +++ b/drivers/infiniband/ulp/rds/connection.c @@ -0,0 +1,501 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "loop.h" +#include "rdma.h" + +#define RDS_CONNECTION_HASH_BITS 12 +#define RDS_CONNECTION_HASH_ENTRIES (1 << RDS_CONNECTION_HASH_BITS) +#define RDS_CONNECTION_HASH_MASK (RDS_CONNECTION_HASH_ENTRIES - 1) + +/* converting this to RCU is a chore for another day.. */ +static DEFINE_SPINLOCK(rds_conn_lock); +static unsigned long rds_conn_count; +static struct hlist_head rds_conn_hash[RDS_CONNECTION_HASH_ENTRIES]; +static struct kmem_cache *rds_conn_slab; + +static struct hlist_head *rds_conn_bucket(__be32 laddr, __be32 faddr) +{ + /* Pass NULL, don't need struct net for hash */ + unsigned long hash = inet_ehashfn(NULL, + be32_to_cpu(laddr), 0, + be32_to_cpu(faddr), 0); + return &rds_conn_hash[hash & RDS_CONNECTION_HASH_MASK]; +} + +#define rds_conn_info_set(var, test, suffix) do { \ + if (test) \ + var |= RDS_INFO_CONNECTION_FLAG_##suffix; \ +} while (0) + +static inline int rds_conn_is_sending(struct rds_connection *conn) +{ + int ret = 0; + + if (!mutex_trylock(&conn->c_send_lock)) + ret = 1; + else + mutex_unlock(&conn->c_send_lock); + + return ret; +} + +static struct rds_connection *rds_conn_lookup(struct hlist_head *head, + __be32 laddr, __be32 faddr, + struct rds_transport *trans) +{ + struct rds_connection *conn, *ret = NULL; + struct hlist_node *pos; + + hlist_for_each_entry(conn, pos, head, c_hash_node) { + if (conn->c_faddr == faddr && conn->c_laddr == laddr && + conn->c_trans == trans) { + ret = conn; + break; + } + } + rdsdebug("returning conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n", ret, + NIPQUAD(laddr), NIPQUAD(faddr)); + return ret; +} + +/* + * This is called by transports as they're bringing down a connection. + * It clears partial message state so that the transport can start sending + * and receiving over this connection again in the future. It is up to + * the transport to have serialized this call with its send and recv. 
+ */ +void rds_conn_reset(struct rds_connection *conn) +{ + rdstrace(RDS_CONNECTION, RDS_MINIMAL, + "connection %u.%u.%u.%u to %u.%u.%u.%u reset\n", + NIPQUAD(conn->c_laddr),NIPQUAD(conn->c_faddr)); + + rds_stats_inc(s_conn_reset); + rds_send_reset(conn); + conn->c_flags = 0; + + /* Do not clear next_rx_seq here, else we cannot distinguish + * retransmitted packets from new packets, and will hand all + * of them to the application. That is not consistent with the + * reliability guarantees of RDS. */ +} + +/* + * There is only ever one 'conn' for a given pair of addresses in the + * system at a time. They contain messages to be retransmitted and so + * span the lifetime of the actual underlying transport connections. + * + * For now they are not garbage collected once they're created. They + * are torn down as the module is removed, if ever. + */ +static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp, + int is_outgoing) +{ + struct rds_connection *conn, *tmp, *parent = NULL; + struct hlist_head *head = rds_conn_bucket(laddr, faddr); + unsigned long flags; + int ret; + + spin_lock_irqsave(&rds_conn_lock, flags); + conn = rds_conn_lookup(head, laddr, faddr, trans); + if (conn + && conn->c_loopback + && conn->c_trans != &rds_loop_transport + && !is_outgoing) { + /* This is a looped back IB connection, and we're + * called by the code handling the incoming connect. + * We need a second connection object into which we + * can stick the other QP.
*/ + parent = conn; + conn = parent->c_passive; + } + spin_unlock_irqrestore(&rds_conn_lock, flags); + if (conn) + goto out; + + conn = kmem_cache_alloc(rds_conn_slab, gfp); + if (conn == NULL) { + conn = ERR_PTR(-ENOMEM); + goto out; + } + + memset(conn, 0, sizeof(*conn)); + + INIT_HLIST_NODE(&conn->c_hash_node); + conn->c_version = RDS_PROTOCOL_3_0; + conn->c_laddr = laddr; + conn->c_faddr = faddr; + spin_lock_init(&conn->c_lock); + conn->c_next_tx_seq = 1; + + mutex_init(&conn->c_send_lock); + INIT_LIST_HEAD(&conn->c_send_queue); + INIT_LIST_HEAD(&conn->c_retrans); + + ret = rds_cong_get_maps(conn); + if (ret) { + kmem_cache_free(rds_conn_slab, conn); + conn = ERR_PTR(ret); + goto out; + } + + /* + * This is where a connection becomes loopback. If *any* RDS sockets + * can bind to the destination address then we'd rather the messages + * flow through loopback rather than either transport. + */ + if (rds_trans_get_preferred(faddr)) { + conn->c_loopback = 1; + if (is_outgoing && trans->t_prefer_loopback) { + /* "outgoing" connection - and the transport + * says it wants the connection handled by the + * loopback transport. This is what TCP does. + */ + trans = &rds_loop_transport; + } + } + + conn->c_trans = trans; + + ret = trans->conn_alloc(conn, gfp); + if (ret) { + kmem_cache_free(rds_conn_slab, conn); + conn = ERR_PTR(ret); + goto out; + } + + atomic_set(&conn->c_state, RDS_CONN_DOWN); + conn->c_reconnect_jiffies = 0; + INIT_DELAYED_WORK(&conn->c_send_w, rds_send_worker); + INIT_DELAYED_WORK(&conn->c_recv_w, rds_recv_worker); + INIT_DELAYED_WORK(&conn->c_conn_w, rds_connect_worker); + INIT_WORK(&conn->c_down_w, rds_shutdown_worker); + mutex_init(&conn->c_cm_lock); + conn->c_flags = 0; + + rdsdebug("allocated conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n", conn, + NIPQUAD(laddr), NIPQUAD(faddr)); + + rdstrace(RDS_CONNECTION, RDS_MINIMAL, + "allocated conn %p for %u.%u.%u.%u -> %u.%u.%u.%u over %s %s\n", + conn, NIPQUAD(laddr), NIPQUAD(faddr), + trans->t_name ? 
trans->t_name : "[unknown]", + is_outgoing ? "(outgoing)" : ""); + + spin_lock_irqsave(&rds_conn_lock, flags); + if (parent == NULL) { + tmp = rds_conn_lookup(head, laddr, faddr, trans); + if (tmp == NULL) + hlist_add_head(&conn->c_hash_node, head); + } else { + tmp = parent->c_passive; + if (!tmp) + parent->c_passive = conn; + } + + if (tmp) { + trans->conn_free(conn->c_transport_data); + kmem_cache_free(rds_conn_slab, conn); + conn = tmp; + } else { + rds_cong_add_conn(conn); + rds_conn_count++; + } + + spin_unlock_irqrestore(&rds_conn_lock, flags); + +out: + return conn; +} + +struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp) +{ + return __rds_conn_create(laddr, faddr, trans, gfp, 0); +} +EXPORT_SYMBOL_GPL(rds_conn_create); + +struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr, + struct rds_transport *trans, gfp_t gfp) +{ + return __rds_conn_create(laddr, faddr, trans, gfp, 1); +} +EXPORT_SYMBOL_GPL(rds_conn_create_outgoing); + +void rds_conn_destroy(struct rds_connection *conn) +{ + struct rds_message *rm, *rtmp; + + rdsdebug("freeing conn %p for %u.%u.%u.%u -> " + "%u.%u.%u.%u\n", conn, NIPQUAD(conn->c_laddr), + NIPQUAD(conn->c_faddr)); + + rdstrace(RDS_CONNECTION, RDS_MINIMAL, + "freeing conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n", + conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr)); + + hlist_del_init(&conn->c_hash_node); + + /* wait for the rds thread to shut it down */ + atomic_set(&conn->c_state, RDS_CONN_ERROR); + cancel_delayed_work(&conn->c_conn_w); + queue_work(rds_wq, &conn->c_down_w); + flush_workqueue(rds_wq); + + /* tear down queued messages */ + list_for_each_entry_safe(rm, rtmp, + &conn->c_send_queue, + m_conn_item) { + list_del_init(&rm->m_conn_item); + BUG_ON(!list_empty(&rm->m_sock_item)); + rds_message_put(rm); + } + if (conn->c_xmit_rm) + rds_message_put(conn->c_xmit_rm); + + conn->c_trans->conn_free(conn->c_transport_data); + + /* + * The congestion maps 
aren't freed up here. They're + * freed by rds_cong_exit() after all the connections + * have been freed. + */ + rds_cong_remove_conn(conn); + + BUG_ON(!list_empty(&conn->c_retrans)); + kmem_cache_free(rds_conn_slab, conn); + + rds_conn_count--; +} +EXPORT_SYMBOL_GPL(rds_conn_destroy); + +static void rds_conn_message_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens, + int want_send) +{ + struct hlist_head *head; + struct hlist_node *pos; + struct list_head *list; + struct rds_connection *conn; + struct rds_message *rm; + unsigned long flags; + unsigned int total = 0; + size_t i; + + len /= sizeof(struct rds_info_message); + + spin_lock_irqsave(&rds_conn_lock, flags); + + for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash); + i++, head++) { + hlist_for_each_entry(conn, pos, head, c_hash_node) { + if (want_send) + list = &conn->c_send_queue; + else + list = &conn->c_retrans; + + spin_lock(&conn->c_lock); + + /* XXX too lazy to maintain counts.. 
*/ + list_for_each_entry(rm, list, m_conn_item) { + total++; + if (total <= len) + rds_inc_info_copy(&rm->m_inc, iter, + conn->c_laddr, + conn->c_faddr, 0); + } + + spin_unlock(&conn->c_lock); + } + } + + spin_unlock_irqrestore(&rds_conn_lock, flags); + + lens->nr = total; + lens->each = sizeof(struct rds_info_message); +} + +static void rds_conn_message_info_send(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_conn_message_info(sock, len, iter, lens, 1); +} + +static void rds_conn_message_info_retrans(struct socket *sock, + unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_conn_message_info(sock, len, iter, lens, 0); +} + +void rds_for_each_conn_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens, + int (*visitor)(struct rds_connection *, void *), + size_t item_len) +{ + uint64_t buffer[(item_len + 7) / 8]; + struct hlist_head *head; + struct hlist_node *pos; + struct hlist_node *tmp; + struct rds_connection *conn; + unsigned long flags; + size_t i; + + spin_lock_irqsave(&rds_conn_lock, flags); + + lens->nr = 0; + lens->each = item_len; + + for (i = 0, head = rds_conn_hash; i < ARRAY_SIZE(rds_conn_hash); + i++, head++) { + hlist_for_each_entry_safe(conn, pos, tmp, head, c_hash_node) { + + /* XXX no c_lock usage.. */ + if (!visitor(conn, buffer)) + continue; + + /* We copy as much as we can fit in the buffer, + * but we count all items so that the caller + * can resize the buffer. 
*/ + if (len >= item_len) { + rds_info_copy(iter, buffer, item_len); + len -= item_len; + } + lens->nr++; + } + } + + spin_unlock_irqrestore(&rds_conn_lock, flags); +} +EXPORT_SYMBOL_GPL(rds_for_each_conn_info); + +static int rds_conn_info_visitor(struct rds_connection *conn, + void *buffer) +{ + struct rds_info_connection *cinfo = buffer; + + cinfo->next_tx_seq = conn->c_next_tx_seq; + cinfo->next_rx_seq = conn->c_next_rx_seq; + cinfo->laddr = conn->c_laddr; + cinfo->faddr = conn->c_faddr; + strncpy(cinfo->transport, conn->c_trans->t_name, + sizeof(cinfo->transport)); + cinfo->flags = 0; + + rds_conn_info_set(cinfo->flags, + rds_conn_is_sending(conn), SENDING); + /* XXX Future: return the state rather than these funky bits */ + rds_conn_info_set(cinfo->flags, + atomic_read(&conn->c_state) == RDS_CONN_CONNECTING, + CONNECTING); + rds_conn_info_set(cinfo->flags, + atomic_read(&conn->c_state) == RDS_CONN_UP, + CONNECTED); + return 1; +} + +static void rds_conn_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_for_each_conn_info(sock, len, iter, lens, + rds_conn_info_visitor, + sizeof(struct rds_info_connection)); +} + +int __init rds_conn_init(void) +{ + rds_conn_slab = kmem_cache_create("rds_connection", + sizeof(struct rds_connection), + 0, 0, NULL); + if (rds_conn_slab == NULL) + return -ENOMEM; + + rds_info_register_func(RDS_INFO_CONNECTIONS, rds_conn_info); + rds_info_register_func(RDS_INFO_SEND_MESSAGES, + rds_conn_message_info_send); + rds_info_register_func(RDS_INFO_RETRANS_MESSAGES, + rds_conn_message_info_retrans); + + return 0; +} + +void rds_conn_exit(void) +{ + rds_loop_exit(); + + WARN_ON(!hlist_empty(rds_conn_hash)); + + kmem_cache_destroy(rds_conn_slab); + + rds_info_deregister_func(RDS_INFO_CONNECTIONS, rds_conn_info); + rds_info_deregister_func(RDS_INFO_SEND_MESSAGES, + rds_conn_message_info_send); + rds_info_deregister_func(RDS_INFO_RETRANS_MESSAGES, + 
rds_conn_message_info_retrans); +} + +/* + * Force a disconnect + */ +void rds_conn_drop(struct rds_connection *conn) +{ + atomic_set(&conn->c_state, RDS_CONN_ERROR); + queue_work(rds_wq, &conn->c_down_w); +} +EXPORT_SYMBOL_GPL(rds_conn_drop); + +/* + * An error occurred on the connection + */ +void +__rds_conn_error(struct rds_connection *conn, const char *fmt, ...) +{ + va_list ap; + + va_start(ap, fmt); + vprintk(fmt, ap); + va_end(ap); + + rds_conn_drop(conn); +} diff --git a/drivers/infiniband/ulp/rds/threads.c b/drivers/infiniband/ulp/rds/threads.c new file mode 100644 index 0000000..0017f90 --- /dev/null +++ b/drivers/infiniband/ulp/rds/threads.c @@ -0,0 +1,273 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include + +#include "rds.h" + +/* + * All of connection management is simplified by serializing it through + * work queues that execute in a connection managing thread. + * + * TCP wants to send acks through sendpage() in response to data_ready(), + * but it needs a process context to do so. + * + * The receive paths need to allocate but can't drop packets (!) so we have + * a thread around to block allocating if the receive fast path sees an + * allocation failure. + */ + +/* Grand Unified Theory of connection life cycle: + * At any point in time, the connection can be in one of these states: + * DOWN, CONNECTING, UP, DISCONNECTING, ERROR + * + * The following transitions are possible: + * ANY -> ERROR + * UP -> DISCONNECTING + * ERROR -> DISCONNECTING + * DISCONNECTING -> DOWN + * DOWN -> CONNECTING + * CONNECTING -> UP + * + * Transition to state DISCONNECTING/DOWN: + * - Inside the shutdown worker; synchronizes with xmit path + * through c_send_lock, and with connection management callbacks + * via c_cm_lock. + * + * For receive callbacks, we rely on the underlying transport + * (TCP, IB/RDMA) to provide the necessary synchronisation. 
+ */ +struct workqueue_struct *rds_wq; +EXPORT_SYMBOL_GPL(rds_wq); + +void rds_connect_complete(struct rds_connection *conn) +{ + if (!rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_UP)) { + printk(KERN_WARNING "%s: Cannot transition to state UP, " + "current state is %d\n", + __func__, + atomic_read(&conn->c_state)); + atomic_set(&conn->c_state, RDS_CONN_ERROR); + queue_work(rds_wq, &conn->c_down_w); + return; + } + + rdstrace(RDS_CONNECTION, RDS_MINIMAL, + "conn %p for %u.%u.%u.%u to %u.%u.%u.%u complete\n", + conn, NIPQUAD(conn->c_laddr),NIPQUAD(conn->c_faddr)); + + conn->c_reconnect_jiffies = 0; + set_bit(0, &conn->c_map_queued); + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); +} +EXPORT_SYMBOL_GPL(rds_connect_complete); + +/* + * This random exponential backoff is relied on to eventually resolve racing + * connects. + * + * If connect attempts race then both parties drop both connections and come + * here to wait for a random amount of time before trying again. Eventually + * the backoff range will be so much greater than the time it takes to + * establish a connection that one of the pair will establish the connection + * before the other's random delay fires. + * + * Connection attempts that arrive while a connection is already established + * are also considered to be racing connects. This lets a connection from + * a rebooted machine replace an existing stale connection before the transport + * notices that the connection has failed. + * + * We should *always* start with a random backoff; otherwise a broken connection + * will always take several iterations to be re-established. 
+ */ +static void rds_queue_reconnect(struct rds_connection *conn) +{ + unsigned long rand; + + rdstrace(RDS_CONNECTION, RDS_LOW, + "conn %p for %u.%u.%u.%u to %u.%u.%u.%u reconnect jiffies %lu\n", + conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr), + conn->c_reconnect_jiffies); + + set_bit(RDS_RECONNECT_PENDING, &conn->c_flags); + if (conn->c_reconnect_jiffies == 0) { + conn->c_reconnect_jiffies = rds_sysctl_reconnect_min_jiffies; + queue_delayed_work(rds_wq, &conn->c_conn_w, 0); + return; + } + + get_random_bytes(&rand, sizeof(rand)); + rdsdebug("%lu delay %lu ceil conn %p for %u.%u.%u.%u -> %u.%u.%u.%u\n", + rand % conn->c_reconnect_jiffies, conn->c_reconnect_jiffies, + conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr)); + queue_delayed_work(rds_wq, &conn->c_conn_w, + rand % conn->c_reconnect_jiffies); + + conn->c_reconnect_jiffies = min(conn->c_reconnect_jiffies * 2, + rds_sysctl_reconnect_max_jiffies); +} + +void rds_connect_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_conn_w.work); + int ret; + + clear_bit(RDS_RECONNECT_PENDING, &conn->c_flags); + if (rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) { + ret = conn->c_trans->conn_connect(conn); + rdsdebug("connect conn %p for %u.%u.%u.%u -> %u.%u.%u.%u " + "ret %d\n", conn, NIPQUAD(conn->c_laddr), + NIPQUAD(conn->c_faddr), ret); + rdstrace(RDS_CONNECTION, RDS_MINIMAL, + "conn %p for %u.%u.%u.%u to %u.%u.%u.%u dispatched, ret %d\n", + conn, NIPQUAD(conn->c_laddr), NIPQUAD(conn->c_faddr), ret); + + if (ret) { + if (rds_conn_transition(conn, RDS_CONN_CONNECTING, RDS_CONN_DOWN)) + rds_queue_reconnect(conn); + else + rds_conn_error(conn, "RDS: connect failed\n"); + } + } +} + +void rds_shutdown_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_down_w); + + /* shut it down unless it's down already */ + if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_DOWN)) { + 
/* + * Quiesce the connection mgmt handlers before we start tearing + * things down. We don't hold the mutex for the entire + * duration of the shutdown operation, else we may be + * deadlocking with the CM handler. Instead, the CM event + * handler is supposed to check for state DISCONNECTING + */ + mutex_lock(&conn->c_cm_lock); + if (!rds_conn_transition(conn, RDS_CONN_UP, RDS_CONN_DISCONNECTING) + && !rds_conn_transition(conn, RDS_CONN_ERROR, RDS_CONN_DISCONNECTING)) { + rds_conn_error(conn, "shutdown called in state %d\n", + atomic_read(&conn->c_state)); + mutex_unlock(&conn->c_cm_lock); + return; + } + mutex_unlock(&conn->c_cm_lock); + + mutex_lock(&conn->c_send_lock); + conn->c_trans->conn_shutdown(conn); + rds_conn_reset(conn); + mutex_unlock(&conn->c_send_lock); + + if (!rds_conn_transition(conn, RDS_CONN_DISCONNECTING, RDS_CONN_DOWN)) { + /* This can happen - eg when we're in the middle of tearing + * down the connection, and someone unloads the rds module. + * Quite reproducible with loopback connections. + * Mostly harmless. + */ + rds_conn_error(conn, + "%s: failed to transition to state DOWN, " + "current state is %d\n", + __func__, + atomic_read(&conn->c_state)); + return; + } + } + + /* Then reconnect if it's still live. + * The passive side of an IB loopback connection is never added + * to the conn hash, so we never trigger a reconnect on this + * conn - the reconnect is always triggered by the active peer.
*/ + cancel_delayed_work(&conn->c_conn_w); + if (!hlist_unhashed(&conn->c_hash_node)) + rds_queue_reconnect(conn); +} + +void rds_send_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_send_w.work); + int ret; + + if (rds_conn_state(conn) == RDS_CONN_UP) { + ret = rds_send_xmit(conn); + rdsdebug("conn %p ret %d\n", conn, ret); + switch (ret) { + case -EAGAIN: + rds_stats_inc(s_send_immediate_retry); + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + break; + case -ENOMEM: + rds_stats_inc(s_send_delayed_retry); + queue_delayed_work(rds_wq, &conn->c_send_w, 2); + default: + break; + } + } +} + +void rds_recv_worker(struct work_struct *work) +{ + struct rds_connection *conn = container_of(work, struct rds_connection, c_recv_w.work); + int ret; + + if (rds_conn_state(conn) == RDS_CONN_UP) { + ret = conn->c_trans->recv(conn); + rdsdebug("conn %p ret %d\n", conn, ret); + switch (ret) { + case -EAGAIN: + rds_stats_inc(s_recv_immediate_retry); + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); + break; + case -ENOMEM: + rds_stats_inc(s_recv_delayed_retry); + queue_delayed_work(rds_wq, &conn->c_recv_w, 2); + default: + break; + } + } +} + +void rds_threads_exit(void) +{ + destroy_workqueue(rds_wq); +} + +int __init rds_threads_init(void) +{ + rds_wq = create_singlethread_workqueue("krdsd"); + if (rds_wq == NULL) + return -ENOMEM; + + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:45 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:45 -0800 Subject: [ofa-general] [PATCH 08/21] RDS: sysctls In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-9-git-send-email-andy.grover@oracle.com> In addition to some tunable parameters, we also make protocol # available here, since until accepted in upstream we do not have a fixed number assigned. 
This can be removed once upstream. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/sysctl.c | 164 +++++++++++++++++++++++++++++++++++ 1 files changed, 164 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/sysctl.c diff --git a/drivers/infiniband/ulp/rds/sysctl.c b/drivers/infiniband/ulp/rds/sysctl.c new file mode 100644 index 0000000..3337a3e --- /dev/null +++ b/drivers/infiniband/ulp/rds/sysctl.c @@ -0,0 +1,164 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" + +static struct ctl_table_header *rds_sysctl_reg_table; + +static unsigned long rds_sysctl_reconnect_min = 1; +static unsigned long rds_sysctl_reconnect_max = ~0UL; + +unsigned long rds_sysctl_reconnect_min_jiffies; +unsigned long rds_sysctl_reconnect_max_jiffies = HZ; + +unsigned int rds_sysctl_max_unacked_packets = 8; +unsigned int rds_sysctl_max_unacked_bytes = (16 << 20); + +unsigned int rds_sysctl_ping_enable = 1; + +unsigned long rds_sysctl_trace_flags = 0; +unsigned int rds_sysctl_trace_level = 0; + +/* + * These can change over time until they're official. Until that time we'll + * give apps a way to figure out what the values are in a given machine. + */ +static int rds_sysctl_pf_rds = PF_RDS; +static int rds_sysctl_sol_rds = SOL_RDS; + +static ctl_table rds_sysctl_rds_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "reconnect_min_delay_ms", + .data = &rds_sysctl_reconnect_min_jiffies, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &rds_sysctl_reconnect_min, + .extra2 = &rds_sysctl_reconnect_max_jiffies, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "reconnect_max_delay_ms", + .data = &rds_sysctl_reconnect_max_jiffies, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &rds_sysctl_reconnect_min_jiffies, + .extra2 = &rds_sysctl_reconnect_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "pf_rds", + .data = &rds_sysctl_pf_rds, + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sol_rds", + .data = &rds_sysctl_sol_rds, + .maxlen = sizeof(int), + .mode = 0444, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unacked_packets", + .data = &rds_sysctl_max_unacked_packets, + .maxlen = sizeof(unsigned long), + .mode = 0644, + 
.proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unacked_bytes", + .data = &rds_sysctl_max_unacked_bytes, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "ping_enable", + .data = &rds_sysctl_ping_enable, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "trace_flags", + .data = &rds_sysctl_trace_flags, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "trace_level", + .data = &rds_sysctl_trace_level, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = 0} +}; + +static struct ctl_path rds_sysctl_path[] = { + { .procname = "net", .ctl_name = CTL_NET, }, + { .procname = "rds", .ctl_name = CTL_UNNUMBERED, }, + { } +}; + + +void rds_sysctl_exit(void) +{ + if (rds_sysctl_reg_table) + unregister_sysctl_table(rds_sysctl_reg_table); +} + +int __init rds_sysctl_init(void) +{ + rds_sysctl_reconnect_min = msecs_to_jiffies(1); + rds_sysctl_reconnect_min_jiffies = rds_sysctl_reconnect_min; + + rds_sysctl_reg_table = register_sysctl_paths(rds_sysctl_path, rds_sysctl_rds_table); + if (rds_sysctl_reg_table == NULL) + return -ENOMEM; + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:47 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:47 -0800 Subject: [ofa-general] [PATCH 10/21] RDS: send.c In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-11-git-send-email-andy.grover@oracle.com> This is the code to send an RDS datagram. 
Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/send.c | 1006 +++++++++++++++++++++++++++++++++++++ 1 files changed, 1006 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/send.c diff --git a/drivers/infiniband/ulp/rds/send.c b/drivers/infiniband/ulp/rds/send.c new file mode 100644 index 0000000..276f7ac --- /dev/null +++ b/drivers/infiniband/ulp/rds/send.c @@ -0,0 +1,1006 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" + +/* When transmitting messages in rds_send_xmit, we need to emerge from + * time to time and briefly release the CPU. Otherwise the softlockup watchdog + * will kick our shin. + * Also, it seems fairer to not let one busy connection stall all the + * others. + * + * send_batch_count is the number of times we'll loop in send_xmit. Setting + * it to 0 will restore the old behavior (where we looped until we had + * drained the queue). + */ +static int send_batch_count = 64; +module_param(send_batch_count, int, 0444); +MODULE_PARM_DESC(send_batch_count, " batch factor when working the send queue"); + +/* + * Reset the send state. Caller must hold c_send_lock when calling here. + */ +void rds_send_reset(struct rds_connection *conn) +{ + struct rds_message *rm, *tmp; + unsigned long flags; + + if (conn->c_xmit_rm) { + /* Tell the user the RDMA op is no longer mapped by the + * transport. This isn't entirely true (it's flushed out + * independently) but as the connection is down, there's + * no ongoing RDMA to/from that memory */ + rds_message_unmapped(conn->c_xmit_rm); + rds_message_put(conn->c_xmit_rm); + conn->c_xmit_rm = NULL; + } + conn->c_xmit_sg = 0; + conn->c_xmit_hdr_off = 0; + conn->c_xmit_data_off = 0; + conn->c_xmit_rdma_sent = 0; + + conn->c_map_queued = 0; + + conn->c_unacked_packets = rds_sysctl_max_unacked_packets; + conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes; + + /* Mark messages as retransmissions, and move them to the send q */ + spin_lock_irqsave(&conn->c_lock, flags); + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + set_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags); + } + list_splice_init(&conn->c_retrans, &conn->c_send_queue); + spin_unlock_irqrestore(&conn->c_lock, flags); +} + +/* + * We're making the conscious trade-off here to only send one message + * down the connection at a time.
+ * Pro: + * - tx queueing is a simple fifo list + * - reassembly is optional and easily done by transports per conn + * - no per flow rx lookup at all, straight to the socket + * - less per-frag memory and wire overhead + * Con: + * - queued acks can be delayed behind large messages + * Depends: + * - small message latency is higher behind queued large messages + * - large message latency isn't starved by intervening small sends + */ +int rds_send_xmit(struct rds_connection *conn) +{ + struct rds_message *rm; + unsigned long flags; + unsigned int tmp; + unsigned int send_quota = send_batch_count; + struct scatterlist *sg; + int ret = 0; + int was_empty = 0; + LIST_HEAD(to_be_dropped); + + /* + * sendmsg calls here after having queued its message on the send + * queue. We only have one task feeding the connection at a time. If + * another thread is already feeding the queue then we back off. This + * avoids blocking the caller and trading per-connection data between + * caches per message. + * + * The sem holder will issue a retry if they notice that someone queued + * a message after they stopped walking the send queue but before they + * dropped the sem. + */ + if (!mutex_trylock(&conn->c_send_lock)) { + rds_stats_inc(s_send_sem_contention); + ret = -ENOMEM; + goto out; + } + + if (conn->c_trans->xmit_prepare) + conn->c_trans->xmit_prepare(conn); + + /* + * spin trying to push headers and data down the connection until + * the connection doesn't make forward progress. + */ + while (--send_quota) { + /* + * See if we need to send a congestion map update if we're + * between sending messages. The send_sem protects our sole + * use of c_map_offset and _bytes. + * Note this is used only by transports that define a special + * xmit_cong_map function. For all others, we allocate + * a cong_map message and treat it just like any other send.
+ */ + if (conn->c_map_bytes) { + ret = conn->c_trans->xmit_cong_map(conn, conn->c_lcong, + conn->c_map_offset); + if (ret <= 0) + break; + + conn->c_map_offset += ret; + conn->c_map_bytes -= ret; + if (conn->c_map_bytes) + continue; + } + + /* If we're done sending the current message, clear the + * offset and S/G temporaries. + */ + rm = conn->c_xmit_rm; + if (rm != NULL && + conn->c_xmit_hdr_off == sizeof(struct rds_header) && + conn->c_xmit_sg == rm->m_nents) { + conn->c_xmit_rm = NULL; + conn->c_xmit_sg = 0; + conn->c_xmit_hdr_off = 0; + conn->c_xmit_data_off = 0; + conn->c_xmit_rdma_sent = 0; + + /* Release the reference to the previous message. */ + rds_message_put(rm); + rm = NULL; + } + + /* If we're asked to send a cong map update, do so. + */ + if (rm == NULL && test_and_clear_bit(0, &conn->c_map_queued)) { + if (conn->c_trans->xmit_cong_map != NULL) { + conn->c_map_offset = 0; + conn->c_map_bytes = sizeof(struct rds_header) + + RDS_CONG_MAP_BYTES; + continue; + } + + rm = rds_cong_update_alloc(conn); + if (IS_ERR(rm)) { + ret = PTR_ERR(rm); + break; + } + + conn->c_xmit_rm = rm; + } + + /* + * Grab the next message from the send queue, if there is one. + * + * c_xmit_rm holds a ref while we're sending this message down + * the connection. We can use this ref while holding the + * send_sem.. rds_send_reset() is serialized with it. + */ + if (rm == NULL) { + unsigned int len; + + spin_lock_irqsave(&conn->c_lock, flags); + + if (!list_empty(&conn->c_send_queue)) { + rm = list_entry(conn->c_send_queue.next, + struct rds_message, + m_conn_item); + rds_message_addref(rm); + + /* + * Move the message from the send queue to the retransmit + * list right away. + */ + list_move_tail(&rm->m_conn_item, &conn->c_retrans); + } + + spin_unlock_irqrestore(&conn->c_lock, flags); + + if (rm == NULL) { + was_empty = 1; + break; + } + + /* Unfortunately, the way Infiniband deals with + * RDMA to a bad MR key is by moving the entire + * queue pair to error state.
We cold possibly + * recover from that, but right now we drop the + * connection. + * Therefore, we never retransmit messages with RDMA ops. + */ + if (rm->m_rdma_op + && test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) { + spin_lock_irqsave(&conn->c_lock, flags); + if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) + list_move(&rm->m_conn_item, &to_be_dropped); + spin_unlock_irqrestore(&conn->c_lock, flags); + rds_message_put(rm); + continue; + } + + /* Require an ACK every once in a while */ + len = ntohl(rm->m_inc.i_hdr.h_len); + if (conn->c_unacked_packets == 0 + || conn->c_unacked_bytes < len) { + __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + + conn->c_unacked_packets = rds_sysctl_max_unacked_packets; + conn->c_unacked_bytes = rds_sysctl_max_unacked_bytes; + rds_stats_inc(s_send_ack_required); + } else { + conn->c_unacked_bytes -= len; + conn->c_unacked_packets--; + } + + conn->c_xmit_rm = rm; + } + + /* + * Try and send an rdma message. Let's see if we can + * keep this simple and require that the transport either + * send the whole rdma or none of it. + */ + if (rm->m_rdma_op && !conn->c_xmit_rdma_sent) { + ret = conn->c_trans->xmit_rdma(conn, rm->m_rdma_op); + if (ret) + break; + conn->c_xmit_rdma_sent = 1; + /* The transport owns the mapped memory for now. 
+ * You can't unmap it while it's on the send queue */ + set_bit(RDS_MSG_MAPPED, &rm->m_flags); + } + + if (conn->c_xmit_hdr_off < sizeof(struct rds_header) || + conn->c_xmit_sg < rm->m_nents) { + ret = conn->c_trans->xmit(conn, rm, + conn->c_xmit_hdr_off, + conn->c_xmit_sg, + conn->c_xmit_data_off); + if (ret <= 0) + break; + + if (conn->c_xmit_hdr_off < sizeof(struct rds_header)) { + tmp = min_t(int, ret, + sizeof(struct rds_header) - + conn->c_xmit_hdr_off); + conn->c_xmit_hdr_off += tmp; + ret -= tmp; + } + + sg = &rm->m_sg[conn->c_xmit_sg]; + while (ret) { + tmp = min_t(int, ret, sg->length - + conn->c_xmit_data_off); + conn->c_xmit_data_off += tmp; + ret -= tmp; + if (conn->c_xmit_data_off == sg->length) { + conn->c_xmit_data_off = 0; + sg++; + conn->c_xmit_sg++; + BUG_ON(ret != 0 && + conn->c_xmit_sg == rm->m_nents); + } + } + } + } + + /* Nuke any messages we decided not to retransmit. */ + if (!list_empty(&to_be_dropped)) + rds_send_remove_from_sock(&to_be_dropped, RDS_RDMA_DROPPED); + + if (conn->c_trans->xmit_complete) + conn->c_trans->xmit_complete(conn); + + /* + * We might be racing with another sender who queued a message but + * backed off on noticing that we held the c_send_lock. If we check + * for queued messages after dropping the sem then either we'll + * see the queued message or the queuer will get the sem. If we + * notice the queued message then we trigger an immediate retry. + * + * We need to be careful only to do this when we stopped processing + * the send queue because it was empty. It's the only way we + * stop processing the loop when the transport hasn't taken + * responsibility for forward progress. + */ + mutex_unlock(&conn->c_send_lock); + + if (conn->c_map_bytes || (send_quota == 0 && !was_empty)) { + /* We exhausted the send quota, but there's work left to + * do. Return and (re-)schedule the send worker. 
+ */ + ret = -EAGAIN; + } + + if (ret == 0 && was_empty) { + /* A simple bit test would be way faster than taking the + * spin lock */ + spin_lock_irqsave(&conn->c_lock, flags); + if (!list_empty(&conn->c_send_queue)) { + rds_stats_inc(s_send_sem_queue_raced); + ret = -EAGAIN; + } + spin_unlock_irqrestore(&conn->c_lock, flags); + } +out: + return ret; +} + +static void rds_send_sndbuf_remove(struct rds_sock *rs, struct rds_message *rm) +{ + u32 len = be32_to_cpu(rm->m_inc.i_hdr.h_len); + + assert_spin_locked(&rs->rs_lock); + + BUG_ON(rs->rs_snd_bytes < len); + rs->rs_snd_bytes -= len; + + if (rs->rs_snd_bytes == 0) + rds_stats_inc(s_send_queue_empty); +} + +static inline int rds_send_is_acked(struct rds_message *rm, u64 ack, + is_acked_func is_acked) +{ + if (is_acked) + return is_acked(rm, ack); + return be64_to_cpu(rm->m_inc.i_hdr.h_sequence) <= ack; +} + +/* + * Returns true if there are no messages on the send and retransmit queues + * which have a sequence number greater than or equal to the given sequence + * number. + */ +int rds_send_acked_before(struct rds_connection *conn, u64 seq) +{ + struct rds_message *rm, *tmp; + int ret = 1; + + spin_lock(&conn->c_lock); + + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq) + ret = 0; + break; + } + + list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) { + if (be64_to_cpu(rm->m_inc.i_hdr.h_sequence) < seq) + ret = 0; + break; + } + + spin_unlock(&conn->c_lock); + + return ret; +} + +/* + * This is pretty similar to what happens below in the ACK + * handling code - except that we call here as soon as we get + * the IB send completion on the RDMA op and the accompanying + * message. 
+ */ +void rds_rdma_send_complete(struct rds_message *rm, int status) +{ + struct rds_sock *rs = NULL; + struct rds_rdma_op *ro; + struct rds_notifier *notifier; + + spin_lock(&rm->m_rs_lock); + + ro = rm->m_rdma_op; + if (test_bit(RDS_MSG_ON_SOCK, &rm->m_flags) + && ro && ro->r_notify + && (notifier = ro->r_notifier) != NULL) { + rs = rm->m_rs; + sock_hold(rds_rs_to_sk(rs)); + + notifier->n_status = status; + spin_lock(&rs->rs_lock); + list_add_tail(&notifier->n_list, &rs->rs_notify_queue); + spin_unlock(&rs->rs_lock); + + ro->r_notifier = NULL; + } + + spin_unlock(&rm->m_rs_lock); + + if (rs) { + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } +} +EXPORT_SYMBOL_GPL(rds_rdma_send_complete); + +/* + * This is the same as rds_rdma_send_complete except we + * don't do any locking - we have all the ingredients (message, + * socket, socket lock) and can just move the notifier. + */ +static inline void +__rds_rdma_send_complete(struct rds_sock *rs, struct rds_message *rm, int status) +{ + struct rds_rdma_op *ro; + + ro = rm->m_rdma_op; + if (ro && ro->r_notify && ro->r_notifier) { + ro->r_notifier->n_status = status; + list_add_tail(&ro->r_notifier->n_list, &rs->rs_notify_queue); + ro->r_notifier = NULL; + } + + /* No need to wake the app - caller does this */ +} + +/* + * This is called from the IB send completion when we detect + * a RDMA operation that failed with remote access error. + * So speed is not an issue here. 
+ */ +struct rds_message *rds_send_get_message(struct rds_connection *conn, + struct rds_rdma_op *op) +{ + struct rds_message *rm, *tmp, *found = NULL; + unsigned long flags; + + spin_lock_irqsave(&conn->c_lock, flags); + + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + if (rm->m_rdma_op == op) { + atomic_inc(&rm->m_refcount); + found = rm; + goto out; + } + } + + list_for_each_entry_safe(rm, tmp, &conn->c_send_queue, m_conn_item) { + if (rm->m_rdma_op == op) { + atomic_inc(&rm->m_refcount); + found = rm; + break; + } + } + +out: + spin_unlock_irqrestore(&conn->c_lock, flags); + + return found; +} +EXPORT_SYMBOL_GPL(rds_send_get_message); + +/* + * This removes messages from the socket's list if they're on it. The list + * argument must be private to the caller, we must be able to modify it + * without locks. The messages must have a reference held for their + * position on the list. This function will drop that reference after + * removing the messages from the 'messages' list regardless of if it found + * the messages on the socket list or not. + */ +void rds_send_remove_from_sock(struct list_head *messages, int status) +{ + unsigned long flags = 0; /* silence gcc :P */ + struct rds_sock *rs = NULL; + struct rds_message *rm; + + local_irq_save(flags); + while (!list_empty(messages)) { + rm = list_entry(messages->next, struct rds_message, + m_conn_item); + list_del_init(&rm->m_conn_item); + + /* + * If we see this flag cleared then we're *sure* that someone + * else beat us to removing it from the sock. If we race + * with their flag update we'll get the lock and then really + * see that the flag has been cleared. + * + * The message spinlock makes sure nobody clears rm->m_rs + * while we're messing with it. It does not prevent the + * message from being removed from the socket, though. 
+ */ + spin_lock(&rm->m_rs_lock); + if (!test_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) + goto unlock_and_drop; + + if (rs != rm->m_rs) { + if (rs) { + spin_unlock(&rs->rs_lock); + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } + rs = rm->m_rs; + spin_lock(&rs->rs_lock); + sock_hold(rds_rs_to_sk(rs)); + } + + if (test_and_clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags)) { + struct rds_rdma_op *ro = rm->m_rdma_op; + struct rds_notifier *notifier; + + list_del_init(&rm->m_sock_item); + rds_send_sndbuf_remove(rs, rm); + + if (ro && + (notifier = ro->r_notifier) != NULL && + (status || ro->r_notify)) { + list_add_tail(&notifier->n_list, + &rs->rs_notify_queue); + if (!notifier->n_status) + notifier->n_status = status; + rm->m_rdma_op->r_notifier = NULL; + } + rds_message_put(rm); + rm->m_rs = NULL; + } + +unlock_and_drop: + spin_unlock(&rm->m_rs_lock); + rds_message_put(rm); + } + + if (rs) { + spin_unlock(&rs->rs_lock); + rds_wake_sk_sleep(rs); + sock_put(rds_rs_to_sk(rs)); + } + local_irq_restore(flags); +} + +/* + * Transports call here when they've determined that the receiver queued + * messages up to, and including, the given sequence number. Messages are + * moved to the retrans queue when rds_send_xmit picks them off the send + * queue. This means that in the TCP case, the message may not have been + * assigned the m_ack_seq yet - but that's fine as long as tcp_is_acked + * checks the RDS_MSG_HAS_ACK_SEQ bit. + * + * XXX It's not clear to me how this is safely serialized with socket + * destruction. Maybe it should bail if it sees SOCK_DEAD. 
+ */ +void rds_send_drop_acked(struct rds_connection *conn, u64 ack, + is_acked_func is_acked) +{ + struct rds_message *rm, *tmp; + unsigned long flags; + LIST_HEAD(list); + + spin_lock_irqsave(&conn->c_lock, flags); + + list_for_each_entry_safe(rm, tmp, &conn->c_retrans, m_conn_item) { + if (!rds_send_is_acked(rm, ack, is_acked)) + break; + + list_move(&rm->m_conn_item, &list); + clear_bit(RDS_MSG_ON_CONN, &rm->m_flags); + } + + /* order flag updates with spin locks */ + if (!list_empty(&list)) + smp_mb__after_clear_bit(); + + spin_unlock_irqrestore(&conn->c_lock, flags); + + /* now remove the messages from the sock list as needed */ + rds_send_remove_from_sock(&list, RDS_RDMA_SUCCESS); +} +EXPORT_SYMBOL_GPL(rds_send_drop_acked); + +void rds_send_drop_to(struct rds_sock *rs, struct sockaddr_in *dest) +{ + struct rds_message *rm, *tmp; + struct rds_connection *conn; + unsigned long flags; + LIST_HEAD(list); + int wake = 0; + + /* get all the messages we're dropping under the rs lock */ + spin_lock_irqsave(&rs->rs_lock, flags); + + list_for_each_entry_safe(rm, tmp, &rs->rs_send_queue, m_sock_item) { + if (dest && (dest->sin_addr.s_addr != rm->m_daddr || + dest->sin_port != rm->m_inc.i_hdr.h_dport)) + continue; + + wake = 1; + list_move(&rm->m_sock_item, &list); + rds_send_sndbuf_remove(rs, rm); + clear_bit(RDS_MSG_ON_SOCK, &rm->m_flags); + + /* If this is a RDMA operation, notify the app. 
*/ + __rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED); + } + + /* order flag updates with the rs lock */ + if (wake) + smp_mb__after_clear_bit(); + + spin_unlock_irqrestore(&rs->rs_lock, flags); + + if (wake) + rds_wake_sk_sleep(rs); + + conn = NULL; + + /* now remove the messages from the conn list as needed */ + list_for_each_entry(rm, &list, m_sock_item) { + /* We do this here rather than in the loop above, so that + * we don't have to nest m_rs_lock under rs->rs_lock */ + spin_lock(&rm->m_rs_lock); + rm->m_rs = NULL; + spin_unlock(&rm->m_rs_lock); + + /* + * If we see this flag cleared then we're *sure* that someone + * else beat us to removing it from the conn. If we race + * with their flag update we'll get the lock and then really + * see that the flag has been cleared. + */ + if (!test_bit(RDS_MSG_ON_CONN, &rm->m_flags)) + continue; + + if (conn != rm->m_inc.i_conn) { + if (conn) + spin_unlock_irqrestore(&conn->c_lock, flags); + conn = rm->m_inc.i_conn; + spin_lock_irqsave(&conn->c_lock, flags); + } + + if (test_and_clear_bit(RDS_MSG_ON_CONN, &rm->m_flags)) { + list_del_init(&rm->m_conn_item); + rds_message_put(rm); + } + } + + if (conn) + spin_unlock_irqrestore(&conn->c_lock, flags); + + while (!list_empty(&list)) { + rm = list_entry(list.next, struct rds_message, m_sock_item); + list_del_init(&rm->m_sock_item); + + rds_message_wait(rm); + rds_message_put(rm); + } +} + +/* + * we only want this to fire once so we use the callers 'queued'. It's + * possible that another thread can race with us and remove the + * message from the flow with RDS_CANCEL_SENT_TO. 
+ */ +static int rds_send_queue_rm(struct rds_sock *rs, struct rds_connection *conn, + struct rds_message *rm, __be16 sport, + __be16 dport, int *queued) +{ + unsigned long flags; + u32 len; + + if (*queued) + goto out; + + len = be32_to_cpu(rm->m_inc.i_hdr.h_len); + + /* this is the only place which holds both the socket's rs_lock + * and the connection's c_lock */ + spin_lock_irqsave(&rs->rs_lock, flags); + + /* + * If there is a little space in sndbuf, we don't queue anything, + * and userspace gets -EAGAIN. But poll() indicates there's send + * room. This can lead to bad behavior (spinning) if snd_bytes isn't + * freed up by incoming acks. So we check the *old* value of + * rs_snd_bytes here to allow the last msg to exceed the buffer, + * and poll() now knows no more data can be sent. + */ + if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) { + rs->rs_snd_bytes += len; + + /* let recv side know we are close to send space exhaustion. + * This is probably not the optimal way to do it, as this + * means we set the flag on *all* messages as soon as our + * throughput hits a certain threshold. 
+ */ + if (rs->rs_snd_bytes >= rds_sk_sndbuf(rs) / 2) + __set_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags); + + list_add_tail(&rm->m_sock_item, &rs->rs_send_queue); + set_bit(RDS_MSG_ON_SOCK, &rm->m_flags); + rds_message_addref(rm); + rm->m_rs = rs; + + /* The code ordering is a little weird, but we're + trying to minimize the time we hold c_lock */ + rds_message_populate_header(&rm->m_inc.i_hdr, sport, dport, 0); + rm->m_inc.i_conn = conn; + rds_message_addref(rm); + + spin_lock(&conn->c_lock); + rm->m_inc.i_hdr.h_sequence = cpu_to_be64(conn->c_next_tx_seq++); + list_add_tail(&rm->m_conn_item, &conn->c_send_queue); + set_bit(RDS_MSG_ON_CONN, &rm->m_flags); + spin_unlock(&conn->c_lock); + + rdsdebug("queued msg %p len %d, rs %p bytes %d seq %llu\n", + rm, len, rs, rs->rs_snd_bytes, + (unsigned long long)be64_to_cpu(rm->m_inc.i_hdr.h_sequence)); + + *queued = 1; + } + + spin_unlock_irqrestore(&rs->rs_lock, flags); +out: + return *queued; +} + +static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm, + struct msghdr *msg, int *allocated_mr) +{ + struct cmsghdr *cmsg; + int ret = 0; + + for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) { + if (!CMSG_OK(msg, cmsg)) + return -EINVAL; + + if (cmsg->cmsg_level != SOL_RDS) + continue; + + /* As a side effect, RDMA_DEST and RDMA_MAP will set + * rm->m_rdma_cookie and rm->m_rdma_mr. 
+ */ + switch (cmsg->cmsg_type) { + case RDS_CMSG_RDMA_ARGS: + ret = rds_cmsg_rdma_args(rs, rm, cmsg); + break; + + case RDS_CMSG_RDMA_DEST: + ret = rds_cmsg_rdma_dest(rs, rm, cmsg); + break; + + case RDS_CMSG_RDMA_MAP: + ret = rds_cmsg_rdma_map(rs, rm, cmsg); + if (!ret) + *allocated_mr = 1; + break; + + default: + return -EINVAL; + } + + if (ret) + break; + } + + return ret; +} + +int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t payload_len) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs = rds_sk_to_rs(sk); + struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name; + __be32 daddr; + __be16 dport; + struct rds_message *rm = NULL; + struct rds_connection *conn; + int ret = 0; + int queued = 0, allocated_mr = 0; + int nonblock = msg->msg_flags & MSG_DONTWAIT; + long timeo = sock_rcvtimeo(sk, nonblock); + + /* Mirror Linux UDP mirror of BSD error message compatibility */ + /* XXX: Perhaps MSG_MORE someday */ + if (msg->msg_flags & ~(MSG_DONTWAIT | MSG_CMSG_COMPAT)) { + printk(KERN_INFO "msg_flags 0x%08X\n", msg->msg_flags); + ret = -EOPNOTSUPP; + goto out; + } + + if (msg->msg_namelen) { + /* XXX fail non-unicast destination IPs? */ + if (msg->msg_namelen < sizeof(*usin) || usin->sin_family != AF_INET) { + ret = -EINVAL; + goto out; + } + daddr = usin->sin_addr.s_addr; + dport = usin->sin_port; + } else { + /* We only care about consistency with ->connect() */ + lock_sock(sk); + daddr = rs->rs_conn_addr; + dport = rs->rs_conn_port; + release_sock(sk); + } + + /* racing with another thread binding seems ok here */ + if (daddr == 0 || rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + rm = rds_message_copy_from_user(msg->msg_iov, payload_len); + if (IS_ERR(rm)) { + ret = PTR_ERR(rm); + rm = NULL; + goto out; + } + + rm->m_daddr = daddr; + + /* Parse any control messages the user may have included. 
*/ + ret = rds_cmsg_send(rs, rm, msg, &allocated_mr); + if (ret) + goto out; + + /* rds_conn_create has a spinlock that runs with IRQ off. + * Caching the conn in the socket helps a lot. */ + if (rs->rs_conn && rs->rs_conn->c_faddr == daddr) + conn = rs->rs_conn; + else { + conn = rds_conn_create_outgoing(rs->rs_bound_addr, daddr, + rs->rs_transport, + sock->sk->sk_allocation); + if (IS_ERR(conn)) { + ret = PTR_ERR(conn); + goto out; + } + rs->rs_conn = conn; + } + + if ((rm->m_rdma_cookie || rm->m_rdma_op) + && conn->c_trans->xmit_rdma == NULL) { + if (printk_ratelimit()) + printk(KERN_NOTICE "rdma_op %p conn xmit_rdma %p\n", + rm->m_rdma_op, conn->c_trans->xmit_rdma); + ret = -EOPNOTSUPP; + goto out; + } + + /* If the connection is down, trigger a connect. We may + * have scheduled a delayed reconnect however - in this case + * we should not interfere. + */ + if (rds_conn_state(conn) == RDS_CONN_DOWN + && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_conn_w, 0); + + ret = rds_cong_wait(conn->c_fcong, dport, nonblock, rs); + if (ret) + goto out; + + while (!rds_send_queue_rm(rs, conn, rm, rs->rs_bound_port, + dport, &queued)) { + rds_stats_inc(s_send_queue_full); + /* XXX make sure this is reasonable */ + if (payload_len > rds_sk_sndbuf(rs)) { + ret = -EMSGSIZE; + goto out; + } + if (nonblock) { + ret = -EAGAIN; + goto out; + } + + timeo = wait_event_interruptible_timeout(*sk->sk_sleep, + rds_send_queue_rm(rs, conn, rm, + rs->rs_bound_port, + dport, + &queued), + timeo); + rdsdebug("sendmsg woke queued %d timeo %ld\n", queued, timeo); + if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT) + continue; + + ret = timeo; + if (ret == 0) + ret = -ETIMEDOUT; + goto out; + } + + /* + * By now we've committed to the send. We reuse rds_send_worker() + * to retry sends in the rds thread if the transport asks us to. 
+ */ + rds_stats_inc(s_send_queued); + + if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + rds_send_worker(&conn->c_send_w.work); + + rds_message_put(rm); + return payload_len; + +out: + /* If the user included a RDMA_MAP cmsg, we allocated a MR on the fly. + * If the sendmsg goes through, we keep the MR. If it fails with EAGAIN + * or in any other way, we need to destroy the MR again */ + if (allocated_mr) + rds_rdma_unuse(rs, rds_rdma_cookie_key(rm->m_rdma_cookie), 1); + + if (rm) + rds_message_put(rm); + return ret; +} + +/* + * Reply to a ping packet. + */ +int +rds_send_pong(struct rds_connection *conn, __be16 dport) +{ + struct rds_message *rm; + unsigned long flags; + int ret = 0; + + rm = rds_message_alloc(0, GFP_ATOMIC); + if (rm == NULL) { + ret = -ENOMEM; + goto out; + } + + rm->m_daddr = conn->c_faddr; + + /* If the connection is down, trigger a connect. We may + * have scheduled a delayed reconnect however - in this case + * we should not interfere. + */ + if (rds_conn_state(conn) == RDS_CONN_DOWN + && !test_and_set_bit(RDS_RECONNECT_PENDING, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_conn_w, 0); + + ret = rds_cong_wait(conn->c_fcong, dport, 1, NULL); + if (ret) + goto out; + + spin_lock_irqsave(&conn->c_lock, flags); + list_add_tail(&rm->m_conn_item, &conn->c_send_queue); + set_bit(RDS_MSG_ON_CONN, &rm->m_flags); + rds_message_addref(rm); + rm->m_inc.i_conn = conn; + + rds_message_populate_header(&rm->m_inc.i_hdr, 0, dport, + conn->c_next_tx_seq); + conn->c_next_tx_seq++; + spin_unlock_irqrestore(&conn->c_lock, flags); + + rds_stats_inc(s_send_queued); + rds_stats_inc(s_send_pong); + + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + rds_message_put(rm); + return 0; + +out: + if (rm) + rds_message_put(rm); + return ret; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:48 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:48 -0800 Subject: [ofa-general] [PATCH 11/21] RDS: recv.c In-Reply-To: 
<1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-12-git-send-email-andy.grover@oracle.com> Upon receiving a datagram from the transport, RDS parses the headers and potentially queues an ACK. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/recv.c | 550 +++++++++++++++++++++++++++++++++++++ 1 files changed, 550 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/recv.c diff --git a/drivers/infiniband/ulp/rds/recv.c b/drivers/infiniband/ulp/rds/recv.c new file mode 100644 index 0000000..691f8cb --- /dev/null +++ b/drivers/infiniband/ulp/rds/recv.c @@ -0,0 +1,550 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include + +#include "rds.h" +#include "rdma.h" + +void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn, + __be32 saddr) +{ + atomic_set(&inc->i_refcount, 1); + INIT_LIST_HEAD(&inc->i_item); + inc->i_conn = conn; + inc->i_saddr = saddr; + inc->i_rdma_cookie = 0; +} +EXPORT_SYMBOL_GPL(rds_inc_init); + +void rds_inc_addref(struct rds_incoming *inc) +{ + rdsdebug("addref inc %p ref %d\n", inc, atomic_read(&inc->i_refcount)); + atomic_inc(&inc->i_refcount); +} +EXPORT_SYMBOL_GPL(rds_inc_addref); + +void rds_inc_put(struct rds_incoming *inc) +{ + rdsdebug("put inc %p ref %d\n", inc, atomic_read(&inc->i_refcount)); + if (atomic_dec_and_test(&inc->i_refcount)) { + BUG_ON(!list_empty(&inc->i_item)); + + inc->i_conn->c_trans->inc_free(inc); + } +} +EXPORT_SYMBOL_GPL(rds_inc_put); + +static void rds_recv_rcvbuf_delta(struct rds_sock *rs, struct sock *sk, + struct rds_cong_map *map, + int delta, __be16 port) +{ + int now_congested; + + if (delta == 0) + return; + + rs->rs_rcv_bytes += delta; + now_congested = rs->rs_rcv_bytes > rds_sk_rcvbuf(rs); + + rdsdebug("rs %p recv bytes %d buf %d now_cong %d\n", + rs, rs->rs_rcv_bytes, rds_sk_rcvbuf(rs), now_congested); + + rdstrace(RDS_CONG, RDS_VERBOSE, + "rs %p (%u.%u.%u.%u:%u) recv bytes %d buf %d " + "now_cong %d delta %d\n", + rs, NIPQUAD(rs->rs_bound_addr), + (int)ntohs(rs->rs_bound_port), rs->rs_rcv_bytes, + rds_sk_rcvbuf(rs), now_congested, delta); + + /* wasn't -> am congested */ + if (!rs->rs_congested && now_congested) { + rs->rs_congested = 1; + rds_cong_set_bit(map, port); + rds_cong_queue_updates(map); + } + /* was -> aren't congested */ + /* Require more free space before reporting uncongested to 
prevent + bouncing cong/uncong state too often */ + else if (rs->rs_congested && (rs->rs_rcv_bytes < (rds_sk_rcvbuf(rs)/2))) { + rs->rs_congested = 0; + rds_cong_clear_bit(map, port); + rds_cong_queue_updates(map); + } + + /* do nothing if no change in cong state */ +} + +/* + * Process all extension headers that come with this message. + */ +static void rds_recv_incoming_exthdrs(struct rds_incoming *inc, struct rds_sock *rs) +{ + struct rds_header *hdr = &inc->i_hdr; + unsigned int pos = 0, type, len; + union { + struct rds_ext_header_version version; + struct rds_ext_header_rdma rdma; + struct rds_ext_header_rdma_dest rdma_dest; + } buffer; + + while (1) { + len = sizeof(buffer); + type = rds_message_next_extension(hdr, &pos, &buffer, &len); + if (type == RDS_EXTHDR_NONE) + break; + /* Process extension header here */ + switch (type) { + case RDS_EXTHDR_RDMA: + rds_rdma_unuse(rs, be32_to_cpu(buffer.rdma.h_rdma_rkey), 0); + break; + + case RDS_EXTHDR_RDMA_DEST: + /* We ignore the size for now. We could stash it + * somewhere and use it for error checking. */ + inc->i_rdma_cookie = rds_rdma_make_cookie( + be32_to_cpu(buffer.rdma_dest.h_rdma_rkey), + be32_to_cpu(buffer.rdma_dest.h_rdma_offset)); + + break; + } + } +} + +/* + * The transport must make sure that this is serialized against other + * rx and conn reset on this specific conn. + * + * We currently assert that only one fragmented message will be sent + * down a connection at a time. This lets us reassemble in the conn + * instead of per-flow which means that we don't have to go digging through + * flows to tear down partial reassembly progress on conn failure and + * we save flow lookup and locking for each frag arrival. It does mean + * that small messages will wait behind large ones. Fragmenting at all + * is only to reduce the memory consumption of pre-posted buffers. + * + * The caller passes in saddr and daddr instead of us getting it from the + * conn. 
This lets loopback, who only has one conn for both directions, + * tell us which roles the addrs in the conn are playing for this message. + */ +void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr, + struct rds_incoming *inc, gfp_t gfp, enum km_type km) +{ + struct rds_sock *rs = NULL; + struct sock *sk; + unsigned long flags; + + inc->i_conn = conn; + inc->i_rx_jiffies = jiffies; + + rdsdebug("conn %p next %llu inc %p seq %llu len %u sport %u dport %u " + "flags 0x%x rx_jiffies %lu\n", conn, + (unsigned long long)conn->c_next_rx_seq, + inc, + (unsigned long long)be64_to_cpu(inc->i_hdr.h_sequence), + be32_to_cpu(inc->i_hdr.h_len), + be16_to_cpu(inc->i_hdr.h_sport), + be16_to_cpu(inc->i_hdr.h_dport), + inc->i_hdr.h_flags, + inc->i_rx_jiffies); + + /* + * Sequence numbers should only increase. Messages get their + * sequence number as they're queued in a sending conn. They + * can be dropped, though, if the sending socket is closed before + * they hit the wire. So sequence numbers can skip forward + * under normal operation. They can also drop back in the conn + * failover case as previously sent messages are resent down the + * new instance of a conn. We drop those, otherwise we have + * to assume that the next valid seq does not come after a + * hole in the fragment stream. + * + * The headers don't give us a way to realize if fragments of + * a message have been dropped. We assume that frags that arrive + * to a flow are part of the current message on the flow that is + * being reassembled. This means that senders can't drop messages + * from the sending conn until all their frags are sent. + * + * XXX we could spend more on the wire to get more robust failure + * detection, arguably worth it to avoid data corruption. 
+ */ + if (be64_to_cpu(inc->i_hdr.h_sequence) < conn->c_next_rx_seq + && (inc->i_hdr.h_flags & RDS_FLAG_RETRANSMITTED)) { + rds_stats_inc(s_recv_drop_old_seq); + goto out; + } + conn->c_next_rx_seq = be64_to_cpu(inc->i_hdr.h_sequence) + 1; + + if (rds_sysctl_ping_enable && inc->i_hdr.h_dport == 0) { + rds_stats_inc(s_recv_ping); + rds_send_pong(conn, inc->i_hdr.h_sport); + goto out; + } + + rs = rds_find_bound(daddr, inc->i_hdr.h_dport); + if (rs == NULL) { + rds_stats_inc(s_recv_drop_no_sock); + goto out; + } + + /* Process extension headers */ + rds_recv_incoming_exthdrs(inc, rs); + + /* We can be racing with rds_release() which marks the socket dead. */ + sk = rds_rs_to_sk(rs); + + /* serialize with rds_release -> sock_orphan */ + write_lock_irqsave(&rs->rs_recv_lock, flags); + if (!sock_flag(sk, SOCK_DEAD)) { + rdsdebug("adding inc %p to rs %p's recv queue\n", inc, rs); + rds_stats_inc(s_recv_queued); + rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, + be32_to_cpu(inc->i_hdr.h_len), + inc->i_hdr.h_dport); + rds_inc_addref(inc); + list_add_tail(&inc->i_item, &rs->rs_recv_queue); + __rds_wake_sk_sleep(sk); + } else { + rds_stats_inc(s_recv_drop_dead_sock); + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); + +out: + if (rs) + rds_sock_put(rs); +} +EXPORT_SYMBOL_GPL(rds_recv_incoming); + +/* + * be very careful here. This is being called as the condition in + * wait_event_*() needs to cope with being called many times. 
+ */ +static int rds_next_incoming(struct rds_sock *rs, struct rds_incoming **inc) +{ + unsigned long flags; + + if (*inc == NULL) { + read_lock_irqsave(&rs->rs_recv_lock, flags); + if (!list_empty(&rs->rs_recv_queue)) { + *inc = list_entry(rs->rs_recv_queue.next, + struct rds_incoming, + i_item); + rds_inc_addref(*inc); + } + read_unlock_irqrestore(&rs->rs_recv_lock, flags); + } + + return *inc != NULL; +} + +static int rds_still_queued(struct rds_sock *rs, struct rds_incoming *inc, + int drop) +{ + struct sock *sk = rds_rs_to_sk(rs); + int ret = 0; + unsigned long flags; + + write_lock_irqsave(&rs->rs_recv_lock, flags); + if (!list_empty(&inc->i_item)) { + ret = 1; + if (drop) { + /* XXX make sure this i_conn is reliable */ + rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, + -be32_to_cpu(inc->i_hdr.h_len), + inc->i_hdr.h_dport); + list_del_init(&inc->i_item); + rds_inc_put(inc); + } + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); + + rdsdebug("inc %p rs %p still %d dropped %d\n", inc, rs, ret, drop); + return ret; +} + +/* + * Pull errors off the error queue. + * If msghdr is NULL, we will just purge the error queue. + */ +int rds_notify_queue_get(struct rds_sock *rs, struct msghdr *msghdr) +{ + struct rds_notifier *notifier; + struct rds_rdma_notify cmsg; + unsigned int count = 0, max_messages = ~0U; + unsigned long flags; + LIST_HEAD(copy); + int err = 0; + + + /* put_cmsg copies to user space and thus may sleep. We can't do this + * with rs_lock held, so first grab as many notifications as we can stuff + * in the user provided cmsg buffer. We don't try to copy more, to avoid + * losing notifications - except when the buffer is so small that it wouldn't + * even hold a single notification. Then we give him as much of this single + * msg as we can squeeze in, and set MSG_CTRUNC. 
+ */ + if (msghdr) { + max_messages = msghdr->msg_controllen / CMSG_SPACE(sizeof(cmsg)); + if (!max_messages) + max_messages = 1; + } + + spin_lock_irqsave(&rs->rs_lock, flags); + while (!list_empty(&rs->rs_notify_queue) && count < max_messages) { + notifier = list_entry(rs->rs_notify_queue.next, + struct rds_notifier, n_list); + list_move(&notifier->n_list, &copy); + count++; + } + spin_unlock_irqrestore(&rs->rs_lock, flags); + + if (!count) + return 0; + + while (!list_empty(&copy)) { + notifier = list_entry(copy.next, struct rds_notifier, n_list); + + if (msghdr) { + cmsg.user_token = notifier->n_user_token; + cmsg.status = notifier->n_status; + + err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_RDMA_STATUS, + sizeof(cmsg), &cmsg); + if (err) + break; + } + + list_del_init(&notifier->n_list); + kfree(notifier); + } + + /* If we bailed out because of an error in put_cmsg, + * we may be left with one or more notifications that we + * didn't process. Return them to the head of the list. */ + if (!list_empty(&copy)) { + spin_lock_irqsave(&rs->rs_lock, flags); + list_splice(&copy, &rs->rs_notify_queue); + spin_unlock_irqrestore(&rs->rs_lock, flags); + } + + return err; +} + +/* + * Queue a congestion notification + */ +static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr) +{ + uint64_t notify = rs->rs_cong_notify; + unsigned long flags; + int err; + + err = put_cmsg(msghdr, SOL_RDS, RDS_CMSG_CONG_UPDATE, + sizeof(notify), &notify); + if (err) + return err; + + spin_lock_irqsave(&rs->rs_lock, flags); + rs->rs_cong_notify &= ~notify; + spin_unlock_irqrestore(&rs->rs_lock, flags); + + return 0; +} + +/* + * Receive any control messages. 
+ */ +static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg) +{ + int ret = 0; + + if (inc->i_rdma_cookie) { + ret = put_cmsg(msg, SOL_RDS, RDS_CMSG_RDMA_DEST, + sizeof(inc->i_rdma_cookie), &inc->i_rdma_cookie); + if (ret) + return ret; + } + + return 0; +} + +int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, + size_t size, int msg_flags) +{ + struct sock *sk = sock->sk; + struct rds_sock *rs = rds_sk_to_rs(sk); + long timeo; + int ret = 0, nonblock = msg_flags & MSG_DONTWAIT; + struct sockaddr_in *sin; + struct rds_incoming *inc = NULL; + + /* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */ + timeo = sock_rcvtimeo(sk, nonblock); + + rdsdebug("size %zu flags 0x%x timeo %ld\n", size, msg_flags, timeo); + + if (msg_flags & MSG_OOB) + goto out; + + /* If there are pending notifications, do those - and nothing else */ + if (!list_empty(&rs->rs_notify_queue)) { + ret = rds_notify_queue_get(rs, msg); + goto out; + } + + if (rs->rs_cong_notify) { + ret = rds_notify_cong(rs, msg); + goto out; + } + + while (1) { + if (!rds_next_incoming(rs, &inc)) { + if (nonblock) { + ret = -EAGAIN; + break; + } + + timeo = wait_event_interruptible_timeout(*sk->sk_sleep, + rds_next_incoming(rs, &inc), + timeo); + rdsdebug("recvmsg woke inc %p timeo %ld\n", inc, + timeo); + if (timeo > 0 || timeo == MAX_SCHEDULE_TIMEOUT) + continue; + + ret = timeo; + if (ret == 0) + ret = -ETIMEDOUT; + break; + } + + rdsdebug("copying inc %p from %u.%u.%u.%u:%u to user\n", inc, + NIPQUAD(inc->i_conn->c_faddr), + ntohs(inc->i_hdr.h_sport)); + ret = inc->i_conn->c_trans->inc_copy_to_user(inc, msg->msg_iov, + size); + if (ret < 0) + break; + + /* + * if the message we just copied isn't at the head of the + * recv queue then someone else raced us to return it, try + * to get the next message. 
+ */ + if (!rds_still_queued(rs, inc, !(msg_flags & MSG_PEEK))) { + rds_inc_put(inc); + inc = NULL; + rds_stats_inc(s_recv_deliver_raced); + continue; + } + + if (ret < be32_to_cpu(inc->i_hdr.h_len)) { + if (msg_flags & MSG_TRUNC) + ret = be32_to_cpu(inc->i_hdr.h_len); + msg->msg_flags |= MSG_TRUNC; + } + + if (rds_cmsg_recv(inc, msg)) { + ret = -EFAULT; + goto out; + } + + rds_stats_inc(s_recv_delivered); + + sin = (struct sockaddr_in *)msg->msg_name; + if (sin) { + sin->sin_family = AF_INET; + sin->sin_port = inc->i_hdr.h_sport; + sin->sin_addr.s_addr = inc->i_saddr; + memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); + } + break; + } + + if (inc) + rds_inc_put(inc); + +out: + return ret; +} + +/* + * The socket is being shut down and we're asked to drop messages that were + * queued for recvmsg. The caller has unbound the socket so the receive path + * won't queue any more incoming fragments or messages on the socket. + */ +void rds_clear_recv_queue(struct rds_sock *rs) +{ + struct sock *sk = rds_rs_to_sk(rs); + struct rds_incoming *inc, *tmp; + unsigned long flags; + + write_lock_irqsave(&rs->rs_recv_lock, flags); + list_for_each_entry_safe(inc, tmp, &rs->rs_recv_queue, i_item) { + rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, + -be32_to_cpu(inc->i_hdr.h_len), + inc->i_hdr.h_dport); + list_del_init(&inc->i_item); + rds_inc_put(inc); + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); +} + +/* + * inc->i_saddr isn't used here because it is only set in the receive + * path. 
+ */ +void rds_inc_info_copy(struct rds_incoming *inc, + struct rds_info_iterator *iter, + __be32 saddr, __be32 daddr, int flip) +{ + struct rds_info_message minfo; + + minfo.seq = be64_to_cpu(inc->i_hdr.h_sequence); + minfo.len = be32_to_cpu(inc->i_hdr.h_len); + + if (flip) { + minfo.laddr = daddr; + minfo.faddr = saddr; + minfo.lport = inc->i_hdr.h_dport; + minfo.fport = inc->i_hdr.h_sport; + } else { + minfo.laddr = saddr; + minfo.faddr = daddr; + minfo.lport = inc->i_hdr.h_sport; + minfo.fport = inc->i_hdr.h_dport; + } + + rds_info_copy(iter, &minfo, sizeof(minfo)); +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:49 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:49 -0800 Subject: [ofa-general] [PATCH 12/21] RDS: RDMA support In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-13-git-send-email-andy.grover@oracle.com> Some transports may support RDMA features. This handles the non-transport-specific parts, like pinning user pages and tracking mapped regions. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/rdma.c | 682 +++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/rdma.h | 84 ++++ drivers/infiniband/ulp/rds/rds_rdma.h | 245 ++++++++++++ 3 files changed, 1011 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/rdma.c create mode 100644 drivers/infiniband/ulp/rds/rdma.h create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h diff --git a/drivers/infiniband/ulp/rds/rdma.c b/drivers/infiniband/ulp/rds/rdma.c new file mode 100644 index 0000000..00e3450 --- /dev/null +++ b/drivers/infiniband/ulp/rds/rdma.c @@ -0,0 +1,682 @@ +/* + * Copyright (c) 2007 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include /* for DMA_*_DEVICE */ + +#include "rdma.h" + +/* + * XXX + * - build with sparse + * - should we limit the size of a mr region? let transport return failure? + * - should we detect duplicate keys on a socket? hmm. + * - an rdma is an mlock, apply rlimit? + */ + +/* + * get the number of pages by looking at the page indices that the start and + * end addresses fall in. + * + * Returns 0 if the vec is invalid. It is invalid if the number of bytes + * causes the address to wrap or overflows an unsigned int. This comes + * from being stored in the 'length' member of 'struct scatterlist'. 
+ */ +static unsigned int rds_pages_in_vec(struct rds_iovec *vec) +{ + if ((vec->addr + vec->bytes <= vec->addr) || + (vec->bytes > (u64)UINT_MAX)) + return 0; + + return ((vec->addr + vec->bytes + PAGE_SIZE - 1) >> PAGE_SHIFT) - + (vec->addr >> PAGE_SHIFT); +} + +static struct rds_mr *rds_mr_tree_walk(struct rb_root *root, u64 key, + struct rds_mr *insert) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct rds_mr *mr; + + while (*p) { + parent = *p; + mr = rb_entry(parent, struct rds_mr, r_rb_node); + + if (key < mr->r_key) + p = &(*p)->rb_left; + else if (key > mr->r_key) + p = &(*p)->rb_right; + else + return mr; + } + + if (insert) { + rb_link_node(&insert->r_rb_node, parent, p); + rb_insert_color(&insert->r_rb_node, root); + atomic_inc(&insert->r_refcount); + } + return NULL; +} + +/* + * Destroy the transport-specific part of a MR. + */ +static void rds_destroy_mr(struct rds_mr *mr) +{ + struct rds_sock *rs = mr->r_sock; + void *trans_private = NULL; + unsigned long flags; + + rdsdebug("RDS: destroy mr key is %x refcnt %u\n", + mr->r_key, atomic_read(&mr->r_refcount)); + + if (test_and_set_bit(RDS_MR_DEAD, &mr->r_state)) + return; + + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + if (!RB_EMPTY_NODE(&mr->r_rb_node)) + rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); + trans_private = mr->r_trans_private; + mr->r_trans_private = NULL; + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + if (trans_private) + mr->r_trans->free_mr(trans_private, mr->r_invalidate); +} + +void __rds_put_mr_final(struct rds_mr *mr) +{ + rds_destroy_mr(mr); + kfree(mr); +} + +/* + * By the time this is called we can't have any more ioctls called on + * the socket so we don't need to worry about racing with others. 
+ */ +void rds_rdma_drop_keys(struct rds_sock *rs) +{ + struct rds_mr *mr; + struct rb_node *node; + + /* Release any MRs associated with this socket */ + while ((node = rb_first(&rs->rs_rdma_keys))) { + mr = container_of(node, struct rds_mr, r_rb_node); + if (mr->r_trans == rs->rs_transport) + mr->r_invalidate = 0; + rds_mr_put(mr); + } + + if (rs->rs_transport && rs->rs_transport->flush_mrs) + rs->rs_transport->flush_mrs(); +} + +/* + * Helper function to pin user pages. + */ +static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages, + struct page **pages, int write) +{ + int ret; + + down_read(&current->mm->mmap_sem); + ret = get_user_pages(current, current->mm, user_addr, + nr_pages, write, 0, pages, NULL); + up_read(&current->mm->mmap_sem); + + if (0 <= ret && (unsigned) ret < nr_pages) { + while (ret--) + put_page(pages[ret]); + ret = -EFAULT; + } + + return ret; +} + +static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, + u64 *cookie_ret, struct rds_mr **mr_ret) +{ + struct rds_mr *mr = NULL, *found; + unsigned int nr_pages; + struct page **pages = NULL; + struct scatterlist *sg; + void *trans_private; + unsigned long flags; + rds_rdma_cookie_t cookie; + unsigned int nents; + long i; + int ret; + + if (rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + if (rs->rs_transport->get_mr == NULL) { + ret = -EOPNOTSUPP; + goto out; + } + + nr_pages = rds_pages_in_vec(&args->vec); + if (nr_pages == 0) { + ret = -EINVAL; + goto out; + } + + rdsdebug("RDS: get_mr addr %llx len %llu nr_pages %u\n", + args->vec.addr, args->vec.bytes, nr_pages); + + /* XXX clamp nr_pages to limit the size of this alloc?
*/ + pages = kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL); + if (pages == NULL) { + ret = -ENOMEM; + goto out; + } + + mr = kzalloc(sizeof(struct rds_mr), GFP_KERNEL); + if (mr == NULL) { + ret = -ENOMEM; + goto out; + } + + atomic_set(&mr->r_refcount, 1); + RB_CLEAR_NODE(&mr->r_rb_node); + mr->r_trans = rs->rs_transport; + mr->r_sock = rs; + + if (args->flags & RDS_RDMA_USE_ONCE) + mr->r_use_once = 1; + if (args->flags & RDS_RDMA_INVALIDATE) + mr->r_invalidate = 1; + if (args->flags & RDS_RDMA_READWRITE) + mr->r_write = 1; + + /* + * Pin the pages that make up the user buffer and transfer the page + * pointers to the mr's sg array. We check to see if we've mapped + * the whole region after transferring the partial page references + * to the sg array so that we can have one page ref cleanup path. + * + * For now we have no flag that tells us whether the mapping is + * r/o or r/w. We need to assume r/w, or we'll do a lot of RDMA to + * the zero page. + */ + ret = rds_pin_pages(args->vec.addr & PAGE_MASK, nr_pages, pages, 1); + if (ret < 0) + goto out; + + nents = ret; + sg = kcalloc(nents, sizeof(*sg), GFP_KERNEL); + if (sg == NULL) { + ret = -ENOMEM; + goto out; + } + + /* Stick all pages into the scatterlist */ + for (i = 0 ; i < nents; i++) { +#ifdef CONFIG_DEBUG_SG + sg[i].sg_magic = SG_MAGIC; +#endif + sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0); + } + + rdsdebug("RDS: trans_private nents is %u\n", nents); + + /* Obtain a transport specific MR. If this succeeds, the + * s/g list is now owned by the MR. + * Note that dma_map() implies that pending writes are + * flushed to RAM, so no dma_sync is needed here. 
*/ + trans_private = rs->rs_transport->get_mr(sg, nents, rs, + &mr->r_key); + + if (IS_ERR(trans_private)) { + for (i = 0 ; i < nents; i++) + put_page(sg_page(&sg[i])); + kfree(sg); + ret = PTR_ERR(trans_private); + goto out; + } + + mr->r_trans_private = trans_private; + + rdsdebug("RDS: get_mr put_user key is %x cookie_addr %p\n", + mr->r_key, (void *)(unsigned long) args->cookie_addr); + + /* The user may pass us an unaligned address, but we can only + * map page aligned regions. So we keep the offset, and build + * a 64bit cookie containing <R_Key, offset> and pass that + * around. */ + cookie = rds_rdma_make_cookie(mr->r_key, args->vec.addr & ~PAGE_MASK); + if (cookie_ret) + *cookie_ret = cookie; + + if (args->cookie_addr && put_user(cookie, (u64 __user *)(unsigned long) args->cookie_addr)) { + ret = -EFAULT; + goto out; + } + + /* Inserting the new MR into the rbtree bumps its + * reference count. */ + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + found = rds_mr_tree_walk(&rs->rs_rdma_keys, mr->r_key, mr); + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + BUG_ON(found && found != mr); + + rdsdebug("RDS: get_mr key is %x\n", mr->r_key); + if (mr_ret) { + atomic_inc(&mr->r_refcount); + *mr_ret = mr; + } + + ret = 0; +out: + kfree(pages); + if (mr) + rds_mr_put(mr); + return ret; +} + +int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen) +{ + struct rds_get_mr_args args; + + if (optlen != sizeof(struct rds_get_mr_args)) + return -EINVAL; + + if (copy_from_user(&args, (struct rds_get_mr_args __user *)optval, + sizeof(struct rds_get_mr_args))) + return -EFAULT; + + return __rds_rdma_map(rs, &args, NULL, NULL); +} + +/* + * Free the MR indicated by the given R_Key + */ +int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen) +{ + struct rds_free_mr_args args; + struct rds_mr *mr; + unsigned long flags; + + if (optlen != sizeof(struct rds_free_mr_args)) + return -EINVAL; + + if (copy_from_user(&args, (struct rds_free_mr_args __user *)optval, +
sizeof(struct rds_free_mr_args))) + return -EFAULT; + + /* Special case - a null cookie means flush all unused MRs */ + if (args.cookie == 0) { + if (!rs->rs_transport || !rs->rs_transport->flush_mrs) + return -EINVAL; + rs->rs_transport->flush_mrs(); + return 0; + } + + /* Look up the MR given its R_key and remove it from the rbtree + * so nobody else finds it. + * This should also prevent races with rds_rdma_unuse. + */ + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + mr = rds_mr_tree_walk(&rs->rs_rdma_keys, rds_rdma_cookie_key(args.cookie), NULL); + if (mr) { + rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); + RB_CLEAR_NODE(&mr->r_rb_node); + if (args.flags & RDS_RDMA_INVALIDATE) + mr->r_invalidate = 1; + } + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + if (!mr) + return -EINVAL; + + /* + * call rds_destroy_mr() ourselves so that we're sure it's done by the time + * we return. If we let rds_mr_put() do it it might not happen until + * someone else drops their ref. + */ + rds_destroy_mr(mr); + rds_mr_put(mr); + return 0; +} + +/* + * This is called when we receive an extension header that + * tells us this MR was used. It allows us to implement + * use_once semantics + */ +void rds_rdma_unuse(struct rds_sock *rs, u32 r_key, int force) +{ + struct rds_mr *mr; + unsigned long flags; + int zot_me = 0; + + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL); + if (mr && (mr->r_use_once || force)) { + rb_erase(&mr->r_rb_node, &rs->rs_rdma_keys); + RB_CLEAR_NODE(&mr->r_rb_node); + zot_me = 1; + } else if (mr) + atomic_inc(&mr->r_refcount); + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + /* May have to issue a dma_sync on this memory region. + * Note we could avoid this if the operation was a RDMA READ, + * but at this point we can't tell. 
*/ + if (mr != NULL) { + if (mr->r_trans->sync_mr) + mr->r_trans->sync_mr(mr->r_trans_private, DMA_FROM_DEVICE); + + /* If the MR was marked as invalidate, this will + * trigger an async flush. */ + if (zot_me) + rds_destroy_mr(mr); + rds_mr_put(mr); + } +} + +void rds_rdma_free_op(struct rds_rdma_op *ro) +{ + unsigned int i; + + for (i = 0; i < ro->r_nents; i++) { + struct page *page = sg_page(&ro->r_sg[i]); + + /* Mark page dirty if it was possibly modified, which + * is the case for a RDMA_READ which copies from remote + * to local memory */ + if (!ro->r_write) + set_page_dirty(page); + put_page(page); + } + + kfree(ro->r_notifier); + kfree(ro); +} + +/* + * args is a pointer to an in-kernel copy in the sendmsg cmsg. + */ +static struct rds_rdma_op *rds_rdma_prepare(struct rds_sock *rs, + struct rds_rdma_args *args) +{ + struct rds_iovec vec; + struct rds_rdma_op *op = NULL; + unsigned int nr_pages; + unsigned int max_pages; + unsigned int nr_bytes; + struct page **pages = NULL; + struct rds_iovec __user *local_vec; + struct scatterlist *sg; + unsigned int nr; + unsigned int i, j; + int ret; + + + if (rs->rs_bound_addr == 0) { + ret = -ENOTCONN; /* XXX not a great errno */ + goto out; + } + + if (args->nr_local > (u64)UINT_MAX) { + ret = -EMSGSIZE; + goto out; + } + + nr_pages = 0; + max_pages = 0; + + local_vec = (struct rds_iovec __user *)(unsigned long) args->local_vec_addr; + + /* figure out the number of pages in the vector */ + for (i = 0; i < args->nr_local; i++) { + if (copy_from_user(&vec, &local_vec[i], + sizeof(struct rds_iovec))) { + ret = -EFAULT; + goto out; + } + + nr = rds_pages_in_vec(&vec); + if (nr == 0) { + ret = -EINVAL; + goto out; + } + + max_pages = max(nr, max_pages); + nr_pages += nr; + } + + pages = kcalloc(max_pages, sizeof(struct page *), GFP_KERNEL); + if (pages == NULL) { + ret = -ENOMEM; + goto out; + } + + op = kzalloc(offsetof(struct rds_rdma_op, r_sg[nr_pages]), GFP_KERNEL); + if (op == NULL) { + ret = -ENOMEM; + goto out; + } 
+ + op->r_write = !!(args->flags & RDS_RDMA_READWRITE); + op->r_fence = !!(args->flags & RDS_RDMA_FENCE); + op->r_notify = !!(args->flags & RDS_RDMA_NOTIFY_ME); + op->r_recverr = rs->rs_recverr; + + if (op->r_notify || op->r_recverr) { + /* We allocate an uninitialized notifier here, because + * we don't want to do that in the completion handler. We + * would have to use GFP_ATOMIC there, and don't want to deal + * with failed allocations. + */ + op->r_notifier = kmalloc(sizeof(struct rds_notifier), GFP_KERNEL); + if (!op->r_notifier) { + ret = -ENOMEM; + goto out; + } + op->r_notifier->n_user_token = args->user_token; + op->r_notifier->n_status = RDS_RDMA_SUCCESS; + } + + /* The cookie contains the R_Key of the remote memory region, and + * optionally an offset into it. This is how we implement RDMA into + * unaligned memory. + * When setting up the RDMA, we need to add that offset to the + * destination address (which is really an offset into the MR) + * FIXME: We may want to move this into ib_rdma.c + */ + op->r_key = rds_rdma_cookie_key(args->cookie); + op->r_remote_addr = args->remote_vec.addr + rds_rdma_cookie_offset(args->cookie); + + nr_bytes = 0; + + rdsdebug("RDS: rdma prepare nr_local %llu rva %llx rkey %x\n", + (unsigned long long)args->nr_local, + (unsigned long long)args->remote_vec.addr, + op->r_key); + + for (i = 0; i < args->nr_local; i++) { + if (copy_from_user(&vec, &local_vec[i], + sizeof(struct rds_iovec))) { + ret = -EFAULT; + goto out; + } + + nr = rds_pages_in_vec(&vec); + if (nr == 0) { + ret = -EINVAL; + goto out; + } + + rs->rs_user_addr = vec.addr; + rs->rs_user_bytes = vec.bytes; + + /* did the user change the vec under us? */ + if (nr > max_pages || op->r_nents + nr > nr_pages) { + ret = -EINVAL; + goto out; + } + /* If it's a WRITE operation, we want to pin the pages for reading. + * If it's a READ operation, we need to pin the pages for writing. 
+ */ + ret = rds_pin_pages(vec.addr & PAGE_MASK, nr, pages, !op->r_write); + if (ret < 0) + goto out; + + rdsdebug("RDS: nr_bytes %u nr %u vec.bytes %llu vec.addr %llx\n", + nr_bytes, nr, vec.bytes, vec.addr); + + nr_bytes += vec.bytes; + + for (j = 0; j < nr; j++) { + unsigned int offset = vec.addr & ~PAGE_MASK; + + sg = &op->r_sg[op->r_nents + j]; +#ifdef CONFIG_DEBUG_SG + sg->sg_magic = SG_MAGIC; +#endif + sg_set_page(sg, pages[j], + min_t(unsigned int, vec.bytes, PAGE_SIZE - offset), + offset); + + rdsdebug("RDS: sg->offset %x sg->len %x vec.addr %llx vec.bytes %llu\n", + sg->offset, sg->length, vec.addr, vec.bytes); + + vec.addr += sg->length; + vec.bytes -= sg->length; + } + + op->r_nents += nr; + } + + + if (nr_bytes > args->remote_vec.bytes) { + rdsdebug("RDS nr_bytes %u remote_bytes %u do not match\n", + nr_bytes, + (unsigned int) args->remote_vec.bytes); + ret = -EINVAL; + goto out; + } + op->r_bytes = nr_bytes; + + ret = 0; +out: + kfree(pages); + if (ret) { + if (op) + rds_rdma_free_op(op); + op = ERR_PTR(ret); + } + return op; +} + +/* + * The application asks for a RDMA transfer. 
+ * Extract all arguments and set up the rdma_op + */ +int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg) +{ + struct rds_rdma_op *op; + + if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_rdma_args)) + || rm->m_rdma_op != NULL) + return -EINVAL; + + op = rds_rdma_prepare(rs, CMSG_DATA(cmsg)); + if (IS_ERR(op)) + return PTR_ERR(op); + rds_stats_inc(s_send_rdma); + rm->m_rdma_op = op; + return 0; +} + +/* + * The application wants us to pass an RDMA destination (aka MR) + * to the remote + */ +int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg) +{ + unsigned long flags; + struct rds_mr *mr; + u32 r_key; + int err = 0; + + if (cmsg->cmsg_len < CMSG_LEN(sizeof(rds_rdma_cookie_t)) + || rm->m_rdma_cookie != 0) + return -EINVAL; + + memcpy(&rm->m_rdma_cookie, CMSG_DATA(cmsg), sizeof(rm->m_rdma_cookie)); + + /* We are reusing a previously mapped MR here. Most likely, the + * application has written to the buffer, so we need to explicitly + * flush those writes to RAM. Otherwise the HCA may not see them + * when doing a DMA from that buffer. + */ + r_key = rds_rdma_cookie_key(rm->m_rdma_cookie); + + spin_lock_irqsave(&rs->rs_rdma_lock, flags); + mr = rds_mr_tree_walk(&rs->rs_rdma_keys, r_key, NULL); + if (mr == NULL) + err = -EINVAL; /* invalid r_key */ + else + atomic_inc(&mr->r_refcount); + spin_unlock_irqrestore(&rs->rs_rdma_lock, flags); + + if (mr) { + mr->r_trans->sync_mr(mr->r_trans_private, DMA_TO_DEVICE); + rm->m_rdma_mr = mr; + } + return err; +} + +/* + * The application passes us an address range it wants to enable RDMA + * to/from. We map the area, and save the <R_Key, offset> pair + * in rm->m_rdma_cookie. This causes it to be sent along to the peer + * in an extension header.
+ */ +int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg) +{ + if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct rds_get_mr_args)) + || rm->m_rdma_cookie != 0) + return -EINVAL; + + return __rds_rdma_map(rs, CMSG_DATA(cmsg), &rm->m_rdma_cookie, &rm->m_rdma_mr); +} diff --git a/drivers/infiniband/ulp/rds/rdma.h b/drivers/infiniband/ulp/rds/rdma.h new file mode 100644 index 0000000..4255120 --- /dev/null +++ b/drivers/infiniband/ulp/rds/rdma.h @@ -0,0 +1,84 @@ +#ifndef _RDS_RDMA_H +#define _RDS_RDMA_H + +#include +#include +#include + +#include "rds.h" + +struct rds_mr { + struct rb_node r_rb_node; + atomic_t r_refcount; + u32 r_key; + + /* A copy of the creation flags */ + unsigned int r_use_once:1; + unsigned int r_invalidate:1; + unsigned int r_write:1; + + /* This is for RDS_MR_DEAD. + * It would be nice & consistent to make this part of the above + * bit field here, but we need to use test_and_set_bit. + */ + unsigned long r_state; + struct rds_sock *r_sock; /* back pointer to the socket that owns us */ + struct rds_transport *r_trans; + void *r_trans_private; +}; + +/* Flags for mr->r_state */ +#define RDS_MR_DEAD 0 + +struct rds_rdma_op { + u32 r_key; + u64 r_remote_addr; + unsigned int r_write:1; + unsigned int r_fence:1; + unsigned int r_notify:1; + unsigned int r_recverr:1; + unsigned int r_mapped:1; + struct rds_notifier *r_notifier; + unsigned int r_bytes; + unsigned int r_nents; + unsigned int r_count; + struct scatterlist r_sg[0]; +}; + +static inline rds_rdma_cookie_t rds_rdma_make_cookie(u32 r_key, u32 offset) +{ + return r_key | (((u64) offset) << 32); +} + +static inline u32 rds_rdma_cookie_key(rds_rdma_cookie_t cookie) +{ + return cookie; +} + +static inline u32 rds_rdma_cookie_offset(rds_rdma_cookie_t cookie) +{ + return cookie >> 32; +} + +int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen); +int rds_free_mr(struct rds_sock *rs, char __user *optval, int optlen); +void rds_rdma_drop_keys(struct 
rds_sock *rs); +int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +int rds_cmsg_rdma_args(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, + struct cmsghdr *cmsg); +void rds_rdma_free_op(struct rds_rdma_op *ro); +void rds_rdma_send_complete(struct rds_message *rm, int); + +extern void __rds_put_mr_final(struct rds_mr *mr); +static inline void rds_mr_put(struct rds_mr *mr) +{ + if (atomic_dec_and_test(&mr->r_refcount)) + __rds_put_mr_final(mr); +} + +#endif diff --git a/drivers/infiniband/ulp/rds/rds_rdma.h b/drivers/infiniband/ulp/rds/rds_rdma.h new file mode 100644 index 0000000..565482d --- /dev/null +++ b/drivers/infiniband/ulp/rds/rds_rdma.h @@ -0,0 +1,245 @@ +/* + * Copyright (c) 2008 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef IB_RDS_H +#define IB_RDS_H + +#include + +/* These sparse annotated types shouldn't be in any user + * visible header file. We should clean this up rather + * than kludging around them. */ +#ifndef __KERNEL__ +#define __be16 u_int16_t +#define __be32 u_int32_t +#define __be64 u_int64_t +#endif + +#define RDS_IB_ABI_VERSION 0x301 + +/* + * setsockopt/getsockopt for SOL_RDS + */ +#define RDS_CANCEL_SENT_TO 1 +#define RDS_GET_MR 2 +#define RDS_FREE_MR 3 +/* deprecated: RDS_BARRIER 4 */ +#define RDS_RECVERR 5 +#define RDS_CONG_MONITOR 6 + +/* + * Control message types for SOL_RDS. + * + * CMSG_RDMA_ARGS (sendmsg) + * Request a RDMA transfer to/from the specified + * memory ranges. + * The cmsg_data is a struct rds_rdma_args. + * RDS_CMSG_RDMA_DEST (recvmsg, sendmsg) + * Kernel informs application about intended + * source/destination of a RDMA transfer + * RDS_CMSG_RDMA_MAP (sendmsg) + * Application asks kernel to map the given + * memory range into a IB MR, and send the + * R_Key along in an RDS extension header. + * The cmsg_data is a struct rds_get_mr_args, + * the same as for the GET_MR setsockopt. + * RDS_CMSG_RDMA_STATUS (recvmsg) + * Returns the status of a completed RDMA operation. 
+ */ +#define RDS_CMSG_RDMA_ARGS 1 +#define RDS_CMSG_RDMA_DEST 2 +#define RDS_CMSG_RDMA_MAP 3 +#define RDS_CMSG_RDMA_STATUS 4 +#define RDS_CMSG_CONG_UPDATE 5 + +#define RDS_INFO_COUNTERS 10000 +#define RDS_INFO_CONNECTIONS 10001 +/* 10002 aka RDS_INFO_FLOWS is deprecated */ +#define RDS_INFO_SEND_MESSAGES 10003 +#define RDS_INFO_RETRANS_MESSAGES 10004 +#define RDS_INFO_RECV_MESSAGES 10005 +#define RDS_INFO_SOCKETS 10006 +#define RDS_INFO_TCP_SOCKETS 10007 +#define RDS_INFO_IB_CONNECTIONS 10008 +#define RDS_INFO_CONNECTION_STATS 10009 + +struct rds_info_counter { + u_int8_t name[32]; + u_int64_t value; +} __attribute__((packed)); + +#define RDS_INFO_CONNECTION_FLAG_SENDING 0x01 +#define RDS_INFO_CONNECTION_FLAG_CONNECTING 0x02 +#define RDS_INFO_CONNECTION_FLAG_CONNECTED 0x04 + +struct rds_info_connection { + u_int64_t next_tx_seq; + u_int64_t next_rx_seq; + __be32 laddr; + __be32 faddr; + u_int8_t transport[15]; /* null term ascii */ + u_int8_t flags; +} __attribute__((packed)); + +struct rds_info_flow { + __be32 laddr; + __be32 faddr; + u_int32_t bytes; + __be16 lport; + __be16 fport; +} __attribute__((packed)); + +#define RDS_INFO_MESSAGE_FLAG_ACK 0x01 +#define RDS_INFO_MESSAGE_FLAG_FAST_ACK 0x02 + +struct rds_info_message { + u_int64_t seq; + u_int32_t len; + __be32 laddr; + __be32 faddr; + __be16 lport; + __be16 fport; + u_int8_t flags; +} __attribute__((packed)); + +struct rds_info_socket { + u_int32_t sndbuf; + __be32 bound_addr; + __be32 connected_addr; + __be16 bound_port; + __be16 connected_port; + u_int32_t rcvbuf; + u_int64_t inum; +} __attribute__((packed)); + +#define RDS_IB_GID_LEN 16 +struct rds_info_rdma_connection { + __be32 src_addr; + __be32 dst_addr; + uint8_t src_gid[RDS_IB_GID_LEN]; + uint8_t dst_gid[RDS_IB_GID_LEN]; + + uint32_t max_send_wr; + uint32_t max_recv_wr; + uint32_t max_send_sge; + uint32_t rdma_mr_max; + uint32_t rdma_mr_size; +}; + +/* + * Congestion monitoring. 
+ * Congestion control in RDS happens at the host connection + * level by exchanging a bitmap marking congested ports. + * By default, a process sleeping in poll() is always woken + * up when the congestion map is updated. + * With explicit monitoring, an application can have more + * fine-grained control. + * The application installs a 64bit mask value in the socket, + * where each bit corresponds to a group of ports. + * When a congestion update arrives, RDS checks the set of + * ports that are now uncongested against the list bit mask + * installed in the socket, and if they overlap, we queue a + * cong_notification on the socket. + * + * To install the congestion monitor bitmask, use RDS_CONG_MONITOR + * with the 64bit mask. + * Congestion updates are received via RDS_CMSG_CONG_UPDATE + * control messages. + * + * The correspondence between bits and ports is + * 1 << (portnum % 64) + */ +#define RDS_CONG_MONITOR_SIZE 64 +#define RDS_CONG_MONITOR_BIT(port) (((unsigned int) port) % RDS_CONG_MONITOR_SIZE) +#define RDS_CONG_MONITOR_MASK(port) (1ULL << RDS_CONG_MONITOR_BIT(port)) + +/* + * RDMA related types + */ + +/* + * This encapsulates a remote memory location. + * In the current implementation, it contains the R_Key + * of the remote memory region, and the offset into it + * (so that the application does not have to worry about + * alignment). 
+ */ +typedef u_int64_t rds_rdma_cookie_t; + +struct rds_iovec { + u_int64_t addr; + u_int64_t bytes; +}; + +struct rds_get_mr_args { + struct rds_iovec vec; + u_int64_t cookie_addr; + uint64_t flags; +}; + +struct rds_free_mr_args { + rds_rdma_cookie_t cookie; + u_int64_t flags; +}; + +struct rds_rdma_args { + rds_rdma_cookie_t cookie; + struct rds_iovec remote_vec; + u_int64_t local_vec_addr; + u_int64_t nr_local; + u_int64_t flags; + u_int64_t user_token; +}; + +struct rds_rdma_notify { + u_int64_t user_token; + int32_t status; +}; + +#define RDS_RDMA_SUCCESS 0 +#define RDS_RDMA_REMOTE_ERROR 1 +#define RDS_RDMA_CANCELED 2 +#define RDS_RDMA_DROPPED 3 +#define RDS_RDMA_OTHER_ERROR 4 + +/* + * Common set of flags for all RDMA related structs + */ +#define RDS_RDMA_READWRITE 0x0001 +#define RDS_RDMA_FENCE 0x0002 /* use FENCE for immediate send */ +#define RDS_RDMA_INVALIDATE 0x0004 /* invalidate R_Key after freeing MR */ +#define RDS_RDMA_USE_ONCE 0x0008 /* free MR after use */ +#define RDS_RDMA_DONTWAIT 0x0010 /* Don't wait in SET_BARRIER */ +#define RDS_RDMA_NOTIFY_ME 0x0020 /* Notify when operation completes */ + +#endif /* IB_RDS_H */ -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:50 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:50 -0800 Subject: [ofa-general] [PATCH 13/21] RDS/IB: Infiniband transport In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-14-git-send-email-andy.grover@oracle.com> Registers as an RDS transport and an IB client, and uses IB CM API to allocate ids, queue pairs, and the rest of that fun stuff. 
Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/ib.c | 312 +++++++++++++ drivers/infiniband/ulp/rds/ib.h | 358 +++++++++++++++ drivers/infiniband/ulp/rds/ib_cm.c | 882 ++++++++++++++++++++++++++++++++++++ 3 files changed, 1552 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib.c create mode 100644 drivers/infiniband/ulp/rds/ib.h create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c diff --git a/drivers/infiniband/ulp/rds/ib.c b/drivers/infiniband/ulp/rds/ib.c new file mode 100644 index 0000000..cd35fba --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib.c @@ -0,0 +1,312 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE; +unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */ + +module_param(fmr_pool_size, int, 0444); +MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA"); +module_param(fmr_message_size, int, 0444); +MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer"); + +struct list_head rds_ib_devices; + +DEFINE_SPINLOCK(ib_nodev_conns_lock); +LIST_HEAD(ib_nodev_conns); + +void rds_ib_add_one(struct ib_device *device) +{ + struct rds_ib_device *rds_ibdev; + struct ib_device_attr *dev_attr; + + /* Only handle IB (no iWARP) devices */ + if (device->node_type != RDMA_NODE_IB_CA) + return; + + dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL); + if (!dev_attr) + return; + + if (ib_query_device(device, dev_attr)) { + rdsdebug("Query device failed for %s\n", device->name); + goto free_attr; + } + + rds_ibdev = kmalloc(sizeof *rds_ibdev, GFP_KERNEL); + if (!rds_ibdev) + goto free_attr; + + spin_lock_init(&rds_ibdev->spinlock); + + rds_ibdev->max_wrs = dev_attr->max_qp_wr; + rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE); + + rds_ibdev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1); + rds_ibdev->fmr_page_size = 1 << rds_ibdev->fmr_page_shift; + rds_ibdev->fmr_page_mask = ~((u64) rds_ibdev->fmr_page_size - 1); + rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32; + rds_ibdev->max_fmrs = dev_attr->max_fmr ? 
+ min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) : + fmr_pool_size; + + rds_ibdev->dev = device; + rds_ibdev->pd = ib_alloc_pd(device); + if (IS_ERR(rds_ibdev->pd)) + goto free_dev; + + rds_ibdev->mr = ib_get_dma_mr(rds_ibdev->pd, + IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(rds_ibdev->mr)) + goto err_pd; + + rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev); + if (IS_ERR(rds_ibdev->mr_pool)) { + rds_ibdev->mr_pool = NULL; + goto err_mr; + } + + INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); + INIT_LIST_HEAD(&rds_ibdev->conn_list); + list_add_tail(&rds_ibdev->list, &rds_ib_devices); + + ib_set_client_data(device, &rds_ib_client, rds_ibdev); + + goto free_attr; + +err_mr: + ib_dereg_mr(rds_ibdev->mr); +err_pd: + ib_dealloc_pd(rds_ibdev->pd); +free_dev: + kfree(rds_ibdev); +free_attr: + kfree(dev_attr); +} + +void rds_ib_remove_one(struct ib_device *device) +{ + struct rds_ib_device *rds_ibdev; + struct rds_ib_ipaddr *i_ipaddr, *i_next; + + rds_ibdev = ib_get_client_data(device, &rds_ib_client); + if (!rds_ibdev) + return; + + list_for_each_entry_safe(i_ipaddr, i_next, &rds_ibdev->ipaddr_list, list) { + list_del(&i_ipaddr->list); + kfree(i_ipaddr); + } + + rds_ib_remove_conns(rds_ibdev); + + if (rds_ibdev->mr_pool) + rds_ib_destroy_mr_pool(rds_ibdev->mr_pool); + + ib_dereg_mr(rds_ibdev->mr); + + while (ib_dealloc_pd(rds_ibdev->pd)) { + rdsdebug("%s-%d Failed to dealloc pd %p\n", __func__, __LINE__, rds_ibdev->pd); + msleep(1); + } + + list_del(&rds_ibdev->list); + kfree(rds_ibdev); +} + +struct ib_client rds_ib_client = { + .name = "rds_ib", + .add = rds_ib_add_one, + .remove = rds_ib_remove_one +}; + +static int rds_ib_conn_info_visitor(struct rds_connection *conn, + void *buffer) +{ + struct rds_info_rdma_connection *iinfo = buffer; + struct rds_ib_connection *ic; + + /* We will only ever look at IB transports */ + if (conn->c_trans != &rds_ib_transport) + return 0; + + iinfo->src_addr = conn->c_laddr; + iinfo->dst_addr = conn->c_faddr; + + memset(&iinfo->src_gid, 
0, sizeof(iinfo->src_gid)); + memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid)); + if (rds_conn_state(conn) == RDS_CONN_UP) { + struct rds_ib_device *rds_ibdev; + struct rdma_dev_addr *dev_addr; + + ic = conn->c_transport_data; + dev_addr = &ic->i_cm_id->route.addr.dev_addr; + + ib_addr_get_sgid(dev_addr, (union ib_gid *) &iinfo->src_gid); + ib_addr_get_dgid(dev_addr, (union ib_gid *) &iinfo->dst_gid); + + rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); + iinfo->max_send_wr = ic->i_send_ring.w_nr; + iinfo->max_recv_wr = ic->i_recv_ring.w_nr; + iinfo->max_send_sge = rds_ibdev->max_sge; + rds_ib_get_mr_info(rds_ibdev, iinfo); + } + return 1; +} + +static void rds_ib_ic_info(struct socket *sock, unsigned int len, + struct rds_info_iterator *iter, + struct rds_info_lengths *lens) +{ + rds_for_each_conn_info(sock, len, iter, lens, + rds_ib_conn_info_visitor, + sizeof(struct rds_info_rdma_connection)); +} + + +/* + * Early RDS/IB was built to only bind to an address if there is an IPoIB + * device with that address set. + * + * If it were me, I'd advocate for something more flexible. Sending and + * receiving should be device-agnostic. Transports would try and maintain + * connections between peers who have messages queued. Userspace would be + * allowed to influence which paths have priority. We could call userspace + * asserting this policy "routing". 
+ */ +static int rds_ib_laddr_check(__be32 addr) +{ + struct net_device *dev; + int ret; + + dev = ip_dev_find(&init_net, addr); + if (dev && dev->type == ARPHRD_INFINIBAND) { + dev_put(dev); + ret = 0; + } else + ret = -EADDRNOTAVAIL; + + rdsdebug("addr %u.%u.%u.%u ret %d\n", NIPQUAD(addr), ret); + + return ret; +} + +void rds_ib_exit(void) +{ + rds_info_deregister_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info); + rds_ib_listen_stop(); + rds_ib_remove_nodev_conns(); + ib_unregister_client(&rds_ib_client); + rds_ib_sysctl_exit(); + rds_ib_recv_exit(); + rds_trans_unregister(&rds_ib_transport); +} + +struct rds_transport rds_ib_transport = { + .laddr_check = rds_ib_laddr_check, + .xmit_complete = rds_ib_xmit_complete, + .xmit = rds_ib_xmit, + .xmit_cong_map = NULL, + .xmit_rdma = rds_ib_xmit_rdma, + .recv = rds_ib_recv, + .conn_alloc = rds_ib_conn_alloc, + .conn_free = rds_ib_conn_free, + .conn_connect = rds_ib_conn_connect, + .conn_shutdown = rds_ib_conn_shutdown, + .inc_copy_to_user = rds_ib_inc_copy_to_user, + .inc_purge = rds_ib_inc_purge, + .inc_free = rds_ib_inc_free, + .listen_stop = rds_ib_listen_stop, + .stats_info_copy = rds_ib_stats_info_copy, + .exit = rds_ib_exit, + .get_mr = rds_ib_get_mr, + .sync_mr = rds_ib_sync_mr, + .free_mr = rds_ib_free_mr, + .flush_mrs = rds_ib_flush_mrs, + .t_owner = THIS_MODULE, + .t_name = "infiniband", +}; + +int __init rds_ib_init(void) +{ + int ret; + + INIT_LIST_HEAD(&rds_ib_devices); + + ret = ib_register_client(&rds_ib_client); + if (ret) + goto out; + + ret = rds_ib_sysctl_init(); + if (ret) + goto out_ibreg; + + ret = rds_ib_recv_init(); + if (ret) + goto out_sysctl; + + ret = rds_trans_register(&rds_ib_transport); + if (ret) + goto out_recv; + + ret = rds_ib_listen_init(); + if (ret) + goto out_register; + + rds_info_register_func(RDS_INFO_IB_CONNECTIONS, rds_ib_ic_info); + + goto out; + +out_register: + rds_trans_unregister(&rds_ib_transport); +out_recv: + rds_ib_recv_exit(); +out_sysctl: + rds_ib_sysctl_exit(); 
+out_ibreg: + ib_unregister_client(&rds_ib_client); +out: + return ret; +} + +MODULE_LICENSE("GPL"); + diff --git a/drivers/infiniband/ulp/rds/ib.h b/drivers/infiniband/ulp/rds/ib.h new file mode 100644 index 0000000..eff70f0 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib.h @@ -0,0 +1,358 @@ +#ifndef _RDS_IB_H +#define _RDS_IB_H + +#include +#include +#include "rds.h" + +#define RDS_IB_RESOLVE_TIMEOUT_MS 5000 + +#define RDS_FMR_SIZE 256 +#define RDS_FMR_POOL_SIZE 4096 + +#define RDS_IB_MAX_SGE 8 +#define RDS_IB_RECV_SGE 2 + +#define RDS_IB_DEFAULT_RECV_WR 1024 +#define RDS_IB_DEFAULT_SEND_WR 256 + +#define RDS_IB_SUPPORTED_PROTOCOLS 0x00000003 /* minor versions supported */ + +extern struct list_head rds_ib_devices; + +/* + * IB posts RDS_FRAG_SIZE fragments of pages to the receive queues to + * try and minimize the amount of memory tied up in both the device and + * socket receive queues. + */ +/* page offset of the final full frag that fits in the page */ +#define RDS_PAGE_LAST_OFF (((PAGE_SIZE / RDS_FRAG_SIZE) - 1) * RDS_FRAG_SIZE) +struct rds_page_frag { + struct list_head f_item; + struct page *f_page; + unsigned long f_offset; + dma_addr_t f_mapped; +}; + +struct rds_ib_incoming { + struct list_head ii_frags; + struct rds_incoming ii_inc; +}; + +struct rds_ib_connect_private { + /* Add new fields at the end, and don't permute existing fields.
*/ + __be32 dp_saddr; + __be32 dp_daddr; + u8 dp_protocol_major; + u8 dp_protocol_minor; + __be16 dp_protocol_minor_mask; /* bitmask */ + __be32 dp_reserved1; + __be64 dp_ack_seq; + __be32 dp_credit; /* non-zero enables flow ctl */ +}; + +struct rds_ib_send_work { + struct rds_message *s_rm; + struct rds_rdma_op *s_op; + struct ib_send_wr s_wr; + struct ib_sge s_sge[RDS_IB_MAX_SGE]; + unsigned long s_queued; +}; + +struct rds_ib_recv_work { + struct rds_ib_incoming *r_ibinc; + struct rds_page_frag *r_frag; + struct ib_recv_wr r_wr; + struct ib_sge r_sge[2]; +}; + +struct rds_ib_work_ring { + u32 w_nr; + u32 w_alloc_ptr; + u32 w_alloc_ctr; + u32 w_free_ptr; + atomic_t w_free_ctr; +}; + +struct rds_ib_device; + +struct rds_ib_connection { + + struct list_head ib_node; + struct rds_ib_device *rds_ibdev; + struct rds_connection *conn; + + /* alphabet soup, IBTA style */ + struct rdma_cm_id *i_cm_id; + struct ib_pd *i_pd; + struct ib_mr *i_mr; + struct ib_cq *i_send_cq; + struct ib_cq *i_recv_cq; + + /* tx */ + struct rds_ib_work_ring i_send_ring; + struct rds_message *i_rm; + struct rds_header *i_send_hdrs; + u64 i_send_hdrs_dma; + struct rds_ib_send_work *i_sends; + + /* rx */ + struct mutex i_recv_mutex; + struct rds_ib_work_ring i_recv_ring; + struct rds_ib_incoming *i_ibinc; + u32 i_recv_data_rem; + struct rds_header *i_recv_hdrs; + u64 i_recv_hdrs_dma; + struct rds_ib_recv_work *i_recvs; + struct rds_page_frag i_frag; + u64 i_ack_recv; /* last ACK received */ + + /* sending acks */ + unsigned long i_ack_flags; +#ifndef KERNEL_HAS_ATOMIC64 + spinlock_t i_ack_lock; /* protect i_ack_next */ + u64 i_ack_next; /* next ACK to send */ +#else + atomic64_t i_ack_next; /* next ACK to send */ +#endif + struct rds_header *i_ack; + struct ib_send_wr i_ack_wr; + struct ib_sge i_ack_sge; + u64 i_ack_dma; + unsigned long i_ack_queued; + + /* Flow control related information + * + * Our algorithm uses a pair of variables that we need to access + * atomically - one for the send
credits, and one posted + * recv credits we need to transfer to remote. + * Rather than protect them using a slow spinlock, we put both into + * a single atomic_t and update it using cmpxchg + */ + atomic_t i_credits; + + /* Protocol version specific information */ + unsigned int i_flowctl:1; /* enable/disable flow ctl */ + + /* Batched completions */ + unsigned int i_unsignaled_wrs; + long i_unsignaled_bytes; +}; + +/* This assumes that atomic_t is at least 32 bits */ +#define IB_GET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_GET_POST_CREDITS(v) ((v) >> 16) +#define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) +#define IB_SET_POST_CREDITS(v) ((v) << 16) + +struct rds_ib_ipaddr { + struct list_head list; + __be32 ipaddr; +}; + +struct rds_ib_device { + struct list_head list; + struct list_head ipaddr_list; + struct list_head conn_list; + struct ib_device *dev; + struct ib_pd *pd; + struct ib_mr *mr; + struct rds_ib_mr_pool *mr_pool; + int fmr_page_shift; + int fmr_page_size; + u64 fmr_page_mask; + unsigned int fmr_max_remaps; + unsigned int max_fmrs; + int max_sge; + unsigned int max_wrs; + spinlock_t spinlock; /* protect the above */ +}; + +/* bits for i_ack_flags */ +#define IB_ACK_IN_FLIGHT 0 +#define IB_ACK_REQUESTED 1 + +/* Magic WR_ID for ACKs */ +#define RDS_IB_ACK_WR_ID (~(u64) 0) + +struct rds_ib_statistics { + uint64_t s_ib_connect_raced; + uint64_t s_ib_listen_closed_stale; + uint64_t s_ib_tx_cq_call; + uint64_t s_ib_tx_cq_event; + uint64_t s_ib_tx_ring_full; + uint64_t s_ib_tx_throttle; + uint64_t s_ib_tx_sg_mapping_failure; + uint64_t s_ib_tx_stalled; + uint64_t s_ib_tx_credit_updates; + uint64_t s_ib_rx_cq_call; + uint64_t s_ib_rx_cq_event; + uint64_t s_ib_rx_ring_empty; + uint64_t s_ib_rx_refill_from_cq; + uint64_t s_ib_rx_refill_from_thread; + uint64_t s_ib_rx_alloc_limit; + uint64_t s_ib_rx_credit_updates; + uint64_t s_ib_ack_sent; + uint64_t s_ib_ack_send_failure; + uint64_t s_ib_ack_send_delayed; + uint64_t s_ib_ack_send_piggybacked; + uint64_t 
s_ib_ack_received; + uint64_t s_ib_rdma_mr_alloc; + uint64_t s_ib_rdma_mr_free; + uint64_t s_ib_rdma_mr_used; + uint64_t s_ib_rdma_mr_pool_flush; + uint64_t s_ib_rdma_mr_pool_wait; + uint64_t s_ib_rdma_mr_pool_depleted; +}; + +extern struct workqueue_struct *rds_ib_wq; + +/* + * Fake ib_dma_sync_sg_for_{cpu,device} as long as ib_verbs.h + * doesn't define it. + */ +static inline void rds_ib_dma_sync_sg_for_cpu(struct ib_device *dev, + struct scatterlist *sg, unsigned int sg_dma_len, int direction) +{ + unsigned int i; + + for (i = 0; i < sg_dma_len; ++i) { + ib_dma_sync_single_for_cpu(dev, + ib_sg_dma_address(dev, &sg[i]), + ib_sg_dma_len(dev, &sg[i]), + direction); + } +} +#define ib_dma_sync_sg_for_cpu rds_ib_dma_sync_sg_for_cpu + +static inline void rds_ib_dma_sync_sg_for_device(struct ib_device *dev, + struct scatterlist *sg, unsigned int sg_dma_len, int direction) +{ + unsigned int i; + + for (i = 0; i < sg_dma_len; ++i) { + ib_dma_sync_single_for_device(dev, + ib_sg_dma_address(dev, &sg[i]), + ib_sg_dma_len(dev, &sg[i]), + direction); + } +} +#define ib_dma_sync_sg_for_device rds_ib_dma_sync_sg_for_device + + +/* ib.c */ +extern struct rds_transport rds_ib_transport; +extern void rds_ib_add_one(struct ib_device *device); +extern void rds_ib_remove_one(struct ib_device *device); +extern struct ib_client rds_ib_client; + +extern unsigned int fmr_pool_size; +extern unsigned int fmr_message_size; + +extern spinlock_t ib_nodev_conns_lock; +extern struct list_head ib_nodev_conns; + +/* ib_cm.c */ +int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp); +void rds_ib_conn_free(void *arg); +int rds_ib_conn_connect(struct rds_connection *conn); +void rds_ib_conn_shutdown(struct rds_connection *conn); +void rds_ib_state_change(struct sock *sk); +int __init rds_ib_listen_init(void); +void rds_ib_listen_stop(void); +void __rds_ib_conn_error(struct rds_connection *conn, const char *, ...); + +#define rds_ib_conn_error(conn, fmt...) 
\ + __rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt) + +/* ib_rdma.c */ +int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); +int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); +void rds_ib_remove_nodev_conns(void); +void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev); +struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *); +void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo); +void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *); +void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, + struct rds_sock *rs, u32 *key_ret); +void rds_ib_sync_mr(void *trans_private, int dir); +void rds_ib_free_mr(void *trans_private, int invalidate); +void rds_ib_flush_mrs(void); + +/* ib_recv.c */ +int __init rds_ib_recv_init(void); +void rds_ib_recv_exit(void); +int rds_ib_recv(struct rds_connection *conn); +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill); +void rds_ib_inc_purge(struct rds_incoming *inc); +void rds_ib_inc_free(struct rds_incoming *inc); +int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *iov, + size_t size); +void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_ib_recv_init_ring(struct rds_ib_connection *ic); +void rds_ib_recv_clear_ring(struct rds_ib_connection *ic); +void rds_ib_recv_init_ack(struct rds_ib_connection *ic); +void rds_ib_attempt_ack(struct rds_ib_connection *ic); +void rds_ib_ack_send_complete(struct rds_ib_connection *ic); +u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic); + +/* ib_ring.c */ +void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr); +void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr); +u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos); +void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val); +void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val); 
+int rds_ib_ring_empty(struct rds_ib_work_ring *ring); +int rds_ib_ring_low(struct rds_ib_work_ring *ring); +u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring); +u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest); +extern wait_queue_head_t rds_ib_ring_empty_wait; + +/* ib_send.c */ +void rds_ib_xmit_complete(struct rds_connection *conn); +int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off); +void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context); +void rds_ib_send_init_ring(struct rds_ib_connection *ic); +void rds_ib_send_clear_ring(struct rds_ib_connection *ic); +int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op); +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits); +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted); +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, u32 wanted, + u32 *adv_credits); + +/* ib_stats.c */ +DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); +#define rds_ib_stats_inc(member) rds_stats_inc_which(rds_ib_stats, member) +unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail); + +/* ib_sysctl.c */ +int __init rds_ib_sysctl_init(void); +void rds_ib_sysctl_exit(void); +extern unsigned long rds_ib_sysctl_max_send_wr; +extern unsigned long rds_ib_sysctl_max_recv_wr; +extern unsigned long rds_ib_sysctl_max_unsig_wrs; +extern unsigned long rds_ib_sysctl_max_unsig_bytes; +extern unsigned long rds_ib_sysctl_max_recv_allocation; +extern unsigned int rds_ib_sysctl_flow_control; +extern ctl_table rds_ib_sysctl_table[]; + +/* + * Helper functions for getting/setting the header and data SGEs in + * RDS packets (not RDMA) + */ +static inline struct ib_sge * +rds_ib_header_sge(struct rds_ib_connection *ic, struct ib_sge *sge) +{ + return &sge[0]; +} + +static inline struct ib_sge * +rds_ib_data_sge(struct 
rds_ib_connection *ic, struct ib_sge *sge) +{ + return &sge[1]; +} + +#endif diff --git a/drivers/infiniband/ulp/rds/ib_cm.c b/drivers/infiniband/ulp/rds/ib_cm.c new file mode 100644 index 0000000..870070d --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_cm.c @@ -0,0 +1,882 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +static struct rdma_cm_id *rds_ib_listen_id; + +/* + * Set the selected protocol version + */ +static void rds_ib_set_protocol(struct rds_connection *conn, unsigned int version) +{ + conn->c_version = version; +} + +/* + * Set up flow control + */ +static void rds_ib_set_flow_control(struct rds_connection *conn, u32 credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (rds_ib_sysctl_flow_control && credits != 0) { + /* We're doing flow control */ + ic->i_flowctl = 1; + rds_ib_send_add_credits(conn, credits); + } else { + ic->i_flowctl = 0; + } +} + +/* + * Tune RNR behavior. Without flow control, we use a rather + * low timeout, but not the absolute minimum - this should + * be tunable. + * + * We already set the RNR retry count to 7 (which is the + * smallest infinite number :-) above. + * If flow control is on, we want to change this back to 0 + * so that we learn quickly when our credit accounting is + * buggy. + * + * Caller passes in a qp_attr pointer - don't waste stack space + * by allocating this twice. + */ +static void +rds_ib_tune_rnr(struct rds_ib_connection *ic, struct ib_qp_attr *attr) +{ + int ret; + + if (ic->i_flowctl) { + /* It seems we have to take a brief detour through SQD state + * in order to change the RNR retry count. */ + attr->qp_state = IB_QPS_SQD; + ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_STATE); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, SQD): err=%d\n", -ret); + + attr->rnr_retry = 0; + ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_RNR_RETRY); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_RNR_RETRY, 0): err=%d\n", -ret); + } else { + attr->min_rnr_timer = IB_RNR_TIMER_000_32; + ret = ib_modify_qp(ic->i_cm_id->qp, attr, IB_QP_MIN_RNR_TIMER); + if (ret) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_MIN_RNR_TIMER): err=%d\n", -ret); + } +} + +/* + * Connection established.
+ * We get here for both outgoing and incoming connections. + */ +static void rds_ib_connect_complete(struct rds_connection *conn, struct rdma_cm_event *event) +{ + const struct rds_ib_connect_private *dp = NULL; + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_device *rds_ibdev; + struct ib_qp_attr qp_attr; + int err; + + if (event->param.conn.private_data_len) { + dp = event->param.conn.private_data; + + rds_ib_set_protocol(conn, + RDS_PROTOCOL(dp->dp_protocol_major, + dp->dp_protocol_minor)); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); + } + + printk(KERN_NOTICE "RDS/IB: connected to %u.%u.%u.%u version %u.%u%s\n", + NIPQUAD(conn->c_faddr), + RDS_PROTOCOL_MAJOR(conn->c_version), + RDS_PROTOCOL_MINOR(conn->c_version), + ic->i_flowctl ? ", flow control" : ""); + + /* Tune RNR behavior */ + rds_ib_tune_rnr(ic, &qp_attr); + + qp_attr.qp_state = IB_QPS_RTS; + err = ib_modify_qp(ic->i_cm_id->qp, &qp_attr, IB_QP_STATE); + if (err) + printk(KERN_NOTICE "ib_modify_qp(IB_QP_STATE, RTS): err=%d\n", err); + + /* update ib_device with this local ipaddr & conn */ + rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); + err = rds_ib_update_ipaddr(rds_ibdev, conn->c_laddr); + if (err) + printk(KERN_ERR "rds_ib_update_ipaddr failed (%d)\n", err); + err = rds_ib_add_conn(rds_ibdev, conn); + if (err) + printk(KERN_ERR "rds_ib_add_conn failed (%d)\n", err); + + /* If the peer gave us the last packet it saw, process this as if + * we had received a regular ACK. */ + if (dp && dp->dp_ack_seq) + rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL); + + rds_connect_complete(conn); +} + +static void rds_ib_cm_fill_conn_param(struct rds_connection *conn, + struct rdma_conn_param *conn_param, + struct rds_ib_connect_private *dp, + u32 protocol_version) +{ + memset(conn_param, 0, sizeof(struct rdma_conn_param)); + /* XXX tune these?
*/ + conn_param->responder_resources = 1; + conn_param->initiator_depth = 1; + conn_param->retry_count = 7; + conn_param->rnr_retry_count = 7; + + if (dp) { + struct rds_ib_connection *ic = conn->c_transport_data; + + memset(dp, 0, sizeof(*dp)); + dp->dp_saddr = conn->c_laddr; + dp->dp_daddr = conn->c_faddr; + dp->dp_protocol_major = RDS_PROTOCOL_MAJOR(protocol_version); + dp->dp_protocol_minor = RDS_PROTOCOL_MINOR(protocol_version); + dp->dp_protocol_minor_mask = cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS); + dp->dp_ack_seq = rds_ib_piggyb_ack(ic); + + /* Advertise flow control */ + if (ic->i_flowctl) { + unsigned int credits; + + credits = IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)); + dp->dp_credit = cpu_to_be32(credits); + atomic_sub(IB_SET_POST_CREDITS(credits), &ic->i_credits); + } + + conn_param->private_data = dp; + conn_param->private_data_len = sizeof(*dp); + } +} + +static void rds_ib_cq_event_handler(struct ib_event *event, void *data) +{ + rdsdebug("event %u data %p\n", event->event, data); +} + +static void rds_ib_qp_event_handler(struct ib_event *event, void *data) +{ + struct rds_connection *conn = data; + struct rds_ib_connection *ic = conn->c_transport_data; + + rdsdebug("conn %p ic %p event %u\n", conn, ic, event->event); + + switch (event->event) { + case IB_EVENT_COMM_EST: + rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST); + break; + default: + printk(KERN_WARNING "RDS/ib: unhandled QP event %u " + "on connection to %u.%u.%u.%u\n", event->event, + NIPQUAD(conn->c_faddr)); + break; + } +} + +/* + * This needs to be very careful to not leave IS_ERR pointers around for + * cleanup to trip over. 
+ */ +static int rds_ib_setup_qp(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_device *dev = ic->i_cm_id->device; + struct ib_qp_init_attr attr; + struct rds_ib_device *rds_ibdev; + int ret; + + /* rds_ib_add_one creates a rds_ib_device object per IB device, + * and allocates a protection domain, memory range and FMR pool + * for each. If that fails for any reason, it will not register + * the rds_ibdev at all. + */ + rds_ibdev = ib_get_client_data(dev, &rds_ib_client); + if (rds_ibdev == NULL) { + if (printk_ratelimit()) + printk(KERN_NOTICE "RDS/IB: No client_data for device %s\n", + dev->name); + return -EOPNOTSUPP; + } + + if (rds_ibdev->max_wrs < ic->i_send_ring.w_nr + 1) + rds_ib_ring_resize(&ic->i_send_ring, rds_ibdev->max_wrs - 1); + if (rds_ibdev->max_wrs < ic->i_recv_ring.w_nr + 1) + rds_ib_ring_resize(&ic->i_recv_ring, rds_ibdev->max_wrs - 1); + + /* Protection domain and memory range */ + ic->i_pd = rds_ibdev->pd; + ic->i_mr = rds_ibdev->mr; + + ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler, + rds_ib_cq_event_handler, conn, + ic->i_send_ring.w_nr + 1, 0); + if (IS_ERR(ic->i_send_cq)) { + ret = PTR_ERR(ic->i_send_cq); + ic->i_send_cq = NULL; + rdsdebug("ib_create_cq send failed: %d\n", ret); + goto out; + } + + ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler, + rds_ib_cq_event_handler, conn, + ic->i_recv_ring.w_nr, 0); + if (IS_ERR(ic->i_recv_cq)) { + ret = PTR_ERR(ic->i_recv_cq); + ic->i_recv_cq = NULL; + rdsdebug("ib_create_cq recv failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); + if (ret) { + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); + if (ret) { + rdsdebug("ib_req_notify_cq recv failed: %d\n", ret); + goto out; + } + + /* XXX negotiate max send/recv with remote? 
*/ + memset(&attr, 0, sizeof(attr)); + attr.event_handler = rds_ib_qp_event_handler; + attr.qp_context = conn; + /* + 1 to allow for the single ack message */ + attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; + attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; + attr.cap.max_send_sge = rds_ibdev->max_sge; + attr.cap.max_recv_sge = RDS_IB_RECV_SGE; + attr.sq_sig_type = IB_SIGNAL_REQ_WR; + attr.qp_type = IB_QPT_RC; + attr.send_cq = ic->i_send_cq; + attr.recv_cq = ic->i_recv_cq; + + /* + * XXX this can fail if max_*_wr is too large? Are we supposed + * to back off until we get a value that the hardware can support? + */ + ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr); + if (ret) { + rdsdebug("rdma_create_qp failed: %d\n", ret); + goto out; + } + + ic->i_send_hdrs = ib_dma_alloc_coherent(dev, + ic->i_send_ring.w_nr * + sizeof(struct rds_header), + &ic->i_send_hdrs_dma, GFP_KERNEL); + if (ic->i_send_hdrs == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent send failed\n"); + goto out; + } + + ic->i_recv_hdrs = ib_dma_alloc_coherent(dev, + ic->i_recv_ring.w_nr * + sizeof(struct rds_header), + &ic->i_recv_hdrs_dma, GFP_KERNEL); + if (ic->i_recv_hdrs == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent recv failed\n"); + goto out; + } + + ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header), + &ic->i_ack_dma, GFP_KERNEL); + if (ic->i_ack == NULL) { + ret = -ENOMEM; + rdsdebug("ib_dma_alloc_coherent ack failed\n"); + goto out; + } + + ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work)); + if (ic->i_sends == NULL) { + ret = -ENOMEM; + rdsdebug("send allocation failed\n"); + goto out; + } + rds_ib_send_init_ring(ic); + + ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work)); + if (ic->i_recvs == NULL) { + ret = -ENOMEM; + rdsdebug("recv allocation failed\n"); + goto out; + } + + rds_ib_recv_init_ring(ic); + rds_ib_recv_init_ack(ic); + + /* Post receive buffers - as a side effect, this will 
update + * the posted credit count. */ + rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 1); + + rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr, + ic->i_send_cq, ic->i_recv_cq); + +out: + return ret; +} + +static u32 rds_ib_protocol_compatible(const struct rds_ib_connect_private *dp) +{ + u16 common; + u32 version = 0; + + /* rdma_cm private data is odd - when there is any private data in the + * request, we will be given a pretty large buffer without telling us the + * original size. The only way to tell the difference is by looking at + * the contents, which are initialized to zero. + * If the protocol version fields aren't set, this is a connection attempt + * from an older version. This could be 3.0 or 2.0 - we can't tell. + * We really should have changed this for OFED 1.3 :-( */ + if (dp->dp_protocol_major == 0) + return RDS_PROTOCOL_3_0; + + common = be16_to_cpu(dp->dp_protocol_minor_mask) & RDS_IB_SUPPORTED_PROTOCOLS; + if (dp->dp_protocol_major == 3 && common) { + version = RDS_PROTOCOL_3_0; + while ((common >>= 1) != 0) + version++; + } else if (printk_ratelimit()) { + printk(KERN_NOTICE "RDS: Connection from %u.%u.%u.%u using " + "incompatible protocol version %u.%u\n", + NIPQUAD(dp->dp_saddr), + dp->dp_protocol_major, + dp->dp_protocol_minor); + } + return version; +} + +static int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + __be64 lguid = cm_id->route.path_rec->sgid.global.interface_id; + __be64 fguid = cm_id->route.path_rec->dgid.global.interface_id; + const struct rds_ib_connect_private *dp = event->param.conn.private_data; + struct rds_ib_connect_private dp_rep; + struct rds_connection *conn = NULL; + struct rds_ib_connection *ic = NULL; + struct rdma_conn_param conn_param; + u32 version; + int err, destroy = 1; + + /* Check whether the remote protocol version matches ours.
*/ + version = rds_ib_protocol_compatible(dp); + if (!version) + goto out; + + rdsdebug("saddr %u.%u.%u.%u daddr %u.%u.%u.%u RDSv%u.%u lguid 0x%llx fguid " + "0x%llx\n", NIPQUAD(dp->dp_saddr), NIPQUAD(dp->dp_daddr), + RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version), + (unsigned long long)be64_to_cpu(lguid), + (unsigned long long)be64_to_cpu(fguid)); + + conn = rds_conn_create(dp->dp_daddr, dp->dp_saddr, &rds_ib_transport, + GFP_KERNEL); + if (IS_ERR(conn)) { + rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn)); + conn = NULL; + goto out; + } + ic = conn->c_transport_data; + + rds_ib_set_protocol(conn, version); + rds_ib_set_flow_control(conn, be32_to_cpu(dp->dp_credit)); + + /* If the peer gave us the last packet it saw, process this as if + * we had received a regular ACK. */ + if (dp->dp_ack_seq) + rds_send_drop_acked(conn, be64_to_cpu(dp->dp_ack_seq), NULL); + + /* + * The connection request may occur while the + * previous connection still exists, e.g. in case of failover. + * But as connections may be initiated simultaneously + * by both hosts, we have a random backoff mechanism - + * see the comment above rds_queue_reconnect() + */ + if (!rds_conn_transition(conn, RDS_CONN_DOWN, RDS_CONN_CONNECTING)) { + if (rds_conn_state(conn) == RDS_CONN_UP) { + rdsdebug("incoming connect while connecting\n"); + rds_conn_drop(conn); + rds_ib_stats_inc(s_ib_listen_closed_stale); + } else + if (rds_conn_state(conn) == RDS_CONN_CONNECTING) { + /* Wait and see - our connect may still be succeeding */ + rds_ib_stats_inc(s_ib_connect_raced); + } + goto out; + } + + BUG_ON(cm_id->context); + BUG_ON(ic->i_cm_id); + + ic->i_cm_id = cm_id; + cm_id->context = conn; + + /* We got halfway through setting up the ib_connection, if we + * fail now, we have to take the long route out of this mess.
*/ + destroy = 0; + + err = rds_ib_setup_qp(conn); + if (err) { + rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", err); + goto out; + } + + rds_ib_cm_fill_conn_param(conn, &conn_param, &dp_rep, version); + + /* rdma_accept() calls rdma_reject() internally if it fails */ + err = rdma_accept(cm_id, &conn_param); + if (err) { + rds_ib_conn_error(conn, "rdma_accept failed (%d)\n", err); + goto out; + } + + return 0; + +out: + rdma_reject(cm_id, NULL, 0); + return destroy; +} + + +static int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id) +{ + struct rds_connection *conn = cm_id->context; + struct rds_ib_connection *ic = conn->c_transport_data; + struct rdma_conn_param conn_param; + struct rds_ib_connect_private dp; + int ret; + + /* If the peer doesn't do protocol negotiation, we must + * default to RDSv3.0 */ + rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0); + ic->i_flowctl = rds_ib_sysctl_flow_control; /* advertise flow control */ + + ret = rds_ib_setup_qp(conn); + if (ret) { + rds_ib_conn_error(conn, "rds_ib_setup_qp failed (%d)\n", ret); + goto out; + } + + rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION); + + ret = rdma_connect(cm_id, &conn_param); + if (ret) + rds_ib_conn_error(conn, "rdma_connect failed (%d)\n", ret); + +out: + /* Beware - returning non-zero tells the rdma_cm to destroy + * the cm_id. We should certainly not do it as long as we still + * "own" the cm_id. */ + if (ret) { + if (ic->i_cm_id == cm_id) + ret = 0; + } + return ret; +} + +static int rds_ib_cm_event_handler(struct rdma_cm_id *cm_id, + struct rdma_cm_event *event) +{ + /* this can be null in the listening path */ + struct rds_connection *conn = cm_id->context; + int ret = 0; + + rdsdebug("conn %p id %p handling event %u\n", conn, cm_id, + event->event); + + /* Prevent shutdown from tearing down the connection + * while we're executing. */ + if (conn) { + mutex_lock(&conn->c_cm_lock); + + /* If the connection is being shut down, bail out + * right away. 
We return 0 so cm_id doesn't get + * destroyed prematurely */ + if (rds_conn_state(conn) == RDS_CONN_DISCONNECTING) { + /* Reject incoming connections while we're tearing + * down an existing one. */ + if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) + ret = 1; + goto out; + } + } + + switch (event->event) { + case RDMA_CM_EVENT_CONNECT_REQUEST: + ret = rds_ib_cm_handle_connect(cm_id, event); + break; + + case RDMA_CM_EVENT_ADDR_RESOLVED: + /* XXX do we need to clean up if this fails? */ + ret = rdma_resolve_route(cm_id, + RDS_IB_RESOLVE_TIMEOUT_MS); + break; + + case RDMA_CM_EVENT_ROUTE_RESOLVED: + /* XXX worry about racing with listen acceptance */ + ret = rds_ib_cm_initiate_connect(cm_id); + break; + + case RDMA_CM_EVENT_ESTABLISHED: + rds_ib_connect_complete(conn, event); + break; + + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + case RDMA_CM_EVENT_REJECTED: + case RDMA_CM_EVENT_DEVICE_REMOVAL: + case RDMA_CM_EVENT_ADDR_CHANGE: + if (conn) + rds_conn_drop(conn); + break; + + case RDMA_CM_EVENT_DISCONNECTED: + rds_conn_drop(conn); + break; + + default: + /* things like device disconnect? 
*/ + printk(KERN_ERR "unknown event %u\n", event->event); + BUG(); + break; + } + +out: + if (conn) { + struct rds_ib_connection *ic = conn->c_transport_data; + + /* If we return non-zero, we must not hang on to the cm_id */ + BUG_ON(ic->i_cm_id == cm_id && ret); + + mutex_unlock(&conn->c_cm_lock); + } + + rdsdebug("id %p event %u handling ret %d\n", cm_id, event->event, ret); + + return ret; +} + +int rds_ib_conn_connect(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct sockaddr_in src, dest; + int ret; + + /* XXX I wonder what effect the port space has */ + ic->i_cm_id = rdma_create_id(rds_ib_cm_event_handler, conn, + RDMA_PS_TCP); + if (IS_ERR(ic->i_cm_id)) { + ret = PTR_ERR(ic->i_cm_id); + ic->i_cm_id = NULL; + rdsdebug("rdma_create_id() failed: %d\n", ret); + goto out; + } + + rdsdebug("created cm id %p for conn %p\n", ic->i_cm_id, conn); + + src.sin_family = AF_INET; + src.sin_addr.s_addr = (__force u32)conn->c_laddr; + src.sin_port = (__force u16)htons(0); + + dest.sin_family = AF_INET; + dest.sin_addr.s_addr = (__force u32)conn->c_faddr; + dest.sin_port = (__force u16)htons(RDS_PORT); + + ret = rdma_resolve_addr(ic->i_cm_id, (struct sockaddr *)&src, + (struct sockaddr *)&dest, + RDS_IB_RESOLVE_TIMEOUT_MS); + if (ret) { + rdsdebug("addr resolve failed for cm id %p: %d\n", ic->i_cm_id, + ret); + rdma_destroy_id(ic->i_cm_id); + ic->i_cm_id = NULL; + } + +out: + return ret; +} + +/* + * This is so careful about only cleaning up resources that were built up + * so that it can be called at any point during startup. In fact it + * can be called multiple times for a given connection. + */ +void rds_ib_conn_shutdown(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + int err = 0; + + rdsdebug("cm %p pd %p cq %p %p qp %p\n", ic->i_cm_id, + ic->i_pd, ic->i_send_cq, ic->i_recv_cq, + ic->i_cm_id ?
ic->i_cm_id->qp : NULL); + + if (ic->i_cm_id) { + struct ib_device *dev = ic->i_cm_id->device; + + rdsdebug("disconnecting cm %p\n", ic->i_cm_id); + err = rdma_disconnect(ic->i_cm_id); + if (err) { + /* Actually this may happen quite frequently, when + * an outgoing connect raced with an incoming connect. + */ + rdsdebug("failed to disconnect, cm: %p err %d\n", + ic->i_cm_id, err); + } + + wait_event(rds_ib_ring_empty_wait, + rds_ib_ring_empty(&ic->i_send_ring) && + rds_ib_ring_empty(&ic->i_recv_ring)); + + if (ic->i_send_hdrs) + ib_dma_free_coherent(dev, + ic->i_send_ring.w_nr * + sizeof(struct rds_header), + ic->i_send_hdrs, + ic->i_send_hdrs_dma); + + if (ic->i_recv_hdrs) + ib_dma_free_coherent(dev, + ic->i_recv_ring.w_nr * + sizeof(struct rds_header), + ic->i_recv_hdrs, + ic->i_recv_hdrs_dma); + + if (ic->i_ack) + ib_dma_free_coherent(dev, sizeof(struct rds_header), + ic->i_ack, ic->i_ack_dma); + + if (ic->i_sends) + rds_ib_send_clear_ring(ic); + if (ic->i_recvs) + rds_ib_recv_clear_ring(ic); + + if (ic->i_cm_id->qp) + rdma_destroy_qp(ic->i_cm_id); + if (ic->i_send_cq) + ib_destroy_cq(ic->i_send_cq); + if (ic->i_recv_cq) + ib_destroy_cq(ic->i_recv_cq); + rdma_destroy_id(ic->i_cm_id); + + /* + * Move connection back to the nodev list. 
+ */ + if (ic->rds_ibdev) { + + spin_lock_irq(&ic->rds_ibdev->spinlock); + BUG_ON(list_empty(&ic->ib_node)); + list_del(&ic->ib_node); + spin_unlock_irq(&ic->rds_ibdev->spinlock); + + spin_lock_irq(&ib_nodev_conns_lock); + list_add_tail(&ic->ib_node, &ib_nodev_conns); + spin_unlock_irq(&ib_nodev_conns_lock); + ic->rds_ibdev = NULL; + } + + ic->i_cm_id = NULL; + ic->i_pd = NULL; + ic->i_mr = NULL; + ic->i_send_cq = NULL; + ic->i_recv_cq = NULL; + ic->i_send_hdrs = NULL; + ic->i_recv_hdrs = NULL; + ic->i_ack = NULL; + } + BUG_ON(ic->rds_ibdev); + + /* Clear pending transmit */ + if (ic->i_rm) { + rds_message_put(ic->i_rm); + ic->i_rm = NULL; + } + + /* Clear the ACK state */ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); +#ifdef KERNEL_HAS_ATOMIC64 + atomic64_set(&ic->i_ack_next, 0); +#else + ic->i_ack_next = 0; +#endif + ic->i_ack_recv = 0; + + /* Clear flow control state */ + ic->i_flowctl = 0; + atomic_set(&ic->i_credits, 0); + + rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr); + rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr); + + if (ic->i_ibinc) { + rds_inc_put(&ic->i_ibinc->ii_inc); + ic->i_ibinc = NULL; + } + + vfree(ic->i_sends); + ic->i_sends = NULL; + vfree(ic->i_recvs); + ic->i_recvs = NULL; +} + +int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp) +{ + struct rds_ib_connection *ic; + unsigned long flags; + + /* XXX too lazy? */ + ic = kzalloc(sizeof(struct rds_ib_connection), GFP_KERNEL); + if (ic == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&ic->ib_node); + mutex_init(&ic->i_recv_mutex); +#ifndef KERNEL_HAS_ATOMIC64 + spin_lock_init(&ic->i_ack_lock); +#endif + + /* + * rds_ib_conn_shutdown() waits for these to be emptied so they + * must be initialized before it can be called. 
+ */ + rds_ib_ring_init(&ic->i_send_ring, rds_ib_sysctl_max_send_wr); + rds_ib_ring_init(&ic->i_recv_ring, rds_ib_sysctl_max_recv_wr); + + ic->conn = conn; + conn->c_transport_data = ic; + + spin_lock_irqsave(&ib_nodev_conns_lock, flags); + list_add_tail(&ic->ib_node, &ib_nodev_conns); + spin_unlock_irqrestore(&ib_nodev_conns_lock, flags); + + + rdsdebug("conn %p conn ic %p\n", conn, conn->c_transport_data); + return 0; +} + +void rds_ib_conn_free(void *arg) +{ + struct rds_ib_connection *ic = arg; + rdsdebug("ic %p\n", ic); + list_del(&ic->ib_node); + kfree(ic); +} + +int __init rds_ib_listen_init(void) +{ + struct sockaddr_in sin; + struct rdma_cm_id *cm_id; + int ret; + + cm_id = rdma_create_id(rds_ib_cm_event_handler, NULL, RDMA_PS_TCP); + if (IS_ERR(cm_id)) { + ret = PTR_ERR(cm_id); + printk(KERN_ERR "RDS/ib: failed to setup listener, " + "rdma_create_id() returned %d\n", ret); + goto out; + } + + sin.sin_family = PF_INET; + sin.sin_addr.s_addr = (__force u32)htonl(INADDR_ANY); + sin.sin_port = (__force u16)htons(RDS_PORT); + + /* + * XXX I bet this binds the cm_id to a device. If we want to support + * fail-over we'll have to take this into consideration. + */ + ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin); + if (ret) { + printk(KERN_ERR "RDS/ib: failed to setup listener, " + "rdma_bind_addr() returned %d\n", ret); + goto out; + } + + ret = rdma_listen(cm_id, 128); + if (ret) { + printk(KERN_ERR "RDS/ib: failed to setup listener, " + "rdma_listen() returned %d\n", ret); + goto out; + } + + rdsdebug("cm %p listening on port %u\n", cm_id, RDS_PORT); + + rds_ib_listen_id = cm_id; + cm_id = NULL; +out: + if (cm_id) + rdma_destroy_id(cm_id); + return ret; +} + +void rds_ib_listen_stop(void) +{ + if (rds_ib_listen_id) { + rdsdebug("cm %p\n", rds_ib_listen_id); + rdma_destroy_id(rds_ib_listen_id); + rds_ib_listen_id = NULL; + } +} + +/* + * An error occurred on the connection + */ +void +__rds_ib_conn_error(struct rds_connection *conn, const char *fmt, ...)
+{ + va_list ap; + + rds_conn_drop(conn); + + va_start(ap, fmt); + vprintk(fmt, ap); + va_end(ap); +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:51 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:51 -0800 Subject: [ofa-general] [PATCH 14/21] RDS/IB: Ring-handling code. In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-15-git-send-email-andy.grover@oracle.com> Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/ib_ring.c | 168 ++++++++++++++++++++++++++++++++++ 1 files changed, 168 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c diff --git a/drivers/infiniband/ulp/rds/ib_ring.c b/drivers/infiniband/ulp/rds/ib_ring.c new file mode 100644 index 0000000..d23cc59 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_ring.c @@ -0,0 +1,168 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" +#include "ib.h" + +/* + * Locking for IB rings. + * We assume that allocation is always protected by a mutex + * in the caller (this is a valid assumption for the current + * implementation). + * + * Freeing always happens in an interrupt, and hence only + * races with allocations, but not with other free()s. + * + * The interaction between allocation and freeing is that + * the alloc code has to determine the number of free entries. + * To this end, we maintain two counters; an allocation counter + * and a free counter. Both are allowed to run freely, and wrap + * around. + * The number of used entries is always (alloc_ctr - free_ctr) % NR. + * + * The current implementation makes free_ctr atomic. When the + * caller finds an allocation fails, it should set an "alloc fail" + * bit and retry the allocation. The "alloc fail" bit essentially tells + * the CQ completion handlers to wake it up after freeing some + * more entries. + */ + +/* + * This only happens on shutdown. 
+ */ +DECLARE_WAIT_QUEUE_HEAD(rds_ib_ring_empty_wait); + +void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr) +{ + memset(ring, 0, sizeof(*ring)); + ring->w_nr = nr; + rdsdebug("ring %p nr %u\n", ring, ring->w_nr); +} + +static inline u32 __rds_ib_ring_used(struct rds_ib_work_ring *ring) +{ + u32 diff; + + /* This assumes that atomic_t has at least as many bits as u32 */ + diff = ring->w_alloc_ctr - (u32) atomic_read(&ring->w_free_ctr); + BUG_ON(diff > ring->w_nr); + + return diff; +} + +void rds_ib_ring_resize(struct rds_ib_work_ring *ring, u32 nr) +{ + /* We only ever get called from the connection setup code, + * prior to creating the QP. */ + BUG_ON(__rds_ib_ring_used(ring)); + ring->w_nr = nr; +} + +static int __rds_ib_ring_empty(struct rds_ib_work_ring *ring) +{ + return __rds_ib_ring_used(ring) == 0; +} + +u32 rds_ib_ring_alloc(struct rds_ib_work_ring *ring, u32 val, u32 *pos) +{ + u32 ret = 0, avail; + + avail = ring->w_nr - __rds_ib_ring_used(ring); + + rdsdebug("ring %p val %u next %u free %u\n", ring, val, + ring->w_alloc_ptr, avail); + + if (val && avail) { + ret = min(val, avail); + *pos = ring->w_alloc_ptr; + + ring->w_alloc_ptr = (ring->w_alloc_ptr + ret) % ring->w_nr; + ring->w_alloc_ctr += ret; + } + + return ret; +} + +void rds_ib_ring_free(struct rds_ib_work_ring *ring, u32 val) +{ + ring->w_free_ptr = (ring->w_free_ptr + val) % ring->w_nr; + atomic_add(val, &ring->w_free_ctr); + + if (__rds_ib_ring_empty(ring) && + waitqueue_active(&rds_ib_ring_empty_wait)) + wake_up(&rds_ib_ring_empty_wait); +} + +void rds_ib_ring_unalloc(struct rds_ib_work_ring *ring, u32 val) +{ + ring->w_alloc_ptr = (ring->w_alloc_ptr - val) % ring->w_nr; + ring->w_alloc_ctr -= val; +} + +int rds_ib_ring_empty(struct rds_ib_work_ring *ring) +{ + return __rds_ib_ring_empty(ring); +} + +int rds_ib_ring_low(struct rds_ib_work_ring *ring) +{ + return __rds_ib_ring_used(ring) <= (ring->w_nr >> 2); +} + +/* + * returns the oldest alloced ring entry. 
This will be the next one + * freed. This can't be called if there are none allocated. + */ +u32 rds_ib_ring_oldest(struct rds_ib_work_ring *ring) +{ + return ring->w_free_ptr; +} + +/* + * returns the number of completed work requests. + */ + +u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest) +{ + u32 ret; + + if (oldest <= (unsigned long long)wr_id) + ret = (unsigned long long)wr_id - oldest + 1; + else + ret = ring->w_nr - oldest + (unsigned long long)wr_id + 1; + + rdsdebug("ring %p ret %u wr_id %u oldest %u\n", ring, ret, + wr_id, oldest); + return ret; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:44 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:44 -0800 Subject: [ofa-general] [PATCH 07/21] RDS: loopback In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-8-git-send-email-andy.grover@oracle.com> A simple rds transport to handle loopback connections. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/loop.c | 189 +++++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/loop.h | 9 ++ 2 files changed, 198 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/loop.c create mode 100644 drivers/infiniband/ulp/rds/loop.h diff --git a/drivers/infiniband/ulp/rds/loop.c b/drivers/infiniband/ulp/rds/loop.c new file mode 100644 index 0000000..40fa729 --- /dev/null +++ b/drivers/infiniband/ulp/rds/loop.c @@ -0,0 +1,189 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include + +#include "rds.h" +#include "loop.h" + +static DEFINE_SPINLOCK(loop_conns_lock); +static LIST_HEAD(loop_conns); + +/* + * This 'loopback' transport is a special case for flows that originate + * and terminate on the same machine. + * + * Connection build-up notices if the destination address is thought of + * as a local address by a transport. At that time it decides to use the + * loopback transport instead of the bound transport of the sending socket. + * + * The loopback transport's sending path just hands the sent rds_message + * straight to the receiving path via an embedded rds_incoming. + */ + +/* + * Usually a message transits both the sender and receiver's conns as it + * flows to the receiver. 
In the loopback case, though, the receive path + * is handed the sending conn so the sense of the addresses is reversed. + */ +static int rds_loop_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, + unsigned int off) +{ + BUG_ON(hdr_off || sg || off); + + rds_inc_init(&rm->m_inc, conn, conn->c_laddr); + rds_message_addref(rm); /* for the inc */ + + rds_recv_incoming(conn, conn->c_laddr, conn->c_faddr, &rm->m_inc, + GFP_KERNEL, KM_USER0); + + rds_send_drop_acked(conn, be64_to_cpu(rm->m_inc.i_hdr.h_sequence), + NULL); + + rds_inc_put(&rm->m_inc); + + return sizeof(struct rds_header) + be32_to_cpu(rm->m_inc.i_hdr.h_len); +} + +static int rds_loop_xmit_cong_map(struct rds_connection *conn, + struct rds_cong_map *map, + unsigned long offset) +{ + unsigned long i; + + BUG_ON(offset); + BUG_ON(map != conn->c_lcong); + + for (i = 0; i < RDS_CONG_MAP_PAGES; i++) { + memcpy((void *)conn->c_fcong->m_page_addrs[i], + (void *)map->m_page_addrs[i], PAGE_SIZE); + } + + rds_cong_map_updated(conn->c_fcong, ~(u64) 0); + + return sizeof(struct rds_header) + RDS_CONG_MAP_BYTES; +} + +/* we need to at least give the thread something to succeed */ +static int rds_loop_recv(struct rds_connection *conn) +{ + return 0; +} + +struct rds_loop_connection +{ + struct list_head loop_node; + struct rds_connection *conn; +}; + +/* + * Even the loopback transport needs to keep track of its connections, + * so it can call rds_conn_destroy() on them on exit. N.B. there are + * 1+ loopback addresses (127.*.*.*) so it's not a bug to have + * multiple loopback conns allocated, although rather useless. 
+ */ +static int rds_loop_conn_alloc(struct rds_connection *conn, gfp_t gfp) +{ + struct rds_loop_connection *lc; + unsigned long flags; + + lc = kzalloc(sizeof(struct rds_loop_connection), GFP_KERNEL); + if (lc == NULL) + return -ENOMEM; + + INIT_LIST_HEAD(&lc->loop_node); + lc->conn = conn; + conn->c_transport_data = lc; + + spin_lock_irqsave(&loop_conns_lock, flags); + list_add_tail(&lc->loop_node, &loop_conns); + spin_unlock_irqrestore(&loop_conns_lock, flags); + + return 0; +} + +static void rds_loop_conn_free(void *arg) +{ + struct rds_loop_connection *lc = arg; + rdsdebug("lc %p\n", lc); + list_del(&lc->loop_node); + kfree(lc); +} + +static int rds_loop_conn_connect(struct rds_connection *conn) +{ + rds_connect_complete(conn); + return 0; +} + +static void rds_loop_conn_shutdown(struct rds_connection *conn) +{ +} + +void rds_loop_exit(void) +{ + struct rds_loop_connection *lc, *_lc; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&loop_conns_lock); + list_splice(&loop_conns, &tmp_list); + INIT_LIST_HEAD(&loop_conns); + spin_unlock_irq(&loop_conns_lock); + + list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) { + WARN_ON(lc->conn->c_passive); + rds_conn_destroy(lc->conn); + } +} + +/* + * This is missing .xmit_* because loop doesn't go through generic + * rds_send_xmit() and doesn't call rds_recv_incoming(). .listen_stop and + * .laddr_check are missing because transport.c doesn't iterate over + * rds_loop_transport. 
+ */ +struct rds_transport rds_loop_transport = { + .xmit = rds_loop_xmit, + .xmit_cong_map = rds_loop_xmit_cong_map, + .recv = rds_loop_recv, + .conn_alloc = rds_loop_conn_alloc, + .conn_free = rds_loop_conn_free, + .conn_connect = rds_loop_conn_connect, + .conn_shutdown = rds_loop_conn_shutdown, + .inc_copy_to_user = rds_message_inc_copy_to_user, + .inc_purge = rds_message_inc_purge, + .inc_free = rds_message_inc_free, + .t_name = "loopback", +}; diff --git a/drivers/infiniband/ulp/rds/loop.h b/drivers/infiniband/ulp/rds/loop.h new file mode 100644 index 0000000..f32b093 --- /dev/null +++ b/drivers/infiniband/ulp/rds/loop.h @@ -0,0 +1,9 @@ +#ifndef _RDS_LOOP_H +#define _RDS_LOOP_H + +/* loop.c */ +extern struct rds_transport rds_loop_transport; + +void rds_loop_exit(void); + +#endif -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:52 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:52 -0800 Subject: [ofa-general] [PATCH 15/21] RDS/IB: Implement RDMA ops using FMRs In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-16-git-send-email-andy.grover@oracle.com> Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/ib_rdma.c | 641 ++++++++++++++++++++++++++++++++++ 1 files changed, 641 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c diff --git a/drivers/infiniband/ulp/rds/ib_rdma.c b/drivers/infiniband/ulp/rds/ib_rdma.c new file mode 100644 index 0000000..69a6289 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_rdma.c @@ -0,0 +1,641 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" +#include "rdma.h" +#include "ib.h" + + +/* + * This is stored as mr->r_trans_private. 
+ */ +struct rds_ib_mr { + struct rds_ib_device *device; + struct rds_ib_mr_pool *pool; + struct ib_fmr *fmr; + struct list_head list; + unsigned int remap_count; + + struct scatterlist *sg; + unsigned int sg_len; + u64 *dma; + int sg_dma_len; +}; + +/* + * Our own little FMR pool + */ +struct rds_ib_mr_pool { + struct mutex flush_lock; /* serialize fmr invalidate */ + struct work_struct flush_worker; /* flush worker */ + + spinlock_t list_lock; /* protect variables below */ + atomic_t item_count; /* total # of MRs */ + atomic_t dirty_count; /* # of dirty MRs */ + struct list_head drop_list; /* MRs that have reached their max_maps limit */ + struct list_head free_list; /* unused MRs */ + struct list_head clean_list; /* unused & unmapped MRs */ + atomic_t free_pinned; /* memory pinned by free MRs */ + unsigned long max_items; + unsigned long max_items_soft; + unsigned long max_free_pinned; + struct ib_fmr_attr fmr_attr; +}; + +static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all); +static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr); +static void rds_ib_mr_pool_flush_worker(struct work_struct *work); + +static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr) +{ + struct rds_ib_device *rds_ibdev; + struct rds_ib_ipaddr *i_ipaddr; + + list_for_each_entry(rds_ibdev, &rds_ib_devices, list) { + spin_lock_irq(&rds_ibdev->spinlock); + list_for_each_entry(i_ipaddr, &rds_ibdev->ipaddr_list, list) { + if (i_ipaddr->ipaddr == ipaddr) { + spin_unlock_irq(&rds_ibdev->spinlock); + return rds_ibdev; + } + } + spin_unlock_irq(&rds_ibdev->spinlock); + } + + return NULL; +} + +static int rds_ib_add_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +{ + struct rds_ib_ipaddr *i_ipaddr; + + i_ipaddr = kmalloc(sizeof *i_ipaddr, GFP_KERNEL); + if (!i_ipaddr) + return -ENOMEM; + + i_ipaddr->ipaddr = ipaddr; + + spin_lock_irq(&rds_ibdev->spinlock); + list_add_tail(&i_ipaddr->list, &rds_ibdev->ipaddr_list); + spin_unlock_irq(&rds_ibdev->spinlock); + +
return 0; +} + +static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +{ + struct rds_ib_ipaddr *i_ipaddr, *next; + + spin_lock_irq(&rds_ibdev->spinlock); + list_for_each_entry_safe(i_ipaddr, next, &rds_ibdev->ipaddr_list, list) { + if (i_ipaddr->ipaddr == ipaddr) { + list_del(&i_ipaddr->list); + kfree(i_ipaddr); + break; + } + } + spin_unlock_irq(&rds_ibdev->spinlock); +} + +int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +{ + struct rds_ib_device *rds_ibdev_old; + + rds_ibdev_old = rds_ib_get_device(ipaddr); + if (rds_ibdev_old) + rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr); + + return rds_ib_add_ipaddr(rds_ibdev, ipaddr); +} + +int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + /* conn was previously on the nodev_conns_list */ + spin_lock_irq(&ib_nodev_conns_lock); + BUG_ON(list_empty(&ib_nodev_conns)); + BUG_ON(list_empty(&ic->ib_node)); + list_del(&ic->ib_node); + spin_unlock_irq(&ib_nodev_conns_lock); + + spin_lock_irq(&rds_ibdev->spinlock); + list_add_tail(&ic->ib_node, &rds_ibdev->conn_list); + spin_unlock_irq(&rds_ibdev->spinlock); + + ic->rds_ibdev = rds_ibdev; + + return 0; +} + +void rds_ib_remove_nodev_conns(void) +{ + struct rds_ib_connection *ic, *_ic; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&ib_nodev_conns_lock); + list_splice(&ib_nodev_conns, &tmp_list); + INIT_LIST_HEAD(&ib_nodev_conns); + spin_unlock_irq(&ib_nodev_conns_lock); + + list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) { + if (ic->conn->c_passive) + rds_conn_destroy(ic->conn->c_passive); + rds_conn_destroy(ic->conn); + } +} + +void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_connection *ic, *_ic; + LIST_HEAD(tmp_list); + + /* avoid calling conn_destroy with irqs off */ + spin_lock_irq(&rds_ibdev->spinlock); + list_splice(&rds_ibdev->conn_list, &tmp_list); + 
INIT_LIST_HEAD(&rds_ibdev->conn_list); + spin_unlock_irq(&rds_ibdev->spinlock); + + list_for_each_entry_safe(ic, _ic, &tmp_list, ib_node) { + if (ic->conn->c_passive) + rds_conn_destroy(ic->conn->c_passive); + rds_conn_destroy(ic->conn); + } +} + +struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_mr_pool *pool; + + pool = kzalloc(sizeof(*pool), GFP_KERNEL); + if (!pool) + return ERR_PTR(-ENOMEM); + + INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->drop_list); + INIT_LIST_HEAD(&pool->clean_list); + mutex_init(&pool->flush_lock); + spin_lock_init(&pool->list_lock); + INIT_WORK(&pool->flush_worker, rds_ib_mr_pool_flush_worker); + + pool->fmr_attr.max_pages = fmr_message_size; + pool->fmr_attr.max_maps = rds_ibdev->fmr_max_remaps; + pool->fmr_attr.page_shift = rds_ibdev->fmr_page_shift; + pool->max_free_pinned = rds_ibdev->max_fmrs * fmr_message_size / 4; + + /* We never allow more than max_items MRs to be allocated. + * When we exceed more than max_items_soft, we start freeing + * items more aggressively. 
+ * Make sure that max_items > max_items_soft > max_items / 2 + */ + pool->max_items_soft = rds_ibdev->max_fmrs * 3 / 4; + pool->max_items = rds_ibdev->max_fmrs; + + return pool; +} + +void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_rdma_connection *iinfo) +{ + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + + iinfo->rdma_mr_max = pool->max_items; + iinfo->rdma_mr_size = pool->fmr_attr.max_pages; +} + +void rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *pool) +{ + flush_workqueue(rds_wq); + rds_ib_flush_mr_pool(pool, 1); + BUG_ON(atomic_read(&pool->item_count)); + BUG_ON(atomic_read(&pool->free_pinned)); + kfree(pool); +} + +static inline struct rds_ib_mr *rds_ib_reuse_fmr(struct rds_ib_mr_pool *pool) +{ + struct rds_ib_mr *ibmr = NULL; + unsigned long flags; + + spin_lock_irqsave(&pool->list_lock, flags); + if (!list_empty(&pool->clean_list)) { + ibmr = list_entry(pool->clean_list.next, struct rds_ib_mr, list); + list_del_init(&ibmr->list); + } + spin_unlock_irqrestore(&pool->list_lock, flags); + + return ibmr; +} + +static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + struct rds_ib_mr *ibmr = NULL; + int err = 0, iter = 0; + + while (1) { + ibmr = rds_ib_reuse_fmr(pool); + if (ibmr) + return ibmr; + + /* No clean MRs - now we have the choice of either + * allocating a fresh MR up to the limit imposed by the + * driver, or flush any dirty unused MRs. + * We try to avoid stalling in the send path if possible, + * so we allocate as long as we're allowed to. + * + * We're fussy with enforcing the FMR limit, though. If the driver + * tells us we can't use more than N fmrs, we shouldn't start + * arguing with it */ + if (atomic_inc_return(&pool->item_count) <= pool->max_items) + break; + + atomic_dec(&pool->item_count); + + if (++iter > 2) { + rds_ib_stats_inc(s_ib_rdma_mr_pool_depleted); + return ERR_PTR(-EAGAIN); + } + + /* We do have some empty MRs. 
Flush them out. */ + rds_ib_stats_inc(s_ib_rdma_mr_pool_wait); + rds_ib_flush_mr_pool(pool, 0); + } + + ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL); + if (!ibmr) { + err = -ENOMEM; + goto out_no_cigar; + } + + ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, + (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE), + &pool->fmr_attr); + if (IS_ERR(ibmr->fmr)) { + err = PTR_ERR(ibmr->fmr); + ibmr->fmr = NULL; + printk(KERN_WARNING "RDS/IB: ib_alloc_fmr failed (err=%d)\n", err); + goto out_no_cigar; + } + + rds_ib_stats_inc(s_ib_rdma_mr_alloc); + return ibmr; + +out_no_cigar: + if (ibmr) { + if (ibmr->fmr) + ib_dealloc_fmr(ibmr->fmr); + kfree(ibmr); + } + atomic_dec(&pool->item_count); + return ERR_PTR(err); +} + +static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, + struct scatterlist *sg, unsigned int nents) +{ + struct ib_device *dev = rds_ibdev->dev; + struct scatterlist *scat = sg; + u64 io_addr = 0; + u64 *dma_pages; + u32 len; + int page_cnt, sg_dma_len; + int i, j; + int ret; + + sg_dma_len = ib_dma_map_sg(dev, sg, nents, + DMA_BIDIRECTIONAL); + if (unlikely(!sg_dma_len)) { + printk(KERN_WARNING "RDS/IB: dma_map_sg failed!\n"); + return -EBUSY; + } + + len = 0; + page_cnt = 0; + + for (i = 0; i < sg_dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]); + u64 dma_addr = ib_sg_dma_address(dev, &scat[i]); + + if (dma_addr & ~rds_ibdev->fmr_page_mask) { + if (i > 0) + return -EINVAL; + else + ++page_cnt; + } + if ((dma_addr + dma_len) & ~rds_ibdev->fmr_page_mask) { + if (i < sg_dma_len - 1) + return -EINVAL; + else + ++page_cnt; + } + + len += dma_len; + } + + page_cnt += len >> rds_ibdev->fmr_page_shift; + if (page_cnt > fmr_message_size) + return -EINVAL; + + dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC); + if (!dma_pages) + return -ENOMEM; + + page_cnt = 0; + for (i = 0; i < sg_dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]); + u64 dma_addr = ib_sg_dma_address(dev, 
&scat[i]); + + for (j = 0; j < dma_len; j += rds_ibdev->fmr_page_size) + dma_pages[page_cnt++] = + (dma_addr & rds_ibdev->fmr_page_mask) + j; + } + + ret = ib_map_phys_fmr(ibmr->fmr, + dma_pages, page_cnt, io_addr); + if (ret) + goto out; + + /* Success - we successfully remapped the MR, so we can + * safely tear down the old mapping. */ + rds_ib_teardown_mr(ibmr); + + ibmr->sg = scat; + ibmr->sg_len = nents; + ibmr->sg_dma_len = sg_dma_len; + ibmr->remap_count++; + + rds_ib_stats_inc(s_ib_rdma_mr_used); + ret = 0; + +out: + kfree(dma_pages); + + return ret; +} + +void rds_ib_sync_mr(void *trans_private, int direction) +{ + struct rds_ib_mr *ibmr = trans_private; + struct rds_ib_device *rds_ibdev = ibmr->device; + + switch (direction) { + case DMA_FROM_DEVICE: + ib_dma_sync_sg_for_cpu(rds_ibdev->dev, ibmr->sg, + ibmr->sg_dma_len, DMA_BIDIRECTIONAL); + break; + case DMA_TO_DEVICE: + ib_dma_sync_sg_for_device(rds_ibdev->dev, ibmr->sg, + ibmr->sg_dma_len, DMA_BIDIRECTIONAL); + break; + } +} + +static void __rds_ib_teardown_mr(struct rds_ib_mr *ibmr) +{ + struct rds_ib_device *rds_ibdev = ibmr->device; + + if (ibmr->sg_dma_len) { + ib_dma_unmap_sg(rds_ibdev->dev, + ibmr->sg, ibmr->sg_len, + DMA_BIDIRECTIONAL); + ibmr->sg_dma_len = 0; + } + + /* Release the s/g list */ + if (ibmr->sg_len) { + unsigned int i; + + for (i = 0; i < ibmr->sg_len; ++i) { + struct page *page = sg_page(&ibmr->sg[i]); + + /* FIXME we need a way to tell a r/w MR + * from a r/o MR */ + set_page_dirty(page); + put_page(page); + } + kfree(ibmr->sg); + + ibmr->sg = NULL; + ibmr->sg_len = 0; + } +} + +static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr) +{ + unsigned int pinned = ibmr->sg_len; + + __rds_ib_teardown_mr(ibmr); + if (pinned) { + struct rds_ib_device *rds_ibdev = ibmr->device; + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + + atomic_sub(pinned, &pool->free_pinned); + } +} + +static inline unsigned int rds_ib_flush_goal(struct rds_ib_mr_pool *pool, int free_all) +{ + unsigned int 
item_count; + + item_count = atomic_read(&pool->item_count); + if (free_all) + return item_count; + + return 0; +} + +/* + * Flush our pool of MRs. + * At a minimum, all currently unused MRs are unmapped. + * If the number of MRs allocated exceeds the limit, we also try + * to free as many MRs as needed to get back to this limit. + */ +static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all) +{ + struct rds_ib_mr *ibmr, *next; + LIST_HEAD(unmap_list); + LIST_HEAD(fmr_list); + unsigned long unpinned = 0; + unsigned long flags; + unsigned int nfreed = 0, ncleaned = 0, free_goal; + int ret = 0; + + rds_ib_stats_inc(s_ib_rdma_mr_pool_flush); + + mutex_lock(&pool->flush_lock); + + spin_lock_irqsave(&pool->list_lock, flags); + /* Get the list of all MRs to be dropped. Ordering matters - + * we want to put drop_list ahead of free_list. */ + list_splice_init(&pool->free_list, &unmap_list); + list_splice_init(&pool->drop_list, &unmap_list); + if (free_all) + list_splice_init(&pool->clean_list, &unmap_list); + spin_unlock_irqrestore(&pool->list_lock, flags); + + free_goal = rds_ib_flush_goal(pool, free_all); + + if (list_empty(&unmap_list)) + goto out; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry(ibmr, &unmap_list, list) + list_add(&ibmr->fmr->list, &fmr_list); + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "RDS/IB: ib_unmap_fmr failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(ibmr, next, &unmap_list, list) { + unpinned += ibmr->sg_len; + __rds_ib_teardown_mr(ibmr); + if (nfreed < free_goal || ibmr->remap_count >= pool->fmr_attr.max_maps) { + rds_ib_stats_inc(s_ib_rdma_mr_free); + list_del(&ibmr->list); + ib_dealloc_fmr(ibmr->fmr); + kfree(ibmr); + nfreed++; + } + ncleaned++; + } + + spin_lock_irqsave(&pool->list_lock, flags); + list_splice(&unmap_list, &pool->clean_list); + spin_unlock_irqrestore(&pool->list_lock, 
flags); + + atomic_sub(unpinned, &pool->free_pinned); + atomic_sub(ncleaned, &pool->dirty_count); + atomic_sub(nfreed, &pool->item_count); + +out: + mutex_unlock(&pool->flush_lock); + return ret; +} + +static void rds_ib_mr_pool_flush_worker(struct work_struct *work) +{ + struct rds_ib_mr_pool *pool = container_of(work, struct rds_ib_mr_pool, flush_worker); + + rds_ib_flush_mr_pool(pool, 0); +} + +void rds_ib_free_mr(void *trans_private, int invalidate) +{ + struct rds_ib_mr *ibmr = trans_private; + struct rds_ib_device *rds_ibdev = ibmr->device; + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + unsigned long flags; + + rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->sg_len); + + /* Return it to the pool's free list */ + spin_lock_irqsave(&pool->list_lock, flags); + if (ibmr->remap_count >= pool->fmr_attr.max_maps) + list_add(&ibmr->list, &pool->drop_list); + else + list_add(&ibmr->list, &pool->free_list); + + atomic_add(ibmr->sg_len, &pool->free_pinned); + atomic_inc(&pool->dirty_count); + spin_unlock_irqrestore(&pool->list_lock, flags); + + /* If we've pinned too many pages, request a flush */ + if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned + || atomic_read(&pool->dirty_count) >= pool->max_items / 10) + queue_work(rds_wq, &pool->flush_worker); + + if (invalidate) { + if (likely(!in_interrupt())) { + rds_ib_flush_mr_pool(pool, 0); + } else { + /* We get here if the user created a MR marked + * as use_once and invalidate at the same time. 
*/ + queue_work(rds_wq, &pool->flush_worker); + } + } +} + +void rds_ib_flush_mrs(void) +{ + struct rds_ib_device *rds_ibdev; + + list_for_each_entry(rds_ibdev, &rds_ib_devices, list) { + struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; + + if (pool) + rds_ib_flush_mr_pool(pool, 0); + } +} + +void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, + struct rds_sock *rs, u32 *key_ret) +{ + struct rds_ib_device *rds_ibdev; + struct rds_ib_mr *ibmr = NULL; + int ret; + + rds_ibdev = rds_ib_get_device(rs->rs_bound_addr); + if (!rds_ibdev) { + ret = -ENODEV; + goto out; + } + + if (!rds_ibdev->mr_pool) { + ret = -ENODEV; + goto out; + } + + ibmr = rds_ib_alloc_fmr(rds_ibdev); + if (IS_ERR(ibmr)) + return ibmr; + + ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + if (ret == 0) + *key_ret = ibmr->fmr->rkey; + else + printk(KERN_WARNING "RDS/IB: map_fmr failed (errno=%d)\n", ret); + + ibmr->device = rds_ibdev; + + out: + if (ret) { + if (ibmr) + rds_ib_free_mr(ibmr, 0); + ibmr = ERR_PTR(ret); + } + return ibmr; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:53 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:53 -0800 Subject: [ofa-general] [PATCH 16/21] RDS/IB: Implement IB-specific datagram send. In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-17-git-send-email-andy.grover@oracle.com> Specific to IB is a credits-based flow control mechanism, in addition to the expected usage of the IB API to package outgoing data into work requests. 
Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/ib_send.c | 852 ++++++++++++++++++++++++++++++++++ 1 files changed, 852 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_send.c diff --git a/drivers/infiniband/ulp/rds/ib_send.c b/drivers/infiniband/ulp/rds/ib_send.c new file mode 100644 index 0000000..20af976 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_send.c @@ -0,0 +1,852 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "rdma.h" +#include "ib.h" + +static void rds_ib_send_rdma_complete(struct rds_message *rm, + int wc_status) +{ + int notify_status; + + switch (wc_status) { + case IB_WC_WR_FLUSH_ERR: + return; + + case IB_WC_SUCCESS: + notify_status = RDS_RDMA_SUCCESS; + break; + + case IB_WC_REM_ACCESS_ERR: + notify_status = RDS_RDMA_REMOTE_ERROR; + break; + + default: + notify_status = RDS_RDMA_OTHER_ERROR; + break; + } + rds_rdma_send_complete(rm, notify_status); +} + +static void rds_ib_send_unmap_rdma(struct rds_ib_connection *ic, + struct rds_rdma_op *op) +{ + if (op->r_mapped) { + ib_dma_unmap_sg(ic->i_cm_id->device, + op->r_sg, op->r_nents, + op->r_write ? DMA_TO_DEVICE : DMA_FROM_DEVICE); + op->r_mapped = 0; + } +} + +static void rds_ib_send_unmap_rm(struct rds_ib_connection *ic, + struct rds_ib_send_work *send, + int wc_status) +{ + struct rds_message *rm = send->s_rm; + + rdsdebug("ic %p send %p rm %p\n", ic, send, rm); + + ib_dma_unmap_sg(ic->i_cm_id->device, + rm->m_sg, rm->m_nents, + DMA_TO_DEVICE); + + if (rm->m_rdma_op != NULL) { + rds_ib_send_unmap_rdma(ic, rm->m_rdma_op); + + /* If the user asked for a completion notification on this + * message, we can implement three different semantics: + * 1. Notify when we received the ACK on the RDS message + * that was queued with the RDMA. This provides reliable + * notification of RDMA status at the expense of a one-way + * packet delay. + * 2. Notify when the IB stack gives us the completion event for + * the RDMA operation. + * 3. Notify when the IB stack gives us the completion event for + * the accompanying RDS messages. + * Here, we implement approach #3. To implement approach #2, + * call rds_rdma_send_complete from the cq_handler. To implement #1, + * don't call rds_rdma_send_complete at all, and fall back to the notify + * handling in the ACK processing code. 
+ * + * Note: There's no need to explicitly sync any RDMA buffers using + * ib_dma_sync_sg_for_cpu - the completion for the RDMA + * operation itself unmapped the RDMA buffers, which takes care + * of synching. + */ + rds_ib_send_rdma_complete(rm, wc_status); + + if (rm->m_rdma_op->r_write) + rds_stats_add(s_send_rdma_bytes, rm->m_rdma_op->r_bytes); + else + rds_stats_add(s_recv_rdma_bytes, rm->m_rdma_op->r_bytes); + } + + /* If anyone waited for this message to get flushed out, wake + * them up now */ + rds_message_unmapped(rm); + + rds_message_put(rm); + send->s_rm = NULL; +} + +void rds_ib_send_init_ring(struct rds_ib_connection *ic) +{ + struct rds_ib_send_work *send; + u32 i; + + for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) { + struct ib_sge *sge; + + send->s_rm = NULL; + send->s_op = NULL; + + send->s_wr.wr_id = i; + send->s_wr.sg_list = send->s_sge; + send->s_wr.num_sge = 1; + send->s_wr.opcode = IB_WR_SEND; + send->s_wr.send_flags = 0; + send->s_wr.ex.imm_data = 0; + + sge = rds_ib_data_sge(ic, send->s_sge); + sge->lkey = ic->i_mr->lkey; + + sge = rds_ib_header_sge(ic, send->s_sge); + sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; + } +} + +void rds_ib_send_clear_ring(struct rds_ib_connection *ic) +{ + struct rds_ib_send_work *send; + u32 i; + + for (i = 0, send = ic->i_sends; i < ic->i_send_ring.w_nr; i++, send++) { + if (send->s_wr.opcode == 0xdead) + continue; + if (send->s_rm) + rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); + if (send->s_op) + rds_ib_send_unmap_rdma(ic, send->s_op); + } +} + +/* + * The _oldest/_free ring operations here race cleanly with the alloc/unalloc + * operations performed in the send path. As the sender allocs and potentially + * unallocs the next free entry in the ring it doesn't alter which is + * the next to be freed, which is what this is concerned with. 
+ */ +void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_wc wc; + struct rds_ib_send_work *send; + u32 completed; + u32 oldest; + u32 i = 0; + int ret; + + rdsdebug("cq %p conn %p\n", cq, conn); + rds_ib_stats_inc(s_ib_tx_cq_call); + ret = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); + if (ret) + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + + while (ib_poll_cq(cq, 1, &wc) > 0) { + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", + (unsigned long long)wc.wr_id, wc.status, wc.byte_len, + be32_to_cpu(wc.ex.imm_data)); + rds_ib_stats_inc(s_ib_tx_cq_event); + + if (wc.wr_id == RDS_IB_ACK_WR_ID) { + if (ic->i_ack_queued + HZ/2 < jiffies) + rds_ib_stats_inc(s_ib_tx_stalled); + rds_ib_ack_send_complete(ic); + continue; + } + + oldest = rds_ib_ring_oldest(&ic->i_send_ring); + + completed = rds_ib_ring_completed(&ic->i_send_ring, wc.wr_id, oldest); + + for (i = 0; i < completed; i++) { + send = &ic->i_sends[oldest]; + + /* In the error case, wc.opcode sometimes contains garbage */ + switch (send->s_wr.opcode) { + case IB_WR_SEND: + if (send->s_rm) + rds_ib_send_unmap_rm(ic, send, wc.status); + break; + case IB_WR_RDMA_WRITE: + case IB_WR_RDMA_READ: + /* Nothing to be done - the SG list will be unmapped + * when the SEND completes. */ + break; + default: + if (printk_ratelimit()) + printk(KERN_NOTICE + "RDS/IB: %s: unexpected opcode 0x%x in WR!\n", + __func__, send->s_wr.opcode); + break; + } + + send->s_wr.opcode = 0xdead; + send->s_wr.num_sge = 1; + if (send->s_queued + HZ/2 < jiffies) + rds_ib_stats_inc(s_ib_tx_stalled); + + /* If a RDMA operation produced an error, signal this right + * away. If we don't, the subsequent SEND that goes with this + * RDMA will be canceled with ERR_WFLUSH, and the application + * never learns that the RDMA failed.
*/ + if (unlikely(wc.status == IB_WC_REM_ACCESS_ERR && send->s_op)) { + struct rds_message *rm; + + rm = rds_send_get_message(conn, send->s_op); + if (rm) + rds_ib_send_rdma_complete(rm, wc.status); + } + + oldest = (oldest + 1) % ic->i_send_ring.w_nr; + } + + rds_ib_ring_free(&ic->i_send_ring, completed); + + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags) + || test_bit(0, &conn->c_map_queued)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + /* We expect errors as the qp is drained during shutdown */ + if (wc.status != IB_WC_SUCCESS && rds_conn_up(conn)) { + rds_ib_conn_error(conn, + "send completion on %u.%u.%u.%u " + "had status %u, disconnecting and reconnecting\n", + NIPQUAD(conn->c_faddr), wc.status); + } + } +} + +/* + * This is the main function for allocating credits when sending + * messages. + * + * Conceptually, we have two counters: + * - send credits: this tells us how many WRs we're allowed + * to submit without overrunning the receiver's queue. For + * each SEND WR we post, we decrement this by one. + * + * - posted credits: this tells us how many WRs we recently + * posted to the receive queue. This value is transferred + * to the peer as a "credit update" in a RDS header field. + * Every time we transmit credits to the peer, we subtract + * the amount of transferred credits from this counter. + * + * It is essential that we avoid situations where both sides have + * exhausted their send credits, and are unable to send new credits + * to the peer. We achieve this by requiring that we send at least + * one credit update to the peer before exhausting our credits. + * When new credits arrive, we subtract one credit that is withheld + * until we've posted new buffers and are ready to transmit these + * credits (see rds_ib_send_add_credits below). + * + * The RDS send code is essentially single-threaded; rds_send_xmit + * grabs c_send_lock to ensure exclusive access to the send ring.
+ * However, the ACK sending code is independent and can race with + * message SENDs. + * + * In the send path, we need to update the counters for send credits + * and the counter of posted buffers atomically - when we use the + * last available credit, we cannot allow another thread to race us + * and grab the posted credits counter. Hence, we have to use a + * spinlock to protect the credit counter, or use atomics. + * + * Spinlocks shared between the send and the receive path are bad, + * because they create unnecessary delays. An early implementation + * using a spinlock showed a 5% degradation in throughput at some + * loads. + * + * This implementation avoids spinlocks completely, putting both + * counters into a single atomic, and updating that atomic using + * atomic_add (in the receive path, when receiving fresh credits), + * and using atomic_cmpxchg when updating the two counters. + */ +int rds_ib_send_grab_credits(struct rds_ib_connection *ic, + u32 wanted, u32 *adv_credits) +{ + unsigned int avail, posted, got = 0, advertise; + long oldval, newval; + + *adv_credits = 0; + if (!ic->i_flowctl) + return wanted; + +try_again: + advertise = 0; + oldval = newval = atomic_read(&ic->i_credits); + posted = IB_GET_POST_CREDITS(oldval); + avail = IB_GET_SEND_CREDITS(oldval); + + rdsdebug("rds_ib_send_grab_credits(%u): credits=%u posted=%u\n", + wanted, avail, posted); + + /* The last credit must be used to send a credit update. */ + if (avail && !posted) + avail--; + + if (avail < wanted) { + struct rds_connection *conn = ic->i_cm_id->context; + + /* Oops, there aren't that many credits left! */ + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + got = avail; + } else { + /* Sometimes you get what you want, lalala. 
*/ + got = wanted; + } + newval -= IB_SET_SEND_CREDITS(got); + + if (got && posted) { + advertise = min_t(unsigned int, posted, RDS_MAX_ADV_CREDIT); + newval -= IB_SET_POST_CREDITS(advertise); + } + + /* Finally bill everything */ + if (atomic_cmpxchg(&ic->i_credits, oldval, newval) != oldval) + goto try_again; + + *adv_credits = advertise; + return got; +} + +void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (credits == 0) + return; + + rdsdebug("rds_ib_send_add_credits(%u): current=%u%s\n", + credits, + IB_GET_SEND_CREDITS(atomic_read(&ic->i_credits)), + test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ? ", ll_send_full" : ""); + + atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits); + if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags)) + queue_delayed_work(rds_wq, &conn->c_send_w, 0); + + WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384); + + rds_ib_stats_inc(s_ib_rx_credit_updates); +} + +void rds_ib_advertise_credits(struct rds_connection *conn, unsigned int posted) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + if (posted == 0) + return; + + atomic_add(IB_SET_POST_CREDITS(posted), &ic->i_credits); + + /* Decide whether to send an update to the peer now. + * If we would send a credit update for every single buffer we + * post, we would end up with an ACK storm (ACK arrives, + * consumes buffer, we refill the ring, send ACK to remote + * advertising the newly posted buffer... ad inf) + * + * Performance pretty much depends on how often we send + * credit updates - too frequent updates mean lots of ACKs. + * Too infrequent updates, and the peer will run out of + * credits and has to throttle. + * For the time being, 16 seems to be a good compromise. 
+ */ + if (IB_GET_POST_CREDITS(atomic_read(&ic->i_credits)) >= 16) + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); +} + +static inline void +rds_ib_xmit_populate_wr(struct rds_ib_connection *ic, + struct rds_ib_send_work *send, unsigned int pos, + unsigned long buffer, unsigned int length, + int send_flags) +{ + struct ib_sge *sge; + + WARN_ON(pos != send - ic->i_sends); + + send->s_wr.send_flags = send_flags; + send->s_wr.opcode = IB_WR_SEND; + send->s_wr.num_sge = 2; + send->s_wr.next = NULL; + send->s_queued = jiffies; + send->s_op = NULL; + + if (length != 0) { + sge = rds_ib_data_sge(ic, send->s_sge); + sge->addr = buffer; + sge->length = length; + sge->lkey = ic->i_mr->lkey; + + sge = rds_ib_header_sge(ic, send->s_sge); + } else { + /* We're sending a packet with no payload. There is only + * one SGE */ + send->s_wr.num_sge = 1; + sge = &send->s_sge[0]; + } + + sge->addr = ic->i_send_hdrs_dma + (pos * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; +} + +/* + * This can be called multiple times for a given message. The first time + * we see a message we map its scatterlist into the IB device so that + * we can provide that mapped address to the IB scatter gather entries + * in the IB work requests. We translate the scatterlist into a series + * of work requests that fragment the message. These work requests complete + * in order so we pass ownership of the message to the completion handler + * once we send the final fragment. + * + * The RDS core uses the c_send_lock to only enter this function once + * per connection. This makes sure that the tx ring alloc/unalloc pairs + * don't get out of sync and confuse the ring. 
+ */ +int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, + unsigned int hdr_off, unsigned int sg, unsigned int off) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_device *dev = ic->i_cm_id->device; + struct rds_ib_send_work *send = NULL; + struct rds_ib_send_work *first; + struct rds_ib_send_work *prev; + struct ib_send_wr *failed_wr; + struct scatterlist *scat; + u32 pos; + u32 i; + u32 work_alloc; + u32 credit_alloc; + u32 adv_credits = 0; + int send_flags = 0; + int sent; + int ret; + + BUG_ON(off % RDS_FRAG_SIZE); + BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); + + /* FIXME we may overallocate here */ + if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) + i = 1; + else + i = ceil(be32_to_cpu(rm->m_inc.i_hdr.h_len), RDS_FRAG_SIZE); + + work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos); + if (work_alloc == 0) { + set_bit(RDS_LL_SEND_FULL, &conn->c_flags); + rds_ib_stats_inc(s_ib_tx_ring_full); + ret = -ENOMEM; + goto out; + } + + credit_alloc = work_alloc; + if (ic->i_flowctl) { + credit_alloc = rds_ib_send_grab_credits(ic, work_alloc, &adv_credits); + if (credit_alloc < work_alloc) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - credit_alloc); + work_alloc = credit_alloc; + } + if (work_alloc == 0) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_throttle); + ret = -ENOMEM; + goto out; + } + } + + /* map the message the first time we see it */ + if (ic->i_rm == NULL) { + /* + printk(KERN_NOTICE "rds_ib_xmit prep msg dport=%u flags=0x%x len=%d\n", + be16_to_cpu(rm->m_inc.i_hdr.h_dport), + rm->m_inc.i_hdr.h_flags, + be32_to_cpu(rm->m_inc.i_hdr.h_len)); + */ + if (rm->m_nents) { + rm->m_count = ib_dma_map_sg(dev, + rm->m_sg, rm->m_nents, DMA_TO_DEVICE); + rdsdebug("ic %p mapping rm %p: %d\n", ic, rm, rm->m_count); + if (rm->m_count == 0) { + rds_ib_stats_inc(s_ib_tx_sg_mapping_failure); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + ret = -ENOMEM; /* XXX ? 
*/ + goto out; + } + } else { + rm->m_count = 0; + } + + ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; + ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes; + rds_message_addref(rm); + ic->i_rm = rm; + + /* Finalize the header */ + if (test_bit(RDS_MSG_ACK_REQUIRED, &rm->m_flags)) + rm->m_inc.i_hdr.h_flags |= RDS_FLAG_ACK_REQUIRED; + if (test_bit(RDS_MSG_RETRANSMITTED, &rm->m_flags)) + rm->m_inc.i_hdr.h_flags |= RDS_FLAG_RETRANSMITTED; + + /* If it has a RDMA op, tell the peer we did it. This is + * used by the peer to release use-once RDMA MRs. */ + if (rm->m_rdma_op) { + struct rds_ext_header_rdma ext_hdr; + + ext_hdr.h_rdma_rkey = cpu_to_be32(rm->m_rdma_op->r_key); + rds_message_add_extension(&rm->m_inc.i_hdr, + RDS_EXTHDR_RDMA, &ext_hdr, sizeof(ext_hdr)); + } + if (rm->m_rdma_cookie) { + rds_message_add_rdma_dest_extension(&rm->m_inc.i_hdr, + rds_rdma_cookie_key(rm->m_rdma_cookie), + rds_rdma_cookie_offset(rm->m_rdma_cookie)); + } + + /* Note - rds_ib_piggyb_ack clears the ACK_REQUIRED bit, so + * we should not do this unless we have a chance of at least + * sticking the header into the send ring. Which is why we + * should call rds_ib_ring_alloc first. */ + rm->m_inc.i_hdr.h_ack = cpu_to_be64(rds_ib_piggyb_ack(ic)); + rds_message_make_checksum(&rm->m_inc.i_hdr); + } else if (ic->i_rm != rm) + BUG(); + + send = &ic->i_sends[pos]; + first = send; + prev = NULL; + scat = &rm->m_sg[sg]; + sent = 0; + i = 0; + + /* Sometimes you want to put a fence between an RDMA + * READ and the following SEND. + * We could either do this all the time + * or when requested by the user. Right now, we let + * the application choose. + */ + if (rm->m_rdma_op && rm->m_rdma_op->r_fence) + send_flags = IB_SEND_FENCE; + + /* + * We could be copying the header into the unused tail of the page. + * That would need to be changed in the future when those pages might + * be mapped userspace pages or page cache pages. 
So instead we always + * use a second sge and our long-lived ring of mapped headers. We send + * the header after the data so that the data payload can be aligned on + * the receiver. + */ + + /* handle a 0-len message */ + if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) { + rds_ib_xmit_populate_wr(ic, send, pos, 0, 0, send_flags); + goto add_header; + } + + /* if there's data reference it with a chain of work reqs */ + for (; i < work_alloc && scat != &rm->m_sg[rm->m_count]; i++) { + unsigned int len; + + send = &ic->i_sends[pos]; + + len = min(RDS_FRAG_SIZE, ib_sg_dma_len(dev, scat) - off); + rds_ib_xmit_populate_wr(ic, send, pos, + ib_sg_dma_address(dev, scat) + off, len, + send_flags); + + /* + * We want to delay signaling completions just enough to get + * the batching benefits but not so much that we create dead time + * on the wire. + */ + if (ic->i_unsignaled_wrs-- == 0) { + ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + } + + ic->i_unsignaled_bytes -= len; + if (ic->i_unsignaled_bytes <= 0) { + ic->i_unsignaled_bytes = rds_ib_sysctl_max_unsig_bytes; + send->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + } + + rdsdebug("send %p wr %p num_sge %u next %p\n", send, + &send->s_wr, send->s_wr.num_sge, send->s_wr.next); + + sent += len; + off += len; + if (off == ib_sg_dma_len(dev, scat)) { + scat++; + off = 0; + } + +add_header: + /* Tack on the header after the data. The header SGE should already + * have been set up to point to the right header buffer. 
*/ + memcpy(&ic->i_send_hdrs[pos], &rm->m_inc.i_hdr, sizeof(struct rds_header)); + + if (0) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + printk(KERN_NOTICE "send WR dport=%u flags=0x%x len=%d\n", + be16_to_cpu(hdr->h_dport), + hdr->h_flags, + be32_to_cpu(hdr->h_len)); + } + if (adv_credits) { + struct rds_header *hdr = &ic->i_send_hdrs[pos]; + + /* add credit and redo the header checksum */ + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + adv_credits = 0; + rds_ib_stats_inc(s_ib_tx_credit_updates); + } + + if (prev) + prev->s_wr.next = &send->s_wr; + prev = send; + + pos = (pos + 1) % ic->i_send_ring.w_nr; + } + + /* Account the RDS header in the number of bytes we sent, but just once. + * The caller has no concept of fragmentation. */ + if (hdr_off == 0) + sent += sizeof(struct rds_header); + + /* if we finished the message then send completion owns it */ + if (scat == &rm->m_sg[rm->m_count]) { + prev->s_rm = ic->i_rm; + prev->s_wr.send_flags |= IB_SEND_SIGNALED | IB_SEND_SOLICITED; + ic->i_rm = NULL; + } + + if (i < work_alloc) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); + work_alloc = i; + } + if (ic->i_flowctl && i < credit_alloc) + rds_ib_send_add_credits(conn, credit_alloc - i); + + /* XXX need to worry about failed_wr and partial sends. 
*/ + failed_wr = &first->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr); + rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic, + first, &first->s_wr, ret, failed_wr); + BUG_ON(failed_wr != &first->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: ib_post_send to %u.%u.%u.%u " + "returned %d\n", NIPQUAD(conn->c_faddr), ret); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + if (prev->s_rm) { + ic->i_rm = prev->s_rm; + prev->s_rm = NULL; + } + /* Finesse this later */ + BUG(); + goto out; + } + + ret = sent; +out: + BUG_ON(adv_credits); + return ret; +} + +int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_send_work *send = NULL; + struct rds_ib_send_work *first; + struct rds_ib_send_work *prev; + struct ib_send_wr *failed_wr; + struct rds_ib_device *rds_ibdev; + struct scatterlist *scat; + unsigned long len; + u64 remote_addr = op->r_remote_addr; + u32 pos; + u32 work_alloc; + u32 i; + u32 j; + int sent; + int ret; + int num_sge; + + rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); + + /* map the message the first time we see it */ + if (!op->r_mapped) { + op->r_count = ib_dma_map_sg(ic->i_cm_id->device, + op->r_sg, op->r_nents, (op->r_write) ? + DMA_TO_DEVICE : DMA_FROM_DEVICE); + rdsdebug("ic %p mapping op %p: %d\n", ic, op, op->r_count); + if (op->r_count == 0) { + rds_ib_stats_inc(s_ib_tx_sg_mapping_failure); + ret = -ENOMEM; /* XXX ? */ + goto out; + } + + op->r_mapped = 1; + } + + /* + * Instead of knowing how to return a partial rdma read/write we insist that there + * be enough work requests to send the entire message. 
+ */ + i = ceil(op->r_count, rds_ibdev->max_sge); + + work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos); + if (work_alloc != i) { + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_ring_full); + ret = -ENOMEM; + goto out; + } + + send = &ic->i_sends[pos]; + first = send; + prev = NULL; + scat = &op->r_sg[0]; + sent = 0; + num_sge = op->r_count; + + for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) { + send->s_wr.send_flags = 0; + send->s_queued = jiffies; + /* + * We want to delay signaling completions just enough to get + * the batching benefits but not so much that we create dead time on the wire. + */ + if (ic->i_unsignaled_wrs-- == 0) { + ic->i_unsignaled_wrs = rds_ib_sysctl_max_unsig_wrs; + send->s_wr.send_flags = IB_SEND_SIGNALED; + } + + send->s_wr.opcode = op->r_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ; + send->s_wr.wr.rdma.remote_addr = remote_addr; + send->s_wr.wr.rdma.rkey = op->r_key; + send->s_op = op; + + if (num_sge > rds_ibdev->max_sge) { + send->s_wr.num_sge = rds_ibdev->max_sge; + num_sge -= rds_ibdev->max_sge; + } else { + send->s_wr.num_sge = num_sge; + } + + send->s_wr.next = NULL; + + if (prev) + prev->s_wr.next = &send->s_wr; + + for (j = 0; j < send->s_wr.num_sge && scat != &op->r_sg[op->r_count]; j++) { + len = ib_sg_dma_len(ic->i_cm_id->device, scat); + send->s_sge[j].addr = + ib_sg_dma_address(ic->i_cm_id->device, scat); + send->s_sge[j].length = len; + send->s_sge[j].lkey = ic->i_mr->lkey; + + sent += len; + rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, remote_addr); + + remote_addr += len; + scat++; + } + + rdsdebug("send %p wr %p num_sge %u next %p\n", send, + &send->s_wr, send->s_wr.num_sge, send->s_wr.next); + + prev = send; + if (++send == &ic->i_sends[ic->i_send_ring.w_nr]) + send = ic->i_sends; + } + + /* if we finished the message then send completion owns it */ + if (scat == &op->r_sg[op->r_count]) + prev->s_wr.send_flags = IB_SEND_SIGNALED; + + if (i < work_alloc) 
{ + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc - i); + work_alloc = i; + } + + failed_wr = &first->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &first->s_wr, &failed_wr); + rdsdebug("ic %p first %p (wr %p) ret %d wr %p\n", ic, + first, &first->s_wr, ret, failed_wr); + BUG_ON(failed_wr != &first->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: rdma ib_post_send to %u.%u.%u.%u " + "returned %d\n", NIPQUAD(conn->c_faddr), ret); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + goto out; + } + + if (unlikely(failed_wr != &first->s_wr)) { + printk(KERN_WARNING "RDS/IB: ib_post_send() rc=%d, but failed_wqe updated!\n", ret); + BUG_ON(failed_wr != &first->s_wr); + } + + +out: + return ret; +} + +void rds_ib_xmit_complete(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + + /* We may have a pending ACK or window update we were unable + * to send previously (due to flow control). Try again. */ + rds_ib_attempt_ack(ic); +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:54 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:54 -0800 Subject: [ofa-general] [PATCH 17/21] RDS/IB: Receive datagrams via IB In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-18-git-send-email-andy.grover@oracle.com> Header parsing, ring refill. It puts the incoming data into an rds_incoming struct, which is passed up to rds-core. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/ib_recv.c | 894 ++++++++++++++++++++++++++++++++++ 1 files changed, 894 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c diff --git a/drivers/infiniband/ulp/rds/ib_recv.c b/drivers/infiniband/ulp/rds/ib_recv.c new file mode 100644 index 0000000..516f858 --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_recv.c @@ -0,0 +1,894 @@ +/* + * Copyright (c) 2006 Oracle. 
All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +static struct kmem_cache *rds_ib_incoming_slab; +static struct kmem_cache *rds_ib_frag_slab; +static atomic_t rds_ib_allocation = ATOMIC_INIT(0); + +static void rds_ib_frag_drop_page(struct rds_page_frag *frag) +{ + rdsdebug("frag %p page %p\n", frag, frag->f_page); + __free_page(frag->f_page); + frag->f_page = NULL; +} + +static void rds_ib_frag_free(struct rds_page_frag *frag) +{ + rdsdebug("frag %p page %p\n", frag, frag->f_page); + BUG_ON(frag->f_page != NULL); + kmem_cache_free(rds_ib_frag_slab, frag); +} + +/* + * We map a page at a time. 
Its fragments are posted in order. This + * is called in fragment order as the fragments get send completion events. + * Only the last frag in the page performs the unmapping. + * + * It's OK for ring cleanup to call this in whatever order it likes because + * DMA is not in flight and so we can unmap while other ring entries still + * hold page references in their frags. + */ +static void rds_ib_recv_unmap_page(struct rds_ib_connection *ic, + struct rds_ib_recv_work *recv) +{ + struct rds_page_frag *frag = recv->r_frag; + + rdsdebug("recv %p frag %p page %p\n", recv, frag, frag->f_page); + if (frag->f_mapped) + ib_dma_unmap_page(ic->i_cm_id->device, + frag->f_mapped, + RDS_FRAG_SIZE, DMA_FROM_DEVICE); + frag->f_mapped = 0; +} + +void rds_ib_recv_init_ring(struct rds_ib_connection *ic) +{ + struct rds_ib_recv_work *recv; + u32 i; + + for (i = 0, recv = ic->i_recvs; i < ic->i_recv_ring.w_nr; i++, recv++) { + struct ib_sge *sge; + + recv->r_ibinc = NULL; + recv->r_frag = NULL; + + recv->r_wr.next = NULL; + recv->r_wr.wr_id = i; + recv->r_wr.sg_list = recv->r_sge; + recv->r_wr.num_sge = RDS_IB_RECV_SGE; + + sge = rds_ib_data_sge(ic, recv->r_sge); + sge->addr = 0; + sge->length = RDS_FRAG_SIZE; + sge->lkey = ic->i_mr->lkey; + + sge = rds_ib_header_sge(ic, recv->r_sge); + sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header)); + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; + } +} + +static void rds_ib_recv_clear_one(struct rds_ib_connection *ic, + struct rds_ib_recv_work *recv) +{ + if (recv->r_ibinc) { + rds_inc_put(&recv->r_ibinc->ii_inc); + recv->r_ibinc = NULL; + } + if (recv->r_frag) { + rds_ib_recv_unmap_page(ic, recv); + if (recv->r_frag->f_page) + rds_ib_frag_drop_page(recv->r_frag); + rds_ib_frag_free(recv->r_frag); + recv->r_frag = NULL; + } +} + +void rds_ib_recv_clear_ring(struct rds_ib_connection *ic) +{ + u32 i; + + for (i = 0; i < ic->i_recv_ring.w_nr; i++) + rds_ib_recv_clear_one(ic, &ic->i_recvs[i]); + + if 
(ic->i_frag.f_page) + rds_ib_frag_drop_page(&ic->i_frag); +} + +static int rds_ib_recv_refill_one(struct rds_connection *conn, + struct rds_ib_recv_work *recv, + gfp_t kptr_gfp, gfp_t page_gfp) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + dma_addr_t dma_addr; + struct ib_sge *sge; + int ret = -ENOMEM; + + if (recv->r_ibinc == NULL) { + if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) { + rds_ib_stats_inc(s_ib_rx_alloc_limit); + goto out; + } + recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab, + kptr_gfp); + if (recv->r_ibinc == NULL) + goto out; + atomic_inc(&rds_ib_allocation); + INIT_LIST_HEAD(&recv->r_ibinc->ii_frags); + rds_inc_init(&recv->r_ibinc->ii_inc, conn, conn->c_faddr); + } + + if (recv->r_frag == NULL) { + recv->r_frag = kmem_cache_alloc(rds_ib_frag_slab, kptr_gfp); + if (recv->r_frag == NULL) + goto out; + INIT_LIST_HEAD(&recv->r_frag->f_item); + recv->r_frag->f_page = NULL; + } + + if (ic->i_frag.f_page == NULL) { + ic->i_frag.f_page = alloc_page(page_gfp); + if (ic->i_frag.f_page == NULL) + goto out; + ic->i_frag.f_offset = 0; + } + + dma_addr = ib_dma_map_page(ic->i_cm_id->device, + ic->i_frag.f_page, + ic->i_frag.f_offset, + RDS_FRAG_SIZE, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(ic->i_cm_id->device, dma_addr)) + goto out; + + /* + * Once we get the RDS_PAGE_LAST_OFF frag then rds_ib_frag_unmap() + * must be called on this recv. This happens as completions hit + * in order or on connection shutdown. 
+ */ + recv->r_frag->f_page = ic->i_frag.f_page; + recv->r_frag->f_offset = ic->i_frag.f_offset; + recv->r_frag->f_mapped = dma_addr; + + sge = rds_ib_data_sge(ic, recv->r_sge); + sge->addr = dma_addr; + sge->length = RDS_FRAG_SIZE; + + sge = rds_ib_header_sge(ic, recv->r_sge); + sge->addr = ic->i_recv_hdrs_dma + (recv - ic->i_recvs) * sizeof(struct rds_header); + sge->length = sizeof(struct rds_header); + + get_page(recv->r_frag->f_page); + + if (ic->i_frag.f_offset < RDS_PAGE_LAST_OFF) { + ic->i_frag.f_offset += RDS_FRAG_SIZE; + } else { + put_page(ic->i_frag.f_page); + ic->i_frag.f_page = NULL; + ic->i_frag.f_offset = 0; + } + + ret = 0; +out: + return ret; +} + +/* + * This tries to allocate and post unused work requests after making sure that + * they have all the allocations they need to queue received fragments into + * sockets. The i_recv_mutex is held here so that ring_alloc and _unalloc + * pairs don't go unmatched. + * + * -1 is returned if posting fails due to temporary resource exhaustion. + */ +int rds_ib_recv_refill(struct rds_connection *conn, gfp_t kptr_gfp, + gfp_t page_gfp, int prefill) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_recv_work *recv; + struct ib_recv_wr *failed_wr; + unsigned int posted = 0; + int ret = 0; + u32 pos; + + while ((prefill || rds_conn_up(conn)) + && rds_ib_ring_alloc(&ic->i_recv_ring, 1, &pos)) { + if (pos >= ic->i_recv_ring.w_nr) { + printk(KERN_NOTICE "Argh - ring alloc returned pos=%u\n", + pos); + ret = -EINVAL; + break; + } + + recv = &ic->i_recvs[pos]; + ret = rds_ib_recv_refill_one(conn, recv, kptr_gfp, page_gfp); + if (ret) { + ret = -1; + break; + } + + /* XXX when can this fail? 
*/ + ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr); + rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv, + recv->r_ibinc, recv->r_frag->f_page, + (long) recv->r_frag->f_mapped, ret); + if (ret) { + rds_ib_conn_error(conn, "recv post on " + "%u.%u.%u.%u returned %d, disconnecting and " + "reconnecting\n", NIPQUAD(conn->c_faddr), + ret); + ret = -1; + break; + } + + posted++; + } + + /* We're doing flow control - update the window. */ + if (ic->i_flowctl && posted) + rds_ib_advertise_credits(conn, posted); + + if (ret) + rds_ib_ring_unalloc(&ic->i_recv_ring, 1); + return ret; +} + +void rds_ib_inc_purge(struct rds_incoming *inc) +{ + struct rds_ib_incoming *ibinc; + struct rds_page_frag *frag; + struct rds_page_frag *pos; + + ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); + rdsdebug("purging ibinc %p inc %p\n", ibinc, inc); + + list_for_each_entry_safe(frag, pos, &ibinc->ii_frags, f_item) { + list_del_init(&frag->f_item); + rds_ib_frag_drop_page(frag); + rds_ib_frag_free(frag); + } +} + +void rds_ib_inc_free(struct rds_incoming *inc) +{ + struct rds_ib_incoming *ibinc; + + ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); + + rds_ib_inc_purge(inc); + rdsdebug("freeing ibinc %p inc %p\n", ibinc, inc); + BUG_ON(!list_empty(&ibinc->ii_frags)); + kmem_cache_free(rds_ib_incoming_slab, ibinc); + atomic_dec(&rds_ib_allocation); + BUG_ON(atomic_read(&rds_ib_allocation) < 0); +} + +int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iovec *first_iov, + size_t size) +{ + struct rds_ib_incoming *ibinc; + struct rds_page_frag *frag; + struct iovec *iov = first_iov; + unsigned long to_copy; + unsigned long frag_off = 0; + unsigned long iov_off = 0; + int copied = 0; + int ret; + u32 len; + + ibinc = container_of(inc, struct rds_ib_incoming, ii_inc); + frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); + len = be32_to_cpu(inc->i_hdr.h_len); + + while (copied < size && copied < len) { + if (frag_off == 
RDS_FRAG_SIZE) { + frag = list_entry(frag->f_item.next, + struct rds_page_frag, f_item); + frag_off = 0; + } + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, RDS_FRAG_SIZE - frag_off); + to_copy = min_t(size_t, to_copy, size - copied); + to_copy = min_t(unsigned long, to_copy, len - copied); + + rdsdebug("%lu bytes to user [%p, %zu] + %lu from frag " + "[%p, %lu] + %lu\n", + to_copy, iov->iov_base, iov->iov_len, iov_off, + frag->f_page, frag->f_offset, frag_off); + + /* XXX needs + offset for multiple recvs per page */ + ret = rds_page_copy_to_user(frag->f_page, + frag->f_offset + frag_off, + iov->iov_base + iov_off, + to_copy); + if (ret) { + copied = ret; + break; + } + + iov_off += to_copy; + frag_off += to_copy; + copied += to_copy; + } + + return copied; +} + +/* ic starts out kzalloc()ed */ +void rds_ib_recv_init_ack(struct rds_ib_connection *ic) +{ + struct ib_send_wr *wr = &ic->i_ack_wr; + struct ib_sge *sge = &ic->i_ack_sge; + + sge->addr = ic->i_ack_dma; + sge->length = sizeof(struct rds_header); + sge->lkey = ic->i_mr->lkey; + + wr->sg_list = sge; + wr->num_sge = 1; + wr->opcode = IB_WR_SEND; + wr->wr_id = RDS_IB_ACK_WR_ID; + wr->send_flags = IB_SEND_SIGNALED | IB_SEND_SOLICITED; +} + +/* + * You'd think that with reliable IB connections you wouldn't need to ack + * messages that have been received. The problem is that IB hardware generates + * an ack message before it has DMAed the message into memory. This creates a + * potential message loss if the HCA is disabled for any reason between when it + * sends the ack and before the message is DMAed and processed. This is only a + * potential issue if another HCA is available for fail-over. + * + * When the remote host receives our ack they'll free the sent message from + * their send queue. To decrease the latency of this we always send an ack + * immediately after we've received messages. 
+ * + * For simplicity, we only have one ack in flight at a time. This puts + * pressure on senders to have deep enough send queues to absorb the latency of + * a single ack frame being in flight. This might not be good enough. + * + * This is implemented by having a long-lived send_wr and sge which point to a + * statically allocated ack frame. This ack wr does not fall under the ring + * accounting that the tx and rx wrs do. The QP attribute specifically makes + * room for it beyond the ring size. Send completion notices its special + * wr_id and avoids working with the ring in that case. + */ +#ifndef KERNEL_HAS_ATOMIC64 +static void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, + int ack_required) +{ + unsigned long flags; + + spin_lock_irqsave(&ic->i_ack_lock, flags); + ic->i_ack_next = seq; + if (ack_required) + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + spin_unlock_irqrestore(&ic->i_ack_lock, flags); +} + +static u64 rds_ib_get_ack(struct rds_ib_connection *ic) +{ + unsigned long flags; + u64 seq; + + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + + spin_lock_irqsave(&ic->i_ack_lock, flags); + seq = ic->i_ack_next; + spin_unlock_irqrestore(&ic->i_ack_lock, flags); + + return seq; +} +#else +static void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, + int ack_required) +{ + atomic64_set(&ic->i_ack_next, seq); + if (ack_required) { + smp_mb__before_clear_bit(); + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + } +} + +static u64 rds_ib_get_ack(struct rds_ib_connection *ic) +{ + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + smp_mb__after_clear_bit(); + + return atomic64_read(&ic->i_ack_next); +} +#endif + + +static void rds_ib_send_ack(struct rds_ib_connection *ic, unsigned int adv_credits) +{ + struct rds_header *hdr = ic->i_ack; + struct ib_send_wr *failed_wr; + u64 seq; + int ret; + + seq = rds_ib_get_ack(ic); + + rdsdebug("send_ack: ic %p ack %llu\n", ic, (unsigned long long) seq); + rds_message_populate_header(hdr, 0, 0, 0); +
hdr->h_ack = cpu_to_be64(seq); + hdr->h_credit = adv_credits; + rds_message_make_checksum(hdr); + ic->i_ack_queued = jiffies; + + ret = ib_post_send(ic->i_cm_id->qp, &ic->i_ack_wr, &failed_wr); + if (unlikely(ret)) { + /* Failed to send. Release the WR, and + * force another ACK. + */ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + set_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + + rds_ib_stats_inc(s_ib_ack_send_failure); + /* Need to finesse this later. */ + BUG(); + } else + rds_ib_stats_inc(s_ib_ack_sent); +} + +/* + * There are 3 ways of getting acknowledgements to the peer: + * 1. We call rds_ib_attempt_ack from the recv completion handler + * to send an ACK-only frame. + * However, there can be only one such frame in the send queue + * at any time, so we may have to postpone it. + * 2. When another (data) packet is transmitted while there's + * an ACK in the queue, we piggyback the ACK sequence number + * on the data packet. + * 3. If the ACK WR is done sending, we get called from the + * send queue completion handler, and check whether there's + * another ACK pending (postponed because the WR was on the + * queue). If so, we transmit it. + * + * We maintain 2 variables: + * - i_ack_flags, which keeps track of whether the ACK WR + * is currently in the send queue or not (IB_ACK_IN_FLIGHT) + * - i_ack_next, which is the last sequence number we received + * + * Potentially, send queue and receive queue handlers can run concurrently. + * It would be nice to not have to use a spinlock to synchronize things, + * but the one problem that rules this out is that 64bit updates are + * not atomic on all platforms. Things would be a lot simpler if + * we had atomic64 or maybe cmpxchg64 everywhere. + * + * Reconnecting complicates this picture just slightly. When we + * reconnect, we may be seeing duplicate packets. The peer + * is retransmitting them, because it hasn't seen an ACK for + * them. It is important that we ACK these. 
+ * + * ACK mitigation adds a header flag "ACK_REQUIRED"; any packet with + * this flag set *MUST* be acknowledged immediately. + */ + +/* + * When we get here, we're called from the recv queue handler. + * Check whether we ought to transmit an ACK. + */ +void rds_ib_attempt_ack(struct rds_ib_connection *ic) +{ + unsigned int adv_credits; + + if (!test_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) + return; + + if (test_and_set_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags)) { + rds_ib_stats_inc(s_ib_ack_send_delayed); + return; + } + + /* Can we get a send credit? */ + if (!rds_ib_send_grab_credits(ic, 1, &adv_credits)) { + rds_ib_stats_inc(s_ib_tx_throttle); + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + return; + } + + clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags); + rds_ib_send_ack(ic, adv_credits); +} + +/* + * We get here from the send completion handler, when the + * adapter tells us the ACK frame was sent. + */ +void rds_ib_ack_send_complete(struct rds_ib_connection *ic) +{ + clear_bit(IB_ACK_IN_FLIGHT, &ic->i_ack_flags); + rds_ib_attempt_ack(ic); +} + +/* + * This is called by the regular xmit code when it wants to piggyback + * an ACK on an outgoing frame. + */ +u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic) +{ + if (test_and_clear_bit(IB_ACK_REQUESTED, &ic->i_ack_flags)) + rds_ib_stats_inc(s_ib_ack_send_piggybacked); + return rds_ib_get_ack(ic); +} + +/* + * It's kind of lame that we're copying from the posted receive pages into + * long-lived bitmaps. We could have posted the bitmaps and rdma written into + * them. But receiving new congestion bitmaps should be a *rare* event, so + * hopefully we won't need to invest that complexity in making it more + * efficient. By copying we can share a simpler core with TCP which has to + * copy. 
+ */ +static void rds_ib_cong_recv(struct rds_connection *conn, + struct rds_ib_incoming *ibinc) +{ + struct rds_cong_map *map; + unsigned int map_off; + unsigned int map_page; + struct rds_page_frag *frag; + unsigned long frag_off; + unsigned long to_copy; + unsigned long copied; + uint64_t uncongested = 0; + void *addr; + + /* catch completely corrupt packets */ + if (be32_to_cpu(ibinc->ii_inc.i_hdr.h_len) != RDS_CONG_MAP_BYTES) + return; + + map = conn->c_fcong; + map_page = 0; + map_off = 0; + + frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item); + frag_off = 0; + + copied = 0; + + while (copied < RDS_CONG_MAP_BYTES) { + uint64_t *src, *dst; + unsigned int k; + + to_copy = min(RDS_FRAG_SIZE - frag_off, PAGE_SIZE - map_off); + BUG_ON(to_copy & 7); /* Must be 64bit aligned. */ + + addr = kmap_atomic(frag->f_page, KM_SOFTIRQ0); + + src = addr + frag_off; + dst = (void *)map->m_page_addrs[map_page] + map_off; + for (k = 0; k < to_copy; k += 8) { + /* Record ports that became uncongested, ie + * bits that changed from 0 to 1. */ + uncongested |= ~(*src) & *dst; + *dst++ = *src++; + } + kunmap_atomic(addr, KM_SOFTIRQ0); + + copied += to_copy; + + map_off += to_copy; + if (map_off == PAGE_SIZE) { + map_off = 0; + map_page++; + } + + frag_off += to_copy; + if (frag_off == RDS_FRAG_SIZE) { + frag = list_entry(frag->f_item.next, + struct rds_page_frag, f_item); + frag_off = 0; + } + } + + /* the congestion map is in little endian order */ + uncongested = le64_to_cpu(uncongested); + + rds_cong_map_updated(map, uncongested); +} + +/* + * Rings are posted with all the allocations they'll need to queue the + * incoming message to the receiving socket so this can't fail. + * All fragments start with a header, so we can make sure we're not receiving + * garbage, and we can tell a small 8 byte fragment from an ACK frame. 
+ */ +struct rds_ib_ack_state { + u64 ack_next; + u64 ack_recv; + unsigned int ack_required:1; + unsigned int ack_next_valid:1; + unsigned int ack_recv_valid:1; +}; + +static void rds_ib_process_recv(struct rds_connection *conn, + struct rds_ib_recv_work *recv, u32 byte_len, + struct rds_ib_ack_state *state) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + struct rds_ib_incoming *ibinc = ic->i_ibinc; + struct rds_header *ihdr, *hdr; + + /* XXX shut down the connection if port 0,0 are seen? */ + + rdsdebug("ic %p ibinc %p recv %p byte len %u\n", ic, ibinc, recv, + byte_len); + + if (byte_len < sizeof(struct rds_header)) { + rds_ib_conn_error(conn, "incoming message " + "from %u.%u.%u.%u didn't include a " + "header, disconnecting and " + "reconnecting\n", + NIPQUAD(conn->c_faddr)); + return; + } + byte_len -= sizeof(struct rds_header); + + ihdr = &ic->i_recv_hdrs[recv - ic->i_recvs]; + + /* Validate the checksum. */ + if (!rds_message_verify_checksum(ihdr)) { + rds_ib_conn_error(conn, "incoming message " + "from %u.%u.%u.%u has corrupted header - " + "forcing a reconnect\n", + NIPQUAD(conn->c_faddr)); + rds_stats_inc(s_recv_drop_bad_checksum); + return; + } + + /* Process the ACK sequence which comes with every packet */ + state->ack_recv = be64_to_cpu(ihdr->h_ack); + state->ack_recv_valid = 1; + + /* Process the credits update if there was one */ + if (ihdr->h_credit) + rds_ib_send_add_credits(conn, ihdr->h_credit); + + if (ihdr->h_sport == 0 && ihdr->h_dport == 0 && byte_len == 0) { + /* This is an ACK-only packet. The fact that it gets + * special treatment here is that historically, ACKs + * were rather special beasts. + */ + rds_ib_stats_inc(s_ib_ack_received); + + /* + * Usually the frags make their way on to incs and are then freed as + * the inc is freed. We don't go that route, so we have to drop the + * page ref ourselves.
We can't just leave the page on the recv + * because that confuses the dma mapping of pages and each recv's use + * of a partial page. We can leave the frag, though, it will be + * reused. + * + * FIXME: Fold this into the code path below. + */ + rds_ib_frag_drop_page(recv->r_frag); + return; + } + + /* + * If we don't already have an inc on the connection then this + * fragment has a header and starts a message.. copy its header + * into the inc and save the inc so we can hang upcoming fragments + * off its list. + */ + if (ibinc == NULL) { + ibinc = recv->r_ibinc; + recv->r_ibinc = NULL; + ic->i_ibinc = ibinc; + + hdr = &ibinc->ii_inc.i_hdr; + memcpy(hdr, ihdr, sizeof(*hdr)); + ic->i_recv_data_rem = be32_to_cpu(hdr->h_len); + + rdsdebug("ic %p ibinc %p rem %u flag 0x%x\n", ic, ibinc, + ic->i_recv_data_rem, hdr->h_flags); + } else { + hdr = &ibinc->ii_inc.i_hdr; + /* We can't just use memcmp here; fragments of a + * single message may carry different ACKs */ + if (hdr->h_sequence != ihdr->h_sequence + || hdr->h_len != ihdr->h_len + || hdr->h_sport != ihdr->h_sport + || hdr->h_dport != ihdr->h_dport) { + rds_ib_conn_error(conn, + "fragment header mismatch; forcing reconnect\n"); + return; + } + } + + list_add_tail(&recv->r_frag->f_item, &ibinc->ii_frags); + recv->r_frag = NULL; + + if (ic->i_recv_data_rem > RDS_FRAG_SIZE) + ic->i_recv_data_rem -= RDS_FRAG_SIZE; + else { + ic->i_recv_data_rem = 0; + ic->i_ibinc = NULL; + + if (ibinc->ii_inc.i_hdr.h_flags == RDS_FLAG_CONG_BITMAP) + rds_ib_cong_recv(conn, ibinc); + else { + rds_recv_incoming(conn, conn->c_faddr, conn->c_laddr, + &ibinc->ii_inc, GFP_ATOMIC, + KM_SOFTIRQ0); + state->ack_next = be64_to_cpu(hdr->h_sequence); + state->ack_next_valid = 1; + } + + /* Evaluate the ACK_REQUIRED flag *after* we received + * the complete frame, and after bumping the next_rx + * sequence. 
*/ + if (hdr->h_flags & RDS_FLAG_ACK_REQUIRED) { + rds_stats_inc(s_recv_ack_required); + state->ack_required = 1; + } + + rds_inc_put(&ibinc->ii_inc); + } +} + +/* + * Plucking the oldest entry from the ring can be done concurrently with + * the thread refilling the ring. Each ring operation is protected by + * spinlocks and the transient state of refilling doesn't change the + * recording of which entry is oldest. + * + * This relies on IB only calling one cq comp_handler for each cq so that + * there will only be one caller of rds_recv_incoming() per RDS connection. + */ +void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context) +{ + struct rds_connection *conn = context; + struct rds_ib_connection *ic = conn->c_transport_data; + struct ib_wc wc; + struct rds_ib_ack_state state = { 0, }; + struct rds_ib_recv_work *recv; + + rdsdebug("conn %p cq %p\n", conn, cq); + + rds_ib_stats_inc(s_ib_rx_cq_call); + + ib_req_notify_cq(cq, IB_CQ_SOLICITED); + + while (ib_poll_cq(cq, 1, &wc) > 0) { + rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", + (unsigned long long)wc.wr_id, wc.status, wc.byte_len, + be32_to_cpu(wc.ex.imm_data)); + rds_ib_stats_inc(s_ib_rx_cq_event); + + recv = &ic->i_recvs[rds_ib_ring_oldest(&ic->i_recv_ring)]; + + rds_ib_recv_unmap_page(ic, recv); + + if (rds_conn_up(conn)) { + /* We expect errors as the qp is drained during shutdown */ + if (wc.status == IB_WC_SUCCESS) { + rds_ib_process_recv(conn, recv, wc.byte_len, &state); + } else { + rds_ib_conn_error(conn, "recv completion on " + "%u.%u.%u.%u had status %u, disconnecting and " + "reconnecting\n", NIPQUAD(conn->c_faddr), + wc.status); + } + } + + rds_ib_ring_free(&ic->i_recv_ring, 1); + } + + if (state.ack_next_valid) + rds_ib_set_ack(ic, state.ack_next, state.ack_required); + if (state.ack_recv_valid && state.ack_recv > ic->i_ack_recv) { + rds_send_drop_acked(conn, state.ack_recv, NULL); + ic->i_ack_recv = state.ack_recv; + } + if (rds_conn_up(conn)) + rds_ib_attempt_ack(ic); 
+ + /* If we ever end up with a really empty receive ring, we're + * in deep trouble, as the sender will definitely see RNR + * timeouts. */ + if (rds_ib_ring_empty(&ic->i_recv_ring)) + rds_ib_stats_inc(s_ib_rx_ring_empty); + + /* + * If the ring is running low, then schedule the thread to refill. + */ + if (rds_ib_ring_low(&ic->i_recv_ring)) + queue_delayed_work(rds_wq, &conn->c_recv_w, 0); +} + +int rds_ib_recv(struct rds_connection *conn) +{ + struct rds_ib_connection *ic = conn->c_transport_data; + int ret = 0; + + rdsdebug("conn %p\n", conn); + + /* + * If we get a temporary posting failure in this context then + * we're really low and we want the caller to back off for a bit. + */ + mutex_lock(&ic->i_recv_mutex); + if (rds_ib_recv_refill(conn, GFP_KERNEL, GFP_HIGHUSER, 0)) + ret = -ENOMEM; + else + rds_ib_stats_inc(s_ib_rx_refill_from_thread); + mutex_unlock(&ic->i_recv_mutex); + + return ret; +} + +int __init rds_ib_recv_init(void) +{ + struct sysinfo si; + int ret = -ENOMEM; + + /* Default to one third of all available RAM for recv memory */ + si_meminfo(&si); + rds_ib_sysctl_max_recv_allocation = si.totalram / 3 * PAGE_SIZE / RDS_FRAG_SIZE; + + rds_ib_incoming_slab = kmem_cache_create("rds_ib_incoming", + sizeof(struct rds_ib_incoming), + 0, 0, NULL); + if (rds_ib_incoming_slab == NULL) + goto out; + + rds_ib_frag_slab = kmem_cache_create("rds_ib_frag", + sizeof(struct rds_page_frag), + 0, 0, NULL); + if (rds_ib_frag_slab == NULL) + kmem_cache_destroy(rds_ib_incoming_slab); + else + ret = 0; +out: + return ret; +} + +void rds_ib_recv_exit(void) +{ + kmem_cache_destroy(rds_ib_incoming_slab); + kmem_cache_destroy(rds_ib_frag_slab); +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:46 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:46 -0800 Subject: [ofa-general] [PATCH 09/21] RDS: Message parsing In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References:
<1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-10-git-send-email-andy.grover@oracle.com> Parsing of newly-received RDS message headers (including ext. headers) and copy-to/from-user routines. page.c implements a per-cpu page remainder cache, to reduce the number of allocations needed for small datagrams. Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/message.c | 414 ++++++++++++++++++++++++++++++++++ drivers/infiniband/ulp/rds/page.c | 222 ++++++++++++++++++ 2 files changed, 636 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/message.c create mode 100644 drivers/infiniband/ulp/rds/page.c diff --git a/drivers/infiniband/ulp/rds/message.c b/drivers/infiniband/ulp/rds/message.c new file mode 100644 index 0000000..5cad4d5 --- /dev/null +++ b/drivers/infiniband/ulp/rds/message.c @@ -0,0 +1,414 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" +#include "rdma.h" + +static DECLARE_WAIT_QUEUE_HEAD(rds_message_flush_waitq); + +static unsigned int rds_exthdr_size[__RDS_EXTHDR_MAX] = { +[RDS_EXTHDR_NONE] = 0, +[RDS_EXTHDR_VERSION] = sizeof(struct rds_ext_header_version), +[RDS_EXTHDR_RDMA] = sizeof(struct rds_ext_header_rdma), +[RDS_EXTHDR_RDMA_DEST] = sizeof(struct rds_ext_header_rdma_dest), +}; + + +void rds_message_addref(struct rds_message *rm) +{ + rdsdebug("addref rm %p ref %d\n", rm, atomic_read(&rm->m_refcount)); + atomic_inc(&rm->m_refcount); +} +EXPORT_SYMBOL_GPL(rds_message_addref); + +/* + * This relies on dma_map_sg() not touching sg[].page during merging. + */ +static void rds_message_purge(struct rds_message *rm) +{ + unsigned long i; + + if (unlikely(test_bit(RDS_MSG_PAGEVEC, &rm->m_flags))) + return; + + for (i = 0; i < rm->m_nents; i++) { + rdsdebug("putting data page %p\n", (void *)sg_page(&rm->m_sg[i])); + /* XXX will have to put_page for page refs */ + __free_page(sg_page(&rm->m_sg[i])); + } + rm->m_nents = 0; + + if (rm->m_rdma_op) + rds_rdma_free_op(rm->m_rdma_op); + if (rm->m_rdma_mr) + rds_mr_put(rm->m_rdma_mr); +} + +void rds_message_inc_purge(struct rds_incoming *inc) +{ + struct rds_message *rm = container_of(inc, struct rds_message, m_inc); + rds_message_purge(rm); +} + +void rds_message_put(struct rds_message *rm) +{ + rdsdebug("put rm %p ref %d\n", rm, atomic_read(&rm->m_refcount)); + + if (atomic_dec_and_test(&rm->m_refcount)) { + BUG_ON(!list_empty(&rm->m_sock_item)); + BUG_ON(!list_empty(&rm->m_conn_item)); + rds_message_purge(rm); + + kfree(rm); + } +} +EXPORT_SYMBOL_GPL(rds_message_put); + +void rds_message_inc_free(struct rds_incoming *inc) +{ + struct rds_message 
*rm = container_of(inc, struct rds_message, m_inc); + rds_message_put(rm); +} + +void rds_message_populate_header(struct rds_header *hdr, __be16 sport, + __be16 dport, u64 seq) +{ + hdr->h_flags = 0; + hdr->h_sport = sport; + hdr->h_dport = dport; + hdr->h_sequence = cpu_to_be64(seq); + hdr->h_exthdr[0] = RDS_EXTHDR_NONE; +} +EXPORT_SYMBOL_GPL(rds_message_populate_header); + +int rds_message_add_extension(struct rds_header *hdr, + unsigned int type, const void *data, unsigned int len) +{ + unsigned int ext_len = sizeof(u8) + len; + unsigned char *dst; + + /* For now, refuse to add more than one extension header */ + if (hdr->h_exthdr[0] != RDS_EXTHDR_NONE) + return 0; + + if (type >= __RDS_EXTHDR_MAX + || len != rds_exthdr_size[type]) + return 0; + + if (ext_len >= RDS_HEADER_EXT_SPACE) + return 0; + dst = hdr->h_exthdr; + + *dst++ = type; + memcpy(dst, data, len); + + dst[len] = RDS_EXTHDR_NONE; + return 1; +} +EXPORT_SYMBOL_GPL(rds_message_add_extension); + +/* + * If a message has extension headers, retrieve them here. + * Call like this: + * + * unsigned int pos = 0; + * + * while (1) { + * buflen = sizeof(buffer); + * type = rds_message_next_extension(hdr, &pos, buffer, &buflen); + * if (type == RDS_EXTHDR_NONE) + * break; + * ... + * } + */ +int rds_message_next_extension(struct rds_header *hdr, + unsigned int *pos, void *buf, unsigned int *buflen) +{ + unsigned int offset, ext_type, ext_len; + u8 *src = hdr->h_exthdr; + + offset = *pos; + if (offset >= RDS_HEADER_EXT_SPACE) + goto none; + + /* Get the extension type and length. For now, the + * length is implied by the extension type. 
*/ + ext_type = src[offset++]; + + if (ext_type == RDS_EXTHDR_NONE || ext_type >= __RDS_EXTHDR_MAX) + goto none; + ext_len = rds_exthdr_size[ext_type]; + if (offset + ext_len > RDS_HEADER_EXT_SPACE) + goto none; + + *pos = offset + ext_len; + if (ext_len < *buflen) + *buflen = ext_len; + memcpy(buf, src + offset, *buflen); + return ext_type; + +none: + *pos = RDS_HEADER_EXT_SPACE; + *buflen = 0; + return RDS_EXTHDR_NONE; +} + +int rds_message_add_version_extension(struct rds_header *hdr, unsigned int version) +{ + struct rds_ext_header_version ext_hdr; + + ext_hdr.h_version = cpu_to_be32(version); + return rds_message_add_extension(hdr, RDS_EXTHDR_VERSION, &ext_hdr, sizeof(ext_hdr)); +} + +int rds_message_get_version_extension(struct rds_header *hdr, unsigned int *version) +{ + struct rds_ext_header_version ext_hdr; + unsigned int pos = 0, len = sizeof(ext_hdr); + + /* We assume the version extension is the only one present */ + if (rds_message_next_extension(hdr, &pos, &ext_hdr, &len) != RDS_EXTHDR_VERSION) + return 0; + *version = be32_to_cpu(ext_hdr.h_version); + return 1; +} + +int rds_message_add_rdma_dest_extension(struct rds_header *hdr, u32 r_key, u32 offset) +{ + struct rds_ext_header_rdma_dest ext_hdr; + + ext_hdr.h_rdma_rkey = cpu_to_be32(r_key); + ext_hdr.h_rdma_offset = cpu_to_be32(offset); + return rds_message_add_extension(hdr, RDS_EXTHDR_RDMA_DEST, &ext_hdr, sizeof(ext_hdr)); +} +EXPORT_SYMBOL_GPL(rds_message_add_rdma_dest_extension); + +struct rds_message *rds_message_alloc(unsigned int nents, gfp_t gfp) +{ + struct rds_message *rm; + + rm = kzalloc(sizeof(struct rds_message) + + (nents * sizeof(struct scatterlist)), gfp); + if (!rm) + goto out; + +#ifdef CONFIG_DEBUG_SG +{ + unsigned int i; + + for (i=0; i < nents; i++) + rm->m_sg[i].sg_magic = SG_MAGIC; +} +#endif + atomic_set(&rm->m_refcount, 1); + INIT_LIST_HEAD(&rm->m_sock_item); + INIT_LIST_HEAD(&rm->m_conn_item); + spin_lock_init(&rm->m_rs_lock); + +out: + return rm; +} + +struct rds_message 
*rds_message_map_pages(unsigned long *page_addrs, unsigned int total_len) +{ + struct rds_message *rm; + unsigned int i; + + rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL); + if (rm == NULL) + return ERR_PTR(-ENOMEM); + + set_bit(RDS_MSG_PAGEVEC, &rm->m_flags); + rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len); + rm->m_nents = ceil(total_len, PAGE_SIZE); + + for (i = 0; i < rm->m_nents; ++i) { + sg_set_page(&rm->m_sg[i], + virt_to_page(page_addrs[i]), + PAGE_SIZE, 0); + } + + return rm; +} + +struct rds_message *rds_message_copy_from_user(struct iovec *first_iov, + size_t total_len) +{ + unsigned long to_copy; + unsigned long iov_off; + unsigned long sg_off; + struct rds_message *rm; + struct iovec *iov; + struct scatterlist *sg; + int ret; + + rm = rds_message_alloc(ceil(total_len, PAGE_SIZE), GFP_KERNEL); + if (rm == NULL) { + ret = -ENOMEM; + goto out; + } + + rm->m_inc.i_hdr.h_len = cpu_to_be32(total_len); + + /* + * now allocate and copy in the data payload. + */ + sg = rm->m_sg; + iov = first_iov; + iov_off = 0; + sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. 
*/ + + while (total_len) { + if (sg_page(sg) == NULL) { + ret = rds_page_remainder_alloc(sg, total_len, + GFP_HIGHUSER); + if (ret) + goto out; + rm->m_nents++; + sg_off = 0; + } + + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, sg->length - sg_off); + to_copy = min_t(size_t, to_copy, total_len); + + rdsdebug("copying %lu bytes from user iov [%p, %zu] + %lu to " + "sg [%p, %u, %u] + %lu\n", + to_copy, iov->iov_base, iov->iov_len, iov_off, + (void *)sg_page(sg), sg->offset, sg->length, sg_off); + + ret = rds_page_copy_from_user(sg_page(sg), sg->offset + sg_off, + iov->iov_base + iov_off, + to_copy); + if (ret) + goto out; + + iov_off += to_copy; + total_len -= to_copy; + sg_off += to_copy; + + if (sg_off == sg->length) + sg++; + } + + ret = 0; +out: + if (ret) { + if (rm) + rds_message_put(rm); + rm = ERR_PTR(ret); + } + return rm; +} + +int rds_message_inc_copy_to_user(struct rds_incoming *inc, + struct iovec *first_iov, size_t size) +{ + struct rds_message *rm; + struct iovec *iov; + struct scatterlist *sg; + unsigned long to_copy; + unsigned long iov_off; + unsigned long vec_off; + int copied; + int ret; + u32 len; + + rm = container_of(inc, struct rds_message, m_inc); + len = be32_to_cpu(rm->m_inc.i_hdr.h_len); + + iov = first_iov; + iov_off = 0; + sg = rm->m_sg; + vec_off = 0; + copied = 0; + + while (copied < size && copied < len) { + while (iov_off == iov->iov_len) { + iov_off = 0; + iov++; + } + + to_copy = min(iov->iov_len - iov_off, sg->length - vec_off); + to_copy = min_t(size_t, to_copy, size - copied); + to_copy = min_t(unsigned long, to_copy, len - copied); + + rdsdebug("copying %lu bytes to user iov [%p, %zu] + %lu to " + "sg [%p, %u, %u] + %lu\n", + to_copy, iov->iov_base, iov->iov_len, iov_off, + sg_page(sg), sg->offset, sg->length, vec_off); + + ret = rds_page_copy_to_user(sg_page(sg), sg->offset + vec_off, + iov->iov_base + iov_off, + to_copy); + if (ret) { + copied = ret; + break; + } + + 
iov_off += to_copy; + vec_off += to_copy; + copied += to_copy; + + if (vec_off == sg->length) { + vec_off = 0; + sg++; + } + } + + return copied; +} + +/* + * If the message is still on the send queue, wait until the transport + * is done with it. This is particularly important for RDMA operations. + */ +void rds_message_wait(struct rds_message *rm) +{ + wait_event(rds_message_flush_waitq, + !test_bit(RDS_MSG_MAPPED, &rm->m_flags)); +} + +void rds_message_unmapped(struct rds_message *rm) +{ + clear_bit(RDS_MSG_MAPPED, &rm->m_flags); + if (waitqueue_active(&rds_message_flush_waitq)) + wake_up(&rds_message_flush_waitq); +} +EXPORT_SYMBOL_GPL(rds_message_unmapped); + diff --git a/drivers/infiniband/ulp/rds/page.c b/drivers/infiniband/ulp/rds/page.c new file mode 100644 index 0000000..55c21ef --- /dev/null +++ b/drivers/infiniband/ulp/rds/page.c @@ -0,0 +1,222 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ +#include + +#include "rds.h" + +struct rds_page_remainder { + struct page *r_page; + unsigned long r_offset; +}; + +DEFINE_PER_CPU(struct rds_page_remainder, rds_page_remainders) ____cacheline_aligned; + +/* + * returns 0 on success or -errno on failure. + * + * We don't have to worry about flush_dcache_page() as this only works + * with private pages. If, say, we were to do directed receive to pinned + * user pages we'd have to worry more about cache coherence. (Though + * the flush_dcache_page() in get_user_pages() would probably be enough). + */ +int rds_page_copy_user(struct page *page, unsigned long offset, + void __user *ptr, unsigned long bytes, + int to_user) +{ + unsigned long ret; + void *addr; + + if (to_user) + rds_stats_add(s_copy_to_user, bytes); + else + rds_stats_add(s_copy_from_user, bytes); + + addr = kmap_atomic(page, KM_USER0); + if (to_user) + ret = __copy_to_user_inatomic(ptr, addr + offset, bytes); + else + ret = __copy_from_user_inatomic(addr + offset, ptr, bytes); + kunmap_atomic(addr, KM_USER0); + + if (ret) { + addr = kmap(page); + if (to_user) + ret = copy_to_user(ptr, addr + offset, bytes); + else + ret = copy_from_user(addr + offset, ptr, bytes); + kunmap(page); + if (ret) + return -EFAULT; + } + + return 0; +} +EXPORT_SYMBOL_GPL(rds_page_copy_user); + +/* + * Message allocation uses this to build up regions of a message. + * + * @bytes - the number of bytes needed. + * @gfp - the waiting behaviour of the allocation + * + * @gfp is always ored with __GFP_HIGHMEM. Callers must be prepared to + * kmap the pages, etc. + * + * If @bytes is at least a full page then this just returns a page from + * alloc_page(). 
+ * + * If @bytes is a partial page then this stores the unused region of the + * page in a per-cpu structure. Future partial-page allocations may be + * satisfied from that cached region. This lets us waste less memory on + * small allocations with minimal complexity. It works because the transmit + * path passes read-only page regions down to devices. They hold a page + * reference until they are done with the region. + */ +int rds_page_remainder_alloc(struct scatterlist *scat, unsigned long bytes, + gfp_t gfp) +{ + struct rds_page_remainder *rem; + unsigned long flags; + struct page *page; + int ret; + + gfp |= __GFP_HIGHMEM; + + /* jump straight to allocation if we're trying for a huge page */ + if (bytes >= PAGE_SIZE) { + page = alloc_page(gfp); + if (page == NULL) { + ret = -ENOMEM; + } else { + sg_set_page(scat, page, PAGE_SIZE, 0); + ret = 0; + } + goto out; + } + + rem = &per_cpu(rds_page_remainders, get_cpu()); + local_irq_save(flags); + + while (1) { + /* avoid a tiny region getting stuck by tossing it */ + if (rem->r_page && bytes > (PAGE_SIZE - rem->r_offset)) { + rds_stats_inc(s_page_remainder_miss); + __free_page(rem->r_page); + rem->r_page = NULL; + } + + /* hand out a fragment from the cached page */ + if (rem->r_page && bytes <= (PAGE_SIZE - rem->r_offset)) { + sg_set_page(scat, rem->r_page, bytes, rem->r_offset); + get_page(sg_page(scat)); + + if (rem->r_offset != 0) + rds_stats_inc(s_page_remainder_hit); + + rem->r_offset += bytes; + if (rem->r_offset == PAGE_SIZE) { + __free_page(rem->r_page); + rem->r_page = NULL; + } + ret = 0; + break; + } + + /* alloc if there is nothing for us to use */ + local_irq_restore(flags); + put_cpu(); + + page = alloc_page(gfp); + + rem = &per_cpu(rds_page_remainders, get_cpu()); + local_irq_save(flags); + + if (page == NULL) { + ret = -ENOMEM; + break; + } + + /* did someone race to fill the remainder before us? 
*/ + if (rem->r_page) { + __free_page(page); + continue; + } + + /* otherwise install our page and loop around to alloc */ + rem->r_page = page; + rem->r_offset = 0; + } + + local_irq_restore(flags); + put_cpu(); +out: + rdsdebug("bytes %lu ret %d %p %u %u\n", bytes, ret, + ret ? NULL : sg_page(scat), ret ? 0 : scat->offset, + ret ? 0 : scat->length); + return ret; +} + +static int rds_page_remainder_cpu_notify(struct notifier_block *self, + unsigned long action, void *hcpu) +{ + struct rds_page_remainder *rem; + long cpu = (long)hcpu; + + rem = &per_cpu(rds_page_remainders, cpu); + + rdsdebug("cpu %ld action 0x%lx\n", cpu, action); + + switch (action) { + case CPU_DEAD: + if (rem->r_page) + __free_page(rem->r_page); + rem->r_page = NULL; + break; + } + + return 0; +} + +static struct notifier_block rds_page_remainder_nb = { + .notifier_call = rds_page_remainder_cpu_notify, +}; + +void rds_page_exit(void) +{ + int i; + + for_each_possible_cpu(i) + rds_page_remainder_cpu_notify(&rds_page_remainder_nb, + (unsigned long)CPU_DEAD, + (void *)(long)i); +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:55 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:55 -0800 Subject: [ofa-general] [PATCH 18/21] RDS/IB: Stats and sysctls In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-19-git-send-email-andy.grover@oracle.com> IB-specific stats and sysctls. 
Signed-off-by: Andy Grover --- drivers/infiniband/ulp/rds/ib_stats.c | 95 ++++++++++++++++++++++ drivers/infiniband/ulp/rds/ib_sysctl.c | 137 ++++++++++++++++++++++++++++++++ 2 files changed, 232 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c diff --git a/drivers/infiniband/ulp/rds/ib_stats.c b/drivers/infiniband/ulp/rds/ib_stats.c new file mode 100644 index 0000000..02e3e3d --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_stats.c @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "rds.h" +#include "ib.h" + +DEFINE_PER_CPU(struct rds_ib_statistics, rds_ib_stats) ____cacheline_aligned; + +static char *rds_ib_stat_names[] = { + "ib_connect_raced", + "ib_listen_closed_stale", + "ib_tx_cq_call", + "ib_tx_cq_event", + "ib_tx_ring_full", + "ib_tx_throttle", + "ib_tx_sg_mapping_failure", + "ib_tx_stalled", + "ib_tx_credit_updates", + "ib_rx_cq_call", + "ib_rx_cq_event", + "ib_rx_ring_empty", + "ib_rx_refill_from_cq", + "ib_rx_refill_from_thread", + "ib_rx_alloc_limit", + "ib_rx_credit_updates", + "ib_ack_sent", + "ib_ack_send_failure", + "ib_ack_send_delayed", + "ib_ack_send_piggybacked", + "ib_ack_received", + "ib_rdma_mr_alloc", + "ib_rdma_mr_free", + "ib_rdma_mr_used", + "ib_rdma_mr_pool_flush", + "ib_rdma_mr_pool_wait", + "ib_rdma_mr_pool_depleted", +}; + +unsigned int rds_ib_stats_info_copy(struct rds_info_iterator *iter, + unsigned int avail) +{ + struct rds_ib_statistics stats = {0, }; + uint64_t *src; + uint64_t *sum; + size_t i; + int cpu; + + if (avail < ARRAY_SIZE(rds_ib_stat_names)) + goto out; + + for_each_online_cpu(cpu) { + src = (uint64_t *)&(per_cpu(rds_ib_stats, cpu)); + sum = (uint64_t *)&stats; + for (i = 0; i < sizeof(stats) / sizeof(uint64_t); i++) + *(sum++) += *(src++); + } + + rds_stats_info_copy(iter, (uint64_t *)&stats, rds_ib_stat_names, + ARRAY_SIZE(rds_ib_stat_names)); +out: + return ARRAY_SIZE(rds_ib_stat_names); +} diff --git a/drivers/infiniband/ulp/rds/ib_sysctl.c b/drivers/infiniband/ulp/rds/ib_sysctl.c new file mode 100644 index 0000000..d87830d --- /dev/null +++ b/drivers/infiniband/ulp/rds/ib_sysctl.c @@ -0,0 +1,137 @@ +/* + * Copyright (c) 2006 Oracle. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ +#include +#include +#include + +#include "ib.h" + +static struct ctl_table_header *rds_ib_sysctl_hdr; + +unsigned long rds_ib_sysctl_max_send_wr = RDS_IB_DEFAULT_SEND_WR; +unsigned long rds_ib_sysctl_max_recv_wr = RDS_IB_DEFAULT_RECV_WR; +unsigned long rds_ib_sysctl_max_recv_allocation = (128 * 1024 * 1024) / RDS_FRAG_SIZE; +static unsigned long rds_ib_sysctl_max_wr_min = 1; +/* hardware will fail CQ creation long before this */ +static unsigned long rds_ib_sysctl_max_wr_max = (u32)~0; + +unsigned long rds_ib_sysctl_max_unsig_wrs = 16; +static unsigned long rds_ib_sysctl_max_unsig_wr_min = 1; +static unsigned long rds_ib_sysctl_max_unsig_wr_max = 64; + +unsigned long rds_ib_sysctl_max_unsig_bytes = (16 << 20); +static unsigned long rds_ib_sysctl_max_unsig_bytes_min = 1; +static unsigned long rds_ib_sysctl_max_unsig_bytes_max = ~0UL; + +unsigned int rds_ib_sysctl_flow_control = 1; + +ctl_table rds_ib_sysctl_table[] = { + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_send_wr", + .data = &rds_ib_sysctl_max_send_wr, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_wr_min, + .extra2 = &rds_ib_sysctl_max_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_recv_wr", + .data = &rds_ib_sysctl_max_recv_wr, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_wr_min, + .extra2 = &rds_ib_sysctl_max_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unsignaled_wr", + .data = &rds_ib_sysctl_max_unsig_wrs, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_unsig_wr_min, + .extra2 = &rds_ib_sysctl_max_unsig_wr_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_unsignaled_bytes", + .data = &rds_ib_sysctl_max_unsig_bytes, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = 
&proc_doulongvec_minmax, + .extra1 = &rds_ib_sysctl_max_unsig_bytes_min, + .extra2 = &rds_ib_sysctl_max_unsig_bytes_max, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "max_recv_allocation", + .data = &rds_ib_sysctl_max_recv_allocation, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_minmax, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "flow_control", + .data = &rds_ib_sysctl_flow_control, + .maxlen = sizeof(rds_ib_sysctl_flow_control), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = 0} +}; + +static struct ctl_path rds_ib_sysctl_path[] = { + { .procname = "net", .ctl_name = CTL_NET, }, + { .procname = "rds", .ctl_name = CTL_UNNUMBERED, }, + { .procname = "ib", .ctl_name = CTL_UNNUMBERED, }, + { } +}; + +void rds_ib_sysctl_exit(void) +{ + if (rds_ib_sysctl_hdr) + unregister_sysctl_table(rds_ib_sysctl_hdr); +} + +int __init rds_ib_sysctl_init(void) +{ + rds_ib_sysctl_hdr = register_sysctl_paths(rds_ib_sysctl_path, rds_ib_sysctl_table); + if (rds_ib_sysctl_hdr == NULL) + return -ENOMEM; + return 0; +} -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:56 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:56 -0800 Subject: [ofa-general] [PATCH 19/21] RDS: Documentation In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-20-git-send-email-andy.grover@oracle.com> This file documents the specifics of the RDS sockets API, as well as covering some of the details of its internal implementation. 
Signed-off-by: Andy Grover --- Documentation/networking/rds.txt | 356 ++++++++++++++++++++++++++++++++++++++ 1 files changed, 356 insertions(+), 0 deletions(-) create mode 100644 Documentation/networking/rds.txt diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt new file mode 100644 index 0000000..c67077c --- /dev/null +++ b/Documentation/networking/rds.txt @@ -0,0 +1,356 @@ + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not Infiniband-specific; it was designed to support different +transports. The current implementation used to support RDS over TCP as well +as IB. Work is in progress to support RDS over iWARP, and using DCE to +guarantee no dropped packets on Ethernet, it may be possible to use RDS over +UDP in the future. + +The high-level semantics of RDS from the application's point of view are + + * Addressing + RDS uses IPv4 addresses and 16bit port numbers to identify + the end point of a connection. All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + transport has to be IP-based. In fact, RDS over IB uses a + reliable IB connection; the IP address is used exclusively to + locate the remote node's GID (by ARPing for the given IP). 
+ + The port space is entirely independent of UDP, TCP or any other + protocol. + + * Socket interface + RDS sockets work *mostly* as you would expect from a BSD + socket. The next section will cover the details. At any rate, + all I/O is performed through the standard BSD socket API. + Some additions like zerocopy support are implemented through + control messages, while other extensions use the getsockopt/ + setsockopt calls. + + Sockets must be bound before you can send or receive data. + This is needed because binding also selects a transport and + attaches it to the socket. Once bound, the transport assignment + does not change. RDS will tolerate IPs moving around (e.g. in + an active-active HA scenario), but only as long as the address + doesn't move to a different transport. + + * sysctls + RDS supports a number of sysctls in /proc/sys/net/rds + + +Socket Interface +================ + + AF_RDS, PF_RDS, SOL_RDS + These constants haven't been assigned yet, because RDS isn't in + mainline yet. Currently, the kernel module assigns some constant + and publishes it to user space through two sysctl files + /proc/sys/net/rds/pf_rds + /proc/sys/net/rds/sol_rds + + fd = socket(PF_RDS, SOCK_SEQPACKET, 0); + This creates a new, unbound RDS socket. + + setsockopt(SOL_SOCKET): send and receive buffer size + RDS honors the send and receive buffer size socket options. + You are not allowed to queue more than SO_SNDBUF bytes to + a socket. A message is queued when sendmsg is called, and + it leaves the queue when the remote system acknowledges + its arrival. + + The SO_RCVBUF option controls the maximum receive queue length. + This is a soft limit rather than a hard limit - RDS will + continue to accept and queue incoming messages, even if that + takes the queue length over the limit. However, it will also + mark the port as "congested" and send a congestion update to + the source node. The source node is supposed to throttle any + processes sending to this congested port.
+ + bind(fd, &sockaddr_in, ...) + This binds the socket to a local IP address and port, and a + transport. + + sendmsg(fd, ...) + Sends a message to the indicated recipient. The kernel will + transparently establish the underlying reliable connection + if it isn't up yet. + + An attempt to send a message that exceeds SO_SNDBUF will + return with -EMSGSIZE. + + An attempt to send a message that would take the total number + of queued bytes over the SO_SNDBUF threshold will return + EAGAIN. + + An attempt to send a message to a destination that is marked + as "congested" will return ENOBUFS. + + recvmsg(fd, ...) + Receives a message that was queued to this socket. The socket's + recv queue accounting is adjusted, and if the queue length + drops below SO_RCVBUF, the port is marked uncongested, and + a congestion update is sent to all peers. + + Applications can ask the RDS kernel module to deliver + notifications via control messages (for instance, there is a + notification when a congestion update arrives, or when an RDMA + operation completes). These notifications are received through + the msg.msg_control buffer of struct msghdr. The format of the + messages is described in manpages. + + poll(fd) + RDS supports the poll interface to allow the application + to implement async I/O. + + POLLIN handling is pretty straightforward. When there's an + incoming message queued to the socket, or a pending notification, + we signal POLLIN. + + POLLOUT is a little harder. Since you can essentially send + to any destination, RDS will always signal POLLOUT as long as + there's room on the send queue (i.e. the number of bytes queued + is less than the sendbuf size). + + However, the kernel will refuse to accept messages to + a destination marked congested - in this case you will loop + forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with + this - by using congestion notifications, and by checking for + ENOBUFS errors returned by sendmsg. + + setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) + This allows the application to discard all messages queued to a + specific destination on this particular socket. + + This allows the application to cancel outstanding messages if + it detects a timeout. For instance, if it tried to send a message, + and the remote host is unreachable, RDS will keep trying forever. + The application may decide it's not worth it, and cancel the + operation. In this case, it would use RDS_CANCEL_SENT_TO to + nuke any pending messages. + + +RDMA for RDS +============ + + see rds-rdma(7) manpage (available in rds-tools) + + +Congestion Notifications +======================== + + see rds(7) manpage + + +RDS Protocol +============ + + Message header + + The message header is a 'struct rds_header' (see rds.h): + Fields: + h_sequence: + per-packet sequence number + h_ack: + piggybacked acknowledgment of last packet received + h_len: + length of data, not including header + h_sport: + source port + h_dport: + destination port + h_flags: + CONG_BITMAP - this is a congestion update bitmap + ACK_REQUIRED - receiver must ack this packet + RETRANSMITTED - packet has previously been sent + h_credit: + indicate to other end of connection that + it has more credits available (i.e. there is + more send room) + h_padding[4]: + unused, for future use + h_csum: + header checksum + h_exthdr: + optional data can be passed here. This is currently used for + passing RDMA-related information. + + ACK and retransmit handling + + One might think that with reliable IB connections you wouldn't need + to ack messages that have been received. The problem is that IB + hardware generates an ack message before it has DMAed the message + into memory. 
This creates a potential message loss if the HCA is + disabled for any reason between when it sends the ack and before + the message is DMAed and processed. This is only a potential issue + if another HCA is available for fail-over. + + Sending an ack immediately would allow the sender to free the sent + message from their send queue quickly, but could cause excessive + traffic to be used for acks. RDS piggybacks acks on sent data + packets. Ack-only packets are reduced by only allowing one to be + in flight at a time, and by the sender only asking for acks when + its send buffers start to fill up. All retransmissions are also + acked. + + Flow Control + + RDS's IB transport uses a credit-based mechanism to verify that + there is space in the peer's receive buffers for more data. This + eliminates the need for hardware retries on the connection. + + Congestion + + Messages waiting in the receive queue on the receiving socket + are accounted against the socket's SO_RCVBUF option value. Only + the payload bytes in the message are accounted for. If the + number of bytes queued equals or exceeds rcvbuf then the socket + is congested. All sends attempted to this socket's address + should block or return -EWOULDBLOCK. + + Applications are expected to be reasonably tuned such that this + situation very rarely occurs. An application encountering this + "back-pressure" is considered a bug. + + This is implemented by having each node maintain bitmaps which + indicate which ports on bound addresses are congested. As the + bitmap changes it is sent through all the connections which + terminate in the local address of the bitmap which changed. + + The bitmaps are allocated as connections are brought up. This + avoids allocation in the interrupt handling path which queues + messages on sockets. The dense bitmaps let transports send the + entire bitmap on any bitmap change reasonably efficiently.
This + is much easier to implement than some finer-grained + communication of per-port congestion. The sender does a very + inexpensive bit test to test if the port it's about to send to + is congested or not. + + +RDS Transport Layer +================== + + As mentioned above, RDS is not IB-specific. Its code is divided + into a general RDS layer and a transport layer. + + The general layer handles the socket API, congestion handling, + loopback, stats, usermem pinning, and the connection state machine. + + The transport layer handles the details of the transport. The IB + transport, for example, handles all the queue pairs, work requests, + CM event handlers, and other Infiniband details. + + +RDS Kernel Structures +===================== + + struct rds_message + aka possibly "rds_outgoing", the generic RDS layer copies data to + be sent and sets header fields as needed, based on the socket API. + This is then queued for the individual connection and sent by the + connection's transport. + struct rds_incoming + a generic struct referring to incoming data that can be handed from + the transport to the general code and queued by the general code + while the socket is awoken. It is then passed back to the transport + code to handle the actual copy-to-user. + struct rds_socket + per-socket information + struct rds_connection + per-connection information + struct rds_transport + pointers to transport-specific functions + struct rds_statistics + non-transport-specific statistics + struct rds_cong_map + wraps the raw congestion bitmap, contains rbnode, waitq, etc. + +Connection management +===================== + + Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and + ERROR states. + + The first time an attempt is made by an RDS socket to send data to + a node, a connection is allocated and connected. That connection is + then maintained forever -- if there are transport errors, the + connection will be dropped and re-established. 
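The per-port congestion bitmap from the Congestion section above can be pictured as one bit per port. The following is a user-space sketch only, not the kernel's struct rds_cong_map: it assumes a 16-bit port space, which gives 65536 bits (8 KB) per bound address, and it ignores byte order and locking. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CONG_PORTS 65536u            /* assumed 16-bit port space */
#define CONG_BYTES (CONG_PORTS / 8u) /* 8192 bytes per map */

struct cong_map {
	uint8_t bits[CONG_BYTES];
};

/* Start with every port marked uncongested. */
void cong_map_init(struct cong_map *m)
{
	assert(m != NULL);
	memset(m->bits, 0, sizeof(m->bits));
}

void cong_set(struct cong_map *m, uint16_t port)
{
	m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

void cong_clear(struct cong_map *m, uint16_t port)
{
	m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* Sender side: the cheap bit test done before sending to a port. */
int cong_test(const struct cong_map *m, uint16_t port)
{
	return (m->bits[port >> 3] >> (port & 7)) & 1;
}

/* Receiver side: a port is congested once the queued payload bytes
 * equal or exceed rcvbuf. Returns 1 if the bit changed, meaning the
 * map would have to be sent to peers again, and 0 otherwise. */
int cong_update(struct cong_map *m, uint16_t port, size_t queued, size_t rcvbuf)
{
	int was = cong_test(m, port);
	int now = queued >= rcvbuf;

	if (now)
		cong_set(m, port);
	else
		cong_clear(m, port);
	return was != now;
}
```

A sender would call cong_test() on its copy of the peer's map before queueing a message, while the receiver would call cong_update() whenever its receive queue length changes.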
+ + Dropping a connection while packets are queued will cause queued or + partially-sent datagrams to be retransmitted when the connection is + re-established. + + +The send path +============= + + rds_sendmsg() + struct rds_message built from incoming data + CMSGs parsed (e.g. RDMA ops) + transport connection alloced and connected if not already + rds_message placed on send queue + send worker awoken + rds_send_worker() + calls rds_send_xmit() until queue is empty + rds_send_xmit() + transmits congestion map if one is pending + may set ACK_REQUIRED + calls transport to send either non-RDMA or RDMA message + (RDMA ops never retransmitted) + rds_ib_xmit() + allocs work requests from send ring + adds any new send credits available to peer (h_credits) + maps the rds_message's sg list + piggybacks ack + populates work requests + post send to connection's queue pair + +The recv path +============= + + rds_ib_recv_cq_comp_handler() + looks at write completions + unmaps recv buffer from device + no errors, call rds_ib_process_recv() + refill recv ring + rds_ib_process_recv() + validate header checksum + copy header to rds_ib_incoming struct if start of a new datagram + add to ibinc's fraglist + if completed datagram: + update cong map if datagram was cong update + call rds_recv_incoming() otherwise + note if ack is required + rds_recv_incoming() + drop duplicate packets + respond to pings + find the sock associated with this datagram + add to sock queue + wake up sock + do some congestion calculations + rds_recvmsg + copy data into user iovec + handle CMSGs + return to application + + -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:57 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:57 -0800 Subject: [ofa-general] [PATCH 20/21] RDS: Kconfig and Makefile In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID:
<1233022678-9259-21-git-send-email-andy.grover@oracle.com> Add RDS Kconfig and Makefile, and modify infiniband's to add us to the build. Signed-off-by: Andy Grover --- drivers/infiniband/Kconfig | 2 ++ drivers/infiniband/Makefile | 1 + drivers/infiniband/ulp/rds/Kconfig | 13 +++++++++++++ drivers/infiniband/ulp/rds/Makefile | 13 +++++++++++++ 4 files changed, 29 insertions(+), 0 deletions(-) create mode 100644 drivers/infiniband/ulp/rds/Kconfig create mode 100644 drivers/infiniband/ulp/rds/Makefile diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index dd0db67..1cba524 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -54,4 +54,6 @@ source "drivers/infiniband/ulp/srp/Kconfig" source "drivers/infiniband/ulp/iser/Kconfig" +source "drivers/infiniband/ulp/rds/Kconfig" + endif # INFINIBAND diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile index ed35e44..39d0203 100644 --- a/drivers/infiniband/Makefile +++ b/drivers/infiniband/Makefile @@ -9,3 +9,4 @@ obj-$(CONFIG_INFINIBAND_NES) += hw/nes/ obj-$(CONFIG_INFINIBAND_IPOIB) += ulp/ipoib/ obj-$(CONFIG_INFINIBAND_SRP) += ulp/srp/ obj-$(CONFIG_INFINIBAND_ISER) += ulp/iser/ +obj-$(CONFIG_INFINIBAND_RDS) += ulp/rds/ diff --git a/drivers/infiniband/ulp/rds/Kconfig b/drivers/infiniband/ulp/rds/Kconfig new file mode 100644 index 0000000..bbc2ba4 --- /dev/null +++ b/drivers/infiniband/ulp/rds/Kconfig @@ -0,0 +1,13 @@ + +config INFINIBAND_RDS + tristate "Reliable Datagram Sockets (RDS) (EXPERIMENTAL)" + depends on EXPERIMENTAL + ---help--- + RDS provides reliable, sequenced delivery of datagrams + over Infiniband.
+ +config INFINIBAND_RDS_DEBUG + bool "Debugging messages" + depends on INFINIBAND_RDS + default n + diff --git a/drivers/infiniband/ulp/rds/Makefile b/drivers/infiniband/ulp/rds/Makefile new file mode 100644 index 0000000..d470550 --- /dev/null +++ b/drivers/infiniband/ulp/rds/Makefile @@ -0,0 +1,13 @@ +obj-$(CONFIG_INFINIBAND_RDS) += ib_rds.o + +ib_rds-y := af_rds.o bind.o cong.o connection.o info.o message.o \ + recv.o send.o stats.o sysctl.o threads.o transport.o \ + loop.o page.o rdma.o + +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ + ib_sysctl.o ib_rdma.o + +ifeq ($(CONFIG_INFINIBAND_RDS_DEBUG), y) +EXTRA_CFLAGS += -DDEBUG +endif + -- 1.5.6.3 From andy.grover at oracle.com Mon Jan 26 18:17:58 2009 From: andy.grover at oracle.com (Andy Grover) Date: Mon, 26 Jan 2009 18:17:58 -0800 Subject: [ofa-general] [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <1233022678-9259-22-git-send-email-andy.grover@oracle.com> RDS is a reliable datagram protocol used for IPC on Oracle database clusters. This adds address and protocol family numbers for it. Signed-off-by: Andy Grover --- include/linux/socket.h | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/include/linux/socket.h b/include/linux/socket.h index 20fc4bb..fda91af 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -191,7 +191,8 @@ struct ucred { #define AF_RXRPC 33 /* RxRPC sockets */ #define AF_ISDN 34 /* mISDN sockets */ #define AF_PHONET 35 /* Phonet sockets */ -#define AF_MAX 36 /* For now.. */ +#define AF_RDS 36 /* RDS sockets */ +#define AF_MAX 37 /* For now.. */ /* Protocol families, same as address families. 
*/ #define PF_UNSPEC AF_UNSPEC @@ -229,6 +230,7 @@ struct ucred { #define PF_RXRPC AF_RXRPC #define PF_ISDN AF_ISDN #define PF_PHONET AF_PHONET +#define PF_RDS AF_RDS #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. */ -- 1.5.6.3 From shemminger at vyatta.com Mon Jan 26 19:46:19 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Mon, 26 Jan 2009 19:46:19 -0800 Subject: [ofa-general] Re: [PATCH 01/21] RDS: Socket interface In-Reply-To: <1233022678-9259-2-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com> Message-ID: <20090126194619.02c9557e@extreme> On Mon, 26 Jan 2009 18:17:38 -0800 Andy Grover wrote: > Implement the RDS (Reliable Datagram Sockets) interface. > > Signed-off-by: Andy Grover > --- > drivers/infiniband/ulp/rds/af_rds.c | 677 +++++++++++++++++++++++++++++++++++ > drivers/infiniband/ulp/rds/bind.c | 202 +++++++++++ > 2 files changed, 879 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/rds/af_rds.c > create mode 100644 drivers/infiniband/ulp/rds/bind.c > > diff --git a/drivers/infiniband/ulp/rds/af_rds.c b/drivers/infiniband/ulp/rds/af_rds.c > new file mode 100644 > index 0000000..7158438 > --- /dev/null > +++ b/drivers/infiniband/ulp/rds/af_rds.c > @@ -0,0 +1,677 @@ > +/* > + * Copyright (c) 2006 Oracle. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. 
You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "rds.h" > +#include "rdma.h" > + > +static int enable_rdma = 1; > + > +module_param(enable_rdma, int, 0444); > +MODULE_PARM_DESC(enable_rdma, " Enable RDMA operations support"); Module parameter is a cop-out for not doing it right. > +/* this is just used for stats gathering :/ */ Then I would think a high speed protocol would use per-cpu and/or rcu. 
> +static DEFINE_SPINLOCK(rds_sock_lock); > +static unsigned long rds_sock_count; > +static LIST_HEAD(rds_sock_list); > +DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq); > + > +/* > + * This is called as the final descriptor referencing this socket is closed. > + * We have to unbind the socket so that another socket can be bound to the > + * address it was using. > + * > + * We have to be careful about racing with the incoming path. sock_orphan() > + * sets SOCK_DEAD and we use that as an indicator to the rx path that new > + * messages shouldn't be queued. > + */ > +static int rds_release(struct socket *sock) > +{ > + struct sock *sk = sock->sk; > + struct rds_sock *rs; > + unsigned long flags; > + > + if (sk == NULL) > + goto out; > + > + rs = rds_sk_to_rs(sk); > + > + sock_orphan(sk); > + /* Note - rds_clear_recv_queue grabs rs_recv_lock, so > + * that ensures the recv path has completed messing > + * with the socket. */ > + rds_clear_recv_queue(rs); > + rds_cong_remove_socket(rs); > + rds_remove_bound(rs); > + rds_send_drop_to(rs, NULL); > + rds_rdma_drop_keys(rs); > + rds_notify_queue_get(rs, NULL); > + > + spin_lock_irqsave(&rds_sock_lock, flags); > + list_del_init(&rs->rs_item); > + rds_sock_count--; > + spin_unlock_irqrestore(&rds_sock_lock, flags); > + > + sock->sk = NULL; > + sock_put(sk); > +out: > + return 0; > +} > + > +/* > + * Careful not to race with rds_release -> sock_orphan which clears sk_sleep. > + * _bh() isn't OK here, we're called from interrupt handlers. It's probably OK > + * to wake the waitqueue after sk_sleep is clear as we hold a sock ref, but > + * this seems more conservative. > + * NB - normally, one would use sk_callback_lock for this, but we can > + * get here from interrupts, whereas the network code grabs sk_callback_lock > + * with _lock_bh only - so relying on sk_callback_lock introduces livelocks. 
> + */ > +void rds_wake_sk_sleep(struct rds_sock *rs) > +{ > + unsigned long flags; > + > + read_lock_irqsave(&rs->rs_recv_lock, flags); > + __rds_wake_sk_sleep(rds_rs_to_sk(rs)); > + read_unlock_irqrestore(&rs->rs_recv_lock, flags); > +} > + > +static int rds_getname(struct socket *sock, struct sockaddr *uaddr, > + int *uaddr_len, int peer) > +{ > + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; > + struct rds_sock *rs = rds_sk_to_rs(sock->sk); > + > + memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); > + > + /* racey, don't care */ > + if (peer) { > + if (!rs->rs_conn_addr) > + return -ENOTCONN; > + > + sin->sin_port = rs->rs_conn_port; > + sin->sin_addr.s_addr = rs->rs_conn_addr; > + } else { > + sin->sin_port = rs->rs_bound_port; > + sin->sin_addr.s_addr = rs->rs_bound_addr; > + } > + > + sin->sin_family = AF_INET; > + > + *uaddr_len = sizeof(*sin); > + return 0; > +} > + > +/* > + * RDS' poll is without a doubt the least intuitive part of the interface, > + * as POLLIN and POLLOUT do not behave entirely as you would expect from > + * a network protocol. > + * > + * POLLIN is asserted if > + * - there is data on the receive queue. > + * - to signal that a previously congested destination may have become > + * uncongested > + * - A notification has been queued to the socket (this can be a congestion > + * update, or a RDMA completion). > + * > + * POLLOUT is asserted if there is room on the send queue. This does not mean > + * however, that the next sendmsg() call will succeed. If the application tries > + * to send to a congested destination, the system call may still fail (and > + * return ENOBUFS). 
> + */ > +static unsigned int rds_poll(struct file *file, struct socket *sock, > + poll_table *wait) > +{ > + struct sock *sk = sock->sk; > + struct rds_sock *rs = rds_sk_to_rs(sk); > + unsigned int mask = 0; > + unsigned long flags; > + > + poll_wait(file, sk->sk_sleep, wait); > + > + poll_wait(file, &rds_poll_waitq, wait); > + > + read_lock_irqsave(&rs->rs_recv_lock, flags); > + if (!rs->rs_cong_monitor) { > + /* When a congestion map was updated, we signal POLLIN for > + * "historical" reasons. Applications can also poll for > + * WRBAND instead. */ > + if (rds_cong_updated_since(&rs->rs_cong_track)) > + mask |= (POLLIN | POLLRDNORM | POLLWRBAND); > + } else { > + spin_lock(&rs->rs_lock); > + if (rs->rs_cong_notify) > + mask |= (POLLIN | POLLRDNORM); > + spin_unlock(&rs->rs_lock); > + } > + if (!list_empty(&rs->rs_recv_queue) > + || !list_empty(&rs->rs_notify_queue)) > + mask |= (POLLIN | POLLRDNORM); > + if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) > + mask |= (POLLOUT | POLLWRNORM); > + read_unlock_irqrestore(&rs->rs_recv_lock, flags); > + > + return mask; > +} > + > +static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) > +{ > +#ifdef KERNEL_HAS_CORE_CALLING_DEV_IOCTL > + return -ENOIOCTLCMD; > +#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */ > +#ifndef KERNEL_HAS_CORE_CALLING_DEV_IOCTL > + return dev_ioctl(cmd, (void __user *)arg); > +#endif /* KERNEL_HAS_CORE_CALLING_DEV_IOCTL */ > +} > + > +static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval, > + int len) > +{ > + struct sockaddr_in sin; > + int ret = 0; > + > + /* racing with another thread binding seems ok here */ > + if (rs->rs_bound_addr == 0) { > + ret = -ENOTCONN; /* XXX not a great errno */ > + goto out; > + } > + > + if (len < sizeof(struct sockaddr_in)) { > + ret = -EINVAL; > + goto out; > + } > + > + if (copy_from_user(&sin, optval, sizeof(sin))) { > + ret = -EFAULT; > + goto out; > + } > + > + rds_send_drop_to(rs, &sin); > +out: > + return ret; > +} > + > 
+static int rds_set_bool_option(unsigned char *optvar, char __user *optval, > + int optlen) > +{ > + int value; > + > + if (optlen < sizeof(int)) > + return -EINVAL; > + if (get_user(value, (int __user *) optval)) > + return -EFAULT; > + *optvar = !!value; > + return 0; > +} > + > +static int rds_cong_monitor(struct rds_sock *rs, char __user *optval, > + int optlen) > +{ > + int ret; > + > + ret = rds_set_bool_option(&rs->rs_cong_monitor, optval, optlen); > + if (ret == 0) { > + if (rs->rs_cong_monitor) { > + rds_cong_add_socket(rs); > + } else { > + rds_cong_remove_socket(rs); > + rs->rs_cong_mask = 0; > + rs->rs_cong_notify = 0; > + } > + } > + return ret; > +} > + > +static int rds_setsockopt(struct socket *sock, int level, int optname, > + char __user *optval, int optlen) > +{ > + struct rds_sock *rs = rds_sk_to_rs(sock->sk); > + int ret; > + > + if (level != SOL_RDS) { > + ret = -ENOPROTOOPT; > + goto out; > + } > + > + switch (optname) { > + case RDS_CANCEL_SENT_TO: > + ret = rds_cancel_sent_to(rs, optval, optlen); > + break; > + case RDS_GET_MR: > + if (enable_rdma) > + ret = rds_get_mr(rs, optval, optlen); > + else > + ret = -EOPNOTSUPP; > + break; > + case RDS_FREE_MR: > + if (enable_rdma) > + ret = rds_free_mr(rs, optval, optlen); > + else > + ret = -EOPNOTSUPP; > + break; > + case RDS_RECVERR: > + ret = rds_set_bool_option(&rs->rs_recverr, optval, optlen); > + break; > + case RDS_CONG_MONITOR: > + ret = rds_cong_monitor(rs, optval, optlen); > + break; > + default: > + ret = -ENOPROTOOPT; > + } > +out: > + return ret; > +} > + > +static int rds_getsockopt(struct socket *sock, int level, int optname, > + char __user *optval, int __user *optlen) > +{ > + struct rds_sock *rs = rds_sk_to_rs(sock->sk); > + int ret = -ENOPROTOOPT, len; > + > + if (level != SOL_RDS) > + goto out; > + > + if (get_user(len, optlen)) { > + ret = -EFAULT; > + goto out; > + } > + > + switch (optname) { > + case RDS_INFO_FIRST ... 
RDS_INFO_LAST: > + ret = rds_info_getsockopt(sock, optname, optval, > + optlen); > + break; > + > + case RDS_RECVERR: > + if (len < sizeof(int)) > + ret = -EINVAL; > + else > + if (put_user(rs->rs_recverr, (int __user *) optval) > + || put_user(sizeof(int), optlen)) > + ret = -EFAULT; > + else > + ret = 0; > + break; > + default: > + break; > + } > + > +out: > + return ret; > + > +} > + > +static int rds_connect(struct socket *sock, struct sockaddr *uaddr, > + int addr_len, int flags) > +{ > + struct sock *sk = sock->sk; > + struct sockaddr_in *sin = (struct sockaddr_in *)uaddr; > + struct rds_sock *rs = rds_sk_to_rs(sk); > + int ret = 0; > + > + lock_sock(sk); > + > + if (addr_len != sizeof(struct sockaddr_in)) { > + ret = -EINVAL; > + goto out; > + } > + > + if (sin->sin_family != AF_INET) { > + ret = -EAFNOSUPPORT; > + goto out; > + } > + > + if (sin->sin_addr.s_addr == htonl(INADDR_ANY)) { > + ret = -EDESTADDRREQ; > + goto out; > + } > + > + rs->rs_conn_addr = sin->sin_addr.s_addr; > + rs->rs_conn_port = sin->sin_port; > + > +out: > + release_sock(sk); > + return ret; > +} > + > +#ifdef KERNEL_HAS_PROTO_REGISTER > +static struct proto rds_proto = { > + .name = "RDS", > + .owner = THIS_MODULE, > + .obj_size = sizeof(struct rds_sock), > +}; > +#endif /* KERNEL_HAS_PROTO_REGISTER */ > + > +static struct proto_ops rds_proto_ops = { > + .family = AF_RDS, > + .owner = THIS_MODULE, > + .release = rds_release, > + .bind = rds_bind, > + .connect = rds_connect, > + .socketpair = sock_no_socketpair, > + .accept = sock_no_accept, > + .getname = rds_getname, > + .poll = rds_poll, > + .ioctl = rds_ioctl, > + .listen = sock_no_listen, > + .shutdown = sock_no_shutdown, > + .setsockopt = rds_setsockopt, > + .getsockopt = rds_getsockopt, > + .sendmsg = rds_sendmsg, > + .recvmsg = rds_recvmsg, > + .mmap = sock_no_mmap, > + .sendpage = sock_no_sendpage, > +}; > + > +#ifndef KERNEL_HAS_PROTO_REGISTER > +static struct sock *sk_alloc_compat(int pf, gfp_t gfp, struct proto *prot, > + 
int zero_it) > +{ > + struct rds_sock *rs; > + > + sk = sk_alloc(pf, gfp, prot, zero_it); > + if (sk == NULL) > + return NULL; > + > + rs = kcalloc(1, sizeof(struct rds_sock), GFP_ATOMIC); > + if (rs == NULL) { > + sk_free(sk); > + return NULL; > + } > + > + /* sock_def_destruct frees this for us */ > + sk->sk_protinfo = rs; > + rs->rs_sk = sk; > + > + return sk; > +} > + > +#undef sk_alloc > +#define sk_alloc sk_alloc_compat > +#endif /* KERNEL_HAS_PROTO_REGISTER */ > + > +static int __rds_create(struct socket *sock, struct sock *sk, int protocol) > +{ > + unsigned long flags; > + struct rds_sock *rs; > + > + sock_init_data(sock, sk); > +#ifndef KERNEL_HAS_PROTO_REGISTER > + /* Can this be moved to sk_alloc_compat? */ > + sk_set_owner(sk, THIS_MODULE); > +#endif /* KERNEL_HAS_PROTO_REGISTER */ > + sock->ops = &rds_proto_ops; > + sk->sk_protocol = protocol; > + > + rs = rds_sk_to_rs(sk); > + spin_lock_init(&rs->rs_lock); > + rwlock_init(&rs->rs_recv_lock); > + INIT_LIST_HEAD(&rs->rs_send_queue); > + INIT_LIST_HEAD(&rs->rs_recv_queue); > + INIT_LIST_HEAD(&rs->rs_notify_queue); > + INIT_LIST_HEAD(&rs->rs_cong_list); > + spin_lock_init(&rs->rs_rdma_lock); > + rs->rs_rdma_keys = RB_ROOT; > + > + spin_lock_irqsave(&rds_sock_lock, flags); > + list_add_tail(&rs->rs_item, &rds_sock_list); > + rds_sock_count++; > + spin_unlock_irqrestore(&rds_sock_lock, flags); > + > + return 0; > +} > + > +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) > +static int rds_create(struct socket *sock, int protocol) > +{ > + struct sock *sk; > + > + if (sock->type != SOCK_SEQPACKET || protocol) > + return -ESOCKTNOSUPPORT; > + > + sk = sk_alloc(AF_RDS, GFP_ATOMIC, &rds_proto, 1); > + if (sk == NULL) > + return -ENOMEM; > + > + return __rds_create(sock, sk, protocol); > +} > +#else > +static int rds_create(struct net *net, struct socket *sock, int protocol) > +{ > + struct sock *sk; > + > + if (sock->type != SOCK_SEQPACKET || protocol) > + return -ESOCKTNOSUPPORT; > + > + sk = sk_alloc(net, 
AF_RDS, GFP_ATOMIC, &rds_proto); > + if (!sk) > + return -ENOMEM; > + > + return __rds_create(sock, sk, protocol); > +} > +#endif > + > +void rds_sock_addref(struct rds_sock *rs) > +{ > + sock_hold(rds_rs_to_sk(rs)); > +} > + > +void rds_sock_put(struct rds_sock *rs) > +{ > + sock_put(rds_rs_to_sk(rs)); > +} > + > +static struct net_proto_family rds_family_ops = { > + .family = AF_RDS, > + .create = rds_create, > + .owner = THIS_MODULE, > +}; > + > +static void rds_sock_inc_info(struct socket *sock, unsigned int len, > + struct rds_info_iterator *iter, > + struct rds_info_lengths *lens) > +{ > + struct rds_sock *rs; > + struct sock *sk; > + struct rds_incoming *inc; > + unsigned long flags; > + unsigned int total = 0; > + > + len /= sizeof(struct rds_info_message); > + > + spin_lock_irqsave(&rds_sock_lock, flags); > + > + list_for_each_entry(rs, &rds_sock_list, rs_item) { > + sk = rds_rs_to_sk(rs); > + read_lock(&rs->rs_recv_lock); > + > + /* XXX too lazy to maintain counts.. */ > + list_for_each_entry(inc, &rs->rs_recv_queue, i_item) { > + total++; > + if (total <= len) > + rds_inc_info_copy(inc, iter, inc->i_saddr, > + rs->rs_bound_addr, 1); > + } > + > + read_unlock(&rs->rs_recv_lock); > + } > + > + spin_unlock_irqrestore(&rds_sock_lock, flags); > + > + lens->nr = total; > + lens->each = sizeof(struct rds_info_message); > +} > + > +static void rds_sock_info(struct socket *sock, unsigned int len, > + struct rds_info_iterator *iter, > + struct rds_info_lengths *lens) > +{ > + struct rds_info_socket sinfo; > + struct rds_sock *rs; > + unsigned long flags; > + > + len /= sizeof(struct rds_info_socket); > + > + spin_lock_irqsave(&rds_sock_lock, flags); > + > + if (len < rds_sock_count) > + goto out; > + > + list_for_each_entry(rs, &rds_sock_list, rs_item) { > + sinfo.sndbuf = rds_sk_sndbuf(rs); > + sinfo.rcvbuf = rds_sk_rcvbuf(rs); > + sinfo.bound_addr = rs->rs_bound_addr; > + sinfo.connected_addr = rs->rs_conn_addr; > + sinfo.bound_port = rs->rs_bound_port; > + 
sinfo.connected_port = rs->rs_conn_port; > + sinfo.inum = sock_i_ino(rds_rs_to_sk(rs)); > + > + rds_info_copy(iter, &sinfo, sizeof(sinfo)); > + } > + > +out: > + lens->nr = rds_sock_count; > + lens->each = sizeof(struct rds_info_socket); > + > + spin_unlock_irqrestore(&rds_sock_lock, flags); > +} > + > +/* > + * The order is important here. > + * > + * rds_trans_stop_listening() is called before conn_exit so new connections > + * don't hit while existing ones are being torn down. > + * > + * rds_conn_exit() is before rds_trans_exit() as rds_conn_exit() calls into the > + * transports to free connections and incoming fragments as they're torn down. > + */ > +static void __exit rds_exit(void) > +{ > + rds_ib_exit(); > + sock_unregister(rds_family_ops.family); > +#ifdef KERNEL_HAS_PROTO_REGISTER > + proto_unregister(&rds_proto); > +#endif /* KERNEL_HAS_PROTO_REGISTER */ > + rds_trans_stop_listening(); > + rds_conn_exit(); > + rds_cong_exit(); > + rds_sysctl_exit(); > + rds_threads_exit(); > + rds_stats_exit(); > + rds_page_exit(); > + rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info); > + rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); > +} > +module_exit(rds_exit); > + > +static int __init rds_init(void) > +{ > + int ret; > + > +#ifndef KERNEL_HAS_NOT_DEFINED > + /* the strange ifdef above has scripts/makepatch.sh strip this out */ > +#if PF_RDS == 21 > + printk(KERN_ERR "!!! This build of RDS is using PF 21 which is not " > + "reserved\n"); > + printk(KERN_ERR "!!! This is only suitable for testing, DO NOT " > + "RELEASE THIS.\n"); > + printk(KERN_ERR "!!! 
No, seriously.\n"); > +#endif > +#endif /* KERNEL_HAS_NOT_DEFINED */ You don't want this ifdef crap in upstream > + > + ret = rds_conn_init(); > + if (ret) > + goto out; > + ret = rds_threads_init(); > + if (ret) > + goto out_conn; > + ret = rds_sysctl_init(); > + if (ret) > + goto out_threads; > + ret = rds_stats_init(); > + if (ret) > + goto out_sysctl; > +#ifdef KERNEL_HAS_PROTO_REGISTER > + ret = proto_register(&rds_proto, 1); > + if (ret) > + goto out_stats; > +#endif /* KERNEL_HAS_PROTO_REGISTER */ > + ret = sock_register(&rds_family_ops); > + if (ret) > + goto out_proto; > + > + rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info); > + rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info); > + > + ret = rds_ib_init(); > + if (ret) > + goto out_sock; > + goto out; > + > +out_sock: > + sock_unregister(rds_family_ops.family); > +out_proto: > +#ifdef KERNEL_HAS_PROTO_REGISTER > + proto_unregister(&rds_proto); > +out_stats: > +#endif /* KERNEL_HAS_PROTO_REGISTER */ > + rds_stats_exit(); > +out_sysctl: > + rds_sysctl_exit(); > +out_threads: > + rds_threads_exit(); > +out_conn: > + rds_conn_exit(); > + rds_cong_exit(); > + rds_page_exit(); > +out: > + return ret; > +} > +module_init(rds_init); > + > +#define DRV_VERSION "4.0" > +#define DRV_RELDATE "July 28, 2008" > + > +MODULE_AUTHOR("Zach Brown"); > +MODULE_AUTHOR("Olaf Kirch"); > +MODULE_DESCRIPTION("RDS: Reliable Datagram Sockets" > + " v" DRV_VERSION " (" DRV_RELDATE ")"); > +MODULE_VERSION(DRV_VERSION); > +MODULE_LICENSE("Dual BSD/GPL"); > +MODULE_ALIAS_NETPROTO(PF_RDS); From shemminger at vyatta.com Mon Jan 26 19:48:20 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Mon, 26 Jan 2009 19:48:20 -0800 Subject: [ofa-general] Re: [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: <1233022678-9259-4-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-4-git-send-email-andy.grover@oracle.com> Message-ID: 
<20090126194820.41cdb7f5@extreme> On Mon, 26 Jan 2009 18:17:40 -0800 Andy Grover wrote: > RDS handles per-socket congestion by updating peers with a complete > congestion map (8KB). This code keeps track of these maps for itself > and ones received from peers. > > Signed-off-by: Andy Grover > --- > drivers/infiniband/ulp/rds/cong.c | 424 +++++++++++++++++++++++++++++++++++++ > 1 files changed, 424 insertions(+), 0 deletions(-) > create mode 100644 drivers/infiniband/ulp/rds/cong.c > > diff --git a/drivers/infiniband/ulp/rds/cong.c b/drivers/infiniband/ulp/rds/cong.c > new file mode 100644 > index 0000000..b7c49d2 > --- /dev/null > +++ b/drivers/infiniband/ulp/rds/cong.c > @@ -0,0 +1,424 @@ > +/* > + * Copyright (c) 2007 Oracle. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > +#include > +#include > + > +#include "rds.h" > + > +/* > + * This file implements the receive side of the unconventional congestion > + * management in RDS. > + * > + * Messages waiting in the receive queue on the receiving socket are accounted > + * against the sockets SO_RCVBUF option value. Only the payload bytes in the > + * message are accounted for. If the number of bytes queued equals or exceeds > + * rcvbuf then the socket is congested. All sends attempted to this socket's > + * address should return block or return -EWOULDBLOCK. > + * > + * Applications are expected to be reasonably tuned such that this situation > + * very rarely occurs. An application encountering this "back-pressure" is > + * considered a bug. > + * > + * This is implemented by having each node maintain bitmaps which indicate > + * which ports on bound addresses are congested. As the bitmap changes it is > + * sent through all the connections which terminate in the local address of the > + * bitmap which changed. > + * > + * The bitmaps are allocated as connections are brought up. This avoids > + * allocation in the interrupt handling path which queues messages on sockets. > + * The dense bitmaps let transports send the entire bitmap on any bitmap change > + * reasonably efficiently. This is much easier to implement than some > + * finer-grained communication of per-port congestion. The sender does a very > + * inexpensive bit test to test if the port it's about to send to is congested > + * or not. > + */ > + > +/* > + * Interaction with poll is a tad tricky. We want all processes stuck in > + * poll to wake up and check whether a congested destination became uncongested. 
> + * The really sad thing is we have no idea which destinations the application
> + * wants to send to - we don't even know which rds_connections are involved.
> + * So until we implement a more flexible rds poll interface, we have to make
> + * do with this:
> + * We maintain a global counter that is incremented each time a congestion map
> + * update is received. Each rds socket tracks this value, and if rds_poll
> + * finds that the saved generation number is smaller than the global generation
> + * number, it wakes up the process.
> + */
> +static atomic_t rds_cong_generation = ATOMIC_INIT(0);
> +
> +/*
> + * Congestion monitoring
> + */
> +static LIST_HEAD(rds_cong_monitor);
> +static DEFINE_RWLOCK(rds_cong_monitor_lock);
> +
> +/*
> + * Yes, a global lock. It's used so infrequently that it's worth keeping it
> + * global to simplify the locking. It's only used in the following
> + * circumstances:
> + *
> + * - on connection buildup to associate a conn with its maps
> + * - on map changes to inform conns of a new map to send
> + *
> + * It's sadly ordered under the socket callback lock and the connection lock.
> + * Receive paths can mark ports congested from interrupt context so the
> + * lock masks interrupts.
> + */

So this is starting to look like another "Oracle special" like AIO and HugeTLB.
That has lots of caveat restrictions on the application.

From davem at davemloft.net Mon Jan 26 20:11:04 2009
From: davem at davemloft.net (David Miller)
Date: Mon, 26 Jan 2009 20:11:04 -0800 (PST)
Subject: [ofa-general] Re: [PATCH 01/21] RDS: Socket interface
In-Reply-To: <1233022678-9259-2-git-send-email-andy.grover@oracle.com>
References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com>
Message-ID: <20090126.201104.33429150.davem@davemloft.net>

Socket family implementations do not belong under the infiniband
subdirectory. Put it under net/ instead.
I don't care what the interdependencies happen to be.

From jackm at dev.mellanox.co.il Tue Jan 27 01:07:02 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Tue, 27 Jan 2009 11:07:02 +0200
Subject: [ofa-general] Re: IPoIB kernel Oops -- possible race condition identified.
In-Reply-To: <497DEC1A.2030104@Voltaire.COM>
References: <200901261741.08824.jackm@dev.mellanox.co.il> <497DEC1A.2030104@Voltaire.COM>
Message-ID: <200901271107.03074.jackm@dev.mellanox.co.il>

On Monday 26 January 2009 19:00, Yossi Etigin wrote:
> There's a patch of mine in OFED that's probably exposing a bug in ipoib.
> The bug is that priv->broadcast can be NULL-ified and join_task does not
> protect the check with the spinlock.
> The patch may expose the bug because it uses rtnl_lock().
> However, in 2.6.28 kernel there's another version of this patch which does not
> take rtnl_lock, so the problem still exists but is probably much harder to reproduce.
>

I'm using code IDENTICAL to the 2.6.28 code, except for ipoib_warn and
ipoib_dbg_mcast message formatting.

The 2.6.28 code is below. I've indicated the hole where priv->broadcast may be
set to NULL by another kernel thread. Below, at the end of this post, I have
included a suggested patch which fixes the problem.

- Jack

====================
2.6.28 snippet of ipoib_mcast_join_task:

	while (1) {
		struct ipoib_mcast *mcast = NULL;

		spin_lock_irq(&priv->lock);
		list_for_each_entry(mcast, &priv->multicast_list, list) {
			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
				/* Found the next unjoined group */
				break;
			}
		}
		spin_unlock_irq(&priv->lock);

		if (&mcast->list == &priv->multicast_list) {
			/* All done */
			break;
		}

		ipoib_mcast_join(dev, mcast, 1);
		return;
	}

==>***** priv->broadcast MAY BE NULLED OUT HERE! **************<======
	priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));

	if (!ipoib_cm_admin_enabled(dev)) {
		rtnl_lock();
		dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
		rtnl_unlock();
	}

=================================
The following patch can fix the problem:

ipoib: fix unprotected use of priv->broadcast in ipoib_mcast_join_task.

There is a race whereby the ipoib broadcast pointer may be set to NULL by flush
while the join task is being started. This protects the broadcast pointer access
via a spinlock. If the pointer is indeed NULL, we set the mcast_mtu value to the
current admin_mtu value -- since it does not matter anyway, the I/F is going down.

Signed-off-by: Jack Morgenstein

--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2009-01-27 10:48:07.399491000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2009-01-26 18:17:07.000000000 +0200
@@ -581,7 +593,12 @@ void ipoib_mcast_join_task(struct work_s
 		return;
 	}
 
-	priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
+	spin_lock_irq(&priv->lock);
+	if (priv->broadcast)
+		priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
+	else
+		priv->mcast_mtu = priv->admin_mtu;
+	spin_unlock_irq(&priv->lock);
 
 	if (!ipoib_cm_admin_enabled(dev)) {
 		rtnl_lock();

From dotanba at gmail.com Tue Jan 27 03:14:36 2009
From: dotanba at gmail.com (Dotan Barak)
Date: Tue, 27 Jan 2009 13:14:36 +0200
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Byte_Cnt field in the MTHCA_CQE structure
In-Reply-To:
References:
Message-ID: <2f3bf9a60901270314j9d79f9cr2380dbf14ba065bf@mail.gmail.com>

On Mon, Jan 26, 2009 at 11:08 PM, Adit Ranadive wrote:
> Hello,
> I have been looking at doing some low level work with the OFED library 1.1
> in terms of figuring out how many bytes have been sent by an IB
> application.
If you have the source code, you can add a counter when posting send request (according to the s/g entries in every posted send request). Dotan From vlad at lists.openfabrics.org Tue Jan 27 03:27:24 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 27 Jan 2009 03:27:24 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090127-0200 daily build status Message-ID: <20090127112725.278CAE61072@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 
with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From aluno3 at poczta.onet.pl Tue Jan 27 04:10:20 2009 From: aluno3 at poczta.onet.pl (aluno3 at poczta.onet.pl) Date: Tue, 27 Jan 2009 13:10:20 +0100 Subject: [ofa-general] NetEffect, iw_nes and kernel warning Message-ID: <497EF9AC.70104@poczta.onet.pl> Hello, I'm using "iw_nes" driver for NetEffect 10Gbit card under linux 2.6.27.10. I run it on PC with 8 processors. >From time to time kernel shows following warning: WARNING: at kernel/softirq.c:136 local_bh_enable+0x9b/0xa0() Modules linked in: iscsi_trgt st sg scst_vdisk scst drbd twofish twofish_common serpent blowfish sha256_generic crypto_null iscsi_tcp libiscsi scsi_transport_iscsi bonding e1000e thermal button iw_nes processor inet_lro ftdi_sio usbserial nls_iso8859_1 nls_cp437 megaraid_sas vfat fat aufs [last unloaded: iscsi_trgt] Pid: 0, comm: swapper Not tainted 2.6.27.10 #57 [] warn_on_slowpath+0x3e/0x60 [] prepare_signal+0xa0/0x130 [] read_tsc+0x35/0x40 [] read_tsc+0x35/0x40 [] getnstimeofday+0x38/0x160 [] read_tsc+0x35/0x40 [] getnstimeofday+0x38/0x160 [] clockevents_program_event+0xbc/0x140 [] tick_dev_program_event+0x56/0x90 [] local_bh_enable+0x9b/0xa0 [] skb_copy_bits+0x155/0x290 [] __pskb_pull_tail+0x5c/0x2e0 [] nes_netdev_start_xmit+0x815/0x8a0 [iw_nes] [] nes_netdev_start_xmit+0x10d/0x8a0 [iw_nes] [] getnstimeofday+0x38/0x160 [] clockevents_program_event+0xbc/0x140 [] tick_dev_program_event+0x56/0x90 [] tick_program_event+0x1f/0x30 [] hrtimer_interrupt+0x146/0x1b0 [] nes_nic_send+0x3c7/0x400 [iw_nes] [] dev_hard_start_xmit+0x51/0xd0 [] __qdisc_run+0x19d/0x220 [] dev_queue_xmit+0x284/0x2a0 [] 
ip_finish_output+0x108/0x290 [] ip_output+0x8a/0x90 [] ip_local_out+0x15/0x20 [] ip_queue_xmit+0x324/0x390 [] ip_local_out+0x15/0x20 [] ip_queue_xmit+0x324/0x390 [] lock_timer_base+0x19/0x40 [] tcp_write_timer+0x0/0xd0 [] tcp_select_window+0x2e/0xd0 [] tcp_transmit_skb+0x27e/0x3c0 [] tcp_write_xmit+0x16d/0x240 [] __tcp_push_pending_frames+0x14/0x70 [] tcp_rcv_established+0x3a4/0x740 [] tcp_v4_do_rcv+0xc1/0xd0 [] tcp_v4_rcv+0x4c0/0x580 [] ip_local_deliver_finish+0x4a/0x160 [] ip_local_deliver+0x7d/0x90 [] ip_rcv_finish+0xd6/0x2d0 [] ip_rcv+0x18c/0x290 [] netif_receive_skb+0x228/0x270 [] nes_nic_ce_handler+0x53d/0x680 [iw_nes] [] nes_netdev_poll+0x39/0xc0 [iw_nes] [] net_rx_action+0x7a/0x150 [] __do_softirq+0x78/0xf0 [] do_softirq+0x38/0x40 [] irq_exit+0x75/0x90 [] do_IRQ+0x3d/0x70 [] common_interrupt+0x23/0x30 [] mwait_idle+0x2a/0x30 [] cpu_idle+0x50/0x90 ======================= I haven't noticed any serious problems so far, but I'm not sure if this warning isn't caused by some bug,which will cause some serious problem in the future.Is the reason for this warning appearance known? Is there any fix for it? Thanks From sashak at voltaire.com Tue Jan 27 04:19:11 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 Jan 2009 14:19:11 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsysstat: use RMPP for client/server communication In-Reply-To: References: <20090126212621.GL5814@sashak.voltaire.com> Message-ID: <20090127121911.GA8534@sashak.voltaire.com> Hi Hal, On 17:14 Mon 26 Jan , Hal Rosenstock wrote: > On Mon, Jan 26, 2009 at 4:26 PM, Sasha Khapyorsky wrote: > > > > This patch adds support for bigger than (256 - vendor2 data offset) data > > sending by ibsysstat server using RMPP. It fixes bug#1237 - where server > > output was truncated due to MAD size limitation. > > Seems like the class version should be bumped for this change. Should it? Class vendor2 permits RMPP and it is defined in the spec as version 1. 
I think it was ibsysstat bug/decision to not use/handle it.

> What's the behavior of old client with new server and new client with
> old server ?

Basically it works fine together. Old server responds short MAD with RMPP
flags inactive. The only interesting case is when old client gets long RMPP
reply from new server - it will grab 256 bytes (so as it was before the data
will be truncated) and server will get timeout from RMPP layer - and yes,
seems we need to handle(drop) this MAD (as well as possible RMPP unrelated
timeout/error on client side). Something like this:

diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index c20a6f0..a145daf 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -169,6 +174,11 @@ static char *ibsystat_serv(void)
 	DEBUG("starting to serve...");
 
 	while ((umad = mad_receive(buf, -1))) {
+		if (umad_status(buf)) {
+			DEBUG("drop mad with status %x: %s", umad_status(buf),
+			      strerror(umad_status(buf)));
+			continue;
+		}
 
 		mad = umad_get_mad(umad);
 
@@ -235,6 +245,9 @@ static char *ibsystat(ib_portid_t *portid, int attr)
 	if (umad_recv(fd, buf, &len, timeout) < 0)
 		IBPANIC("umad_recv failed.");
 
+	if (umad_status(buf))
+		return strerror(umad_status(buf));
+
 	DEBUG("Got sysstat pong..");
 	if (attr != IB_PING_ATTR)
 		puts(data);

Sasha

From sashak at voltaire.com Tue Jan 27 04:46:56 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 27 Jan 2009 14:46:56 +0200
Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsysstat: use RMPP for client/server communication
In-Reply-To: <20090127121911.GA8534@sashak.voltaire.com>
References: <20090126212621.GL5814@sashak.voltaire.com> <20090127121911.GA8534@sashak.voltaire.com>
Message-ID: <20090127124648.GC8534@sashak.voltaire.com>

On 14:19 Tue 27 Jan , Sasha Khapyorsky wrote:
>
> diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
> index c20a6f0..a145daf 100644
> --- a/infiniband-diags/src/ibsysstat.c
> +++ b/infiniband-diags/src/ibsysstat.c
> @@ -169,6 +174,11 @@ static char *ibsystat_serv(void)
> 	DEBUG("starting to serve...");
>
> 	while ((umad = mad_receive(buf, -1))) {
> +		if (umad_status(buf)) {
> +			DEBUG("drop mad with status %x: %s", umad_status(buf),
> +			      strerror(umad_status(buf)));
> +			continue;
> +		}

And also to prevent timeouts when we are not really using large packets:

diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c
index c20a6f0..2d44ef6 100644
--- a/infiniband-diags/src/ibsysstat.c
+++ b/infiniband-diags/src/ibsysstat.c
@@ -89,7 +89,8 @@ static int server_respond(void *umad, int size)
 	rpc.oui = mad_get_field(mad, 0, IB_VEND2_OUI_F);
 	rpc.trid = mad_get_field64(mad, 0, IB_MAD_TRID_F);
 
-	rmpp.flags = IB_RMPP_FLAG_ACTIVE;
+	if (size > IB_MAD_SIZE)
+		rmpp.flags = IB_RMPP_FLAG_ACTIVE;
 
 	DEBUG("responding %d bytes to %s, attr 0x%x mod 0x%x qkey %x",
 	      size, portid2str(&rport), rpc.attr.id, rpc.attr.mod, rport.qkey);

Sasha

From vlad at dev.mellanox.co.il Tue Jan 27 06:33:12 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 27 Jan 2009 16:33:12 +0200
Subject: [ofa-general] [ANNOUNCE] RHEL5.3 support added to OFED-1.4 (latest daily build)
Message-ID: <497F1B28.4010607@dev.mellanox.co.il>

Hi,

RHEL5.3 support was added to OFED-1.4 daily builds, starting from
OFED-1.4-20090127-0600.

OFED-1.4 daily builds are available under:
http://www.openfabrics.org/downloads/OFED/ofed-1.4-daily/

Regards,
Vladimir

From swise at opengridcomputing.com Tue Jan 27 07:34:57 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 27 Jan 2009 09:34:57 -0600
Subject: [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS)
In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com>
References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com>
Message-ID: <497F29A1.60003@opengridcomputing.com>

Hey Andy,

Why didn't you include the iWARP transport as well?
Andy Grover wrote: > Hi Roland, > > This patchset adds support for RDS as an Infiniband ULP. RDS is an > Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably, > and is used currently in Oracle RAC and Exadata products. It's lived > in OFED for 2+ years and I think it's time to get it upstream -- most > likely into your -next tree for .30, but if it snuck into .29 via the > "new code merge-window exception" then even better. > > I've run checkpatch & sparse to clean up as many issues as possible so > what remains are really the design peculiarities (aka warts) that arise > from being a protocol designed by one company for a single critical > application. I think upstreaming this code is the first step towards > working out those issues, and making the end result available to a wider > audience. > > Also available for review at: > git://git.openfabrics.org/~agrover/ofed_1_4/linux-2.6 for-roland > > Thoughts? shortlog follows. > > Thanks -- Regards -- Andy > > Andy Grover (21): > RDS: Socket interface > RDS: Main header file > RDS: Congestion-handling code > RDS: Transport code > RDS: Info and stats > RDS: Connection handling > RDS: loopback > RDS: sysctls > RDS: Message parsing > RDS: send.c > RDS: recv.c > RDS: RDMA support > RDS/IB: Infiniband transport > RDS/IB: Ring-handling code. > RDS/IB: Implement RDMA ops using FMRs > RDS/IB: Implement IB-specific datagram send. 
> RDS/IB: Receive datagrams via IB > RDS/IB: Stats and sysctls > RDS: Documentation > RDS: Kconfig and Makefile > RDS: Add AF and PF #defines for RDS sockets > > Documentation/networking/rds.txt | 356 +++++++++++ > drivers/infiniband/Kconfig | 2 + > drivers/infiniband/Makefile | 1 + > drivers/infiniband/ulp/rds/Kconfig | 13 + > drivers/infiniband/ulp/rds/Makefile | 13 + > drivers/infiniband/ulp/rds/af_rds.c | 677 +++++++++++++++++++++ > drivers/infiniband/ulp/rds/bind.c | 202 +++++++ > drivers/infiniband/ulp/rds/cong.c | 424 +++++++++++++ > drivers/infiniband/ulp/rds/connection.c | 501 +++++++++++++++ > drivers/infiniband/ulp/rds/ib.c | 312 ++++++++++ > drivers/infiniband/ulp/rds/ib.h | 358 +++++++++++ > drivers/infiniband/ulp/rds/ib_cm.c | 882 > +++++++++++++++++++++++++++ > drivers/infiniband/ulp/rds/ib_rdma.c | 641 ++++++++++++++++++++ > drivers/infiniband/ulp/rds/ib_recv.c | 894 > +++++++++++++++++++++++++++ > drivers/infiniband/ulp/rds/ib_ring.c | 168 +++++ > drivers/infiniband/ulp/rds/ib_send.c | 852 > ++++++++++++++++++++++++++ > drivers/infiniband/ulp/rds/ib_stats.c | 95 +++ > drivers/infiniband/ulp/rds/ib_sysctl.c | 137 +++++ > drivers/infiniband/ulp/rds/info.c | 243 ++++++++ > drivers/infiniband/ulp/rds/info.h | 43 ++ > drivers/infiniband/ulp/rds/loop.c | 189 ++++++ > drivers/infiniband/ulp/rds/loop.h | 9 + > drivers/infiniband/ulp/rds/message.c | 414 +++++++++++++ > drivers/infiniband/ulp/rds/page.c | 222 +++++++ > drivers/infiniband/ulp/rds/rdma.c | 682 +++++++++++++++++++++ > drivers/infiniband/ulp/rds/rdma.h | 84 +++ > drivers/infiniband/ulp/rds/rds.h | 763 +++++++++++++++++++++++ > drivers/infiniband/ulp/rds/rds_rdma.h | 245 ++++++++ > drivers/infiniband/ulp/rds/recv.c | 550 +++++++++++++++++ > drivers/infiniband/ulp/rds/send.c | 1006 > +++++++++++++++++++++++++++++++ > drivers/infiniband/ulp/rds/stats.c | 150 +++++ > drivers/infiniband/ulp/rds/sysctl.c | 164 +++++ > drivers/infiniband/ulp/rds/threads.c | 273 +++++++++ > 
drivers/infiniband/ulp/rds/transport.c | 134 ++++ > include/linux/socket.h | 4 +- > 35 files changed, 11702 insertions(+), 1 deletions(-) > create mode 100644 Documentation/networking/rds.txt > create mode 100644 drivers/infiniband/ulp/rds/Kconfig > create mode 100644 drivers/infiniband/ulp/rds/Makefile > create mode 100644 drivers/infiniband/ulp/rds/af_rds.c > create mode 100644 drivers/infiniband/ulp/rds/bind.c > create mode 100644 drivers/infiniband/ulp/rds/cong.c > create mode 100644 drivers/infiniband/ulp/rds/connection.c > create mode 100644 drivers/infiniband/ulp/rds/ib.c > create mode 100644 drivers/infiniband/ulp/rds/ib.h > create mode 100644 drivers/infiniband/ulp/rds/ib_cm.c > create mode 100644 drivers/infiniband/ulp/rds/ib_rdma.c > create mode 100644 drivers/infiniband/ulp/rds/ib_recv.c > create mode 100644 drivers/infiniband/ulp/rds/ib_ring.c > create mode 100644 drivers/infiniband/ulp/rds/ib_send.c > create mode 100644 drivers/infiniband/ulp/rds/ib_stats.c > create mode 100644 drivers/infiniband/ulp/rds/ib_sysctl.c > create mode 100644 drivers/infiniband/ulp/rds/info.c > create mode 100644 drivers/infiniband/ulp/rds/info.h > create mode 100644 drivers/infiniband/ulp/rds/loop.c > create mode 100644 drivers/infiniband/ulp/rds/loop.h > create mode 100644 drivers/infiniband/ulp/rds/message.c > create mode 100644 drivers/infiniband/ulp/rds/page.c > create mode 100644 drivers/infiniband/ulp/rds/rdma.c > create mode 100644 drivers/infiniband/ulp/rds/rdma.h > create mode 100644 drivers/infiniband/ulp/rds/rds.h > create mode 100644 drivers/infiniband/ulp/rds/rds_rdma.h > create mode 100644 drivers/infiniband/ulp/rds/recv.c > create mode 100644 drivers/infiniband/ulp/rds/send.c > create mode 100644 drivers/infiniband/ulp/rds/stats.c > create mode 100644 drivers/infiniband/ulp/rds/sysctl.c > create mode 100644 drivers/infiniband/ulp/rds/threads.c > create mode 100644 drivers/infiniband/ulp/rds/transport.c > > end > > 
_______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From remi.denis-courmont at nokia.com Mon Jan 26 23:27:51 2009 From: remi.denis-courmont at nokia.com (=?iso-8859-15?q?R=E9mi?= Denis-Courmont) Date: Tue, 27 Jan 2009 09:27:51 +0200 Subject: [ofa-general] Re: [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets In-Reply-To: <1233022678-9259-22-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-22-git-send-email-andy.grover@oracle.com> Message-ID: <200901270927.51855.remi.denis-courmont@nokia.com> On Tuesday 27 January 2009 04:17:58 ext Andy Grover, you wrote: > RDS is a reliable datagram protocol used for IPC on Oracle > database clusters. This adds address and protocol family numbers > for it. > > Signed-off-by: Andy Grover > --- > include/linux/socket.h | 4 +++- > 1 files changed, 3 insertions(+), 1 deletions(-) > > diff --git a/include/linux/socket.h b/include/linux/socket.h > index 20fc4bb..fda91af 100644 > --- a/include/linux/socket.h > +++ b/include/linux/socket.h > @@ -191,7 +191,8 @@ struct ucred { > #define AF_RXRPC 33 /* RxRPC sockets */ > #define AF_ISDN 34 /* mISDN sockets */ > #define AF_PHONET 35 /* Phonet sockets */ > -#define AF_MAX 36 /* For now.. */ > +#define AF_RDS 36 /* RDS sockets */ > +#define AF_MAX 37 /* For now.. */ > > /* Protocol families, same as address families. */ > #define PF_UNSPEC AF_UNSPEC > @@ -229,6 +230,7 @@ struct ucred { > #define PF_RXRPC AF_RXRPC > #define PF_ISDN AF_ISDN > #define PF_PHONET AF_PHONET > +#define PF_RDS AF_RDS > #define PF_MAX AF_MAX > > /* Maximum queue length specifiable by listen. */ You also need to add lock class declaration to net/core/sock.c, I believe. 
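To illustrate the remark about net/core/sock.c: the lock class names there appear to be kept in tables indexed by address family, so a newly added family needs its own entries. What follows is only a userspace sketch of that idea. The demo_* names, the table layout and the fallback string are assumptions made for illustration and are not copied from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical family numbers mirroring the socket.h hunk quoted above. */
#define DEMO_AF_PHONET 35
#define DEMO_AF_RDS    36
#define DEMO_AF_MAX    37

/* Assumed layout: one lock class name per address family. */
static const char *const demo_key_strings[DEMO_AF_MAX + 1] = {
	[DEMO_AF_PHONET] = "sk_lock-AF_PHONET",
	[DEMO_AF_RDS]    = "sk_lock-AF_RDS",	/* the entry a new family adds */
	[DEMO_AF_MAX]    = "sk_lock-AF_MAX",
};

/* Return the lock class name for a family, or a fallback if none is set. */
static const char *demo_lock_name(int family)
{
	assert(family >= 0);
	if (family > DEMO_AF_MAX || demo_key_strings[family] == NULL)
		return "sk_lock-unknown";
	return demo_key_strings[family];
}

/* True if the family has its own entry in the table. */
static int demo_has_lock_name(int family)
{
	return strcmp(demo_lock_name(family), "sk_lock-unknown") != 0;
}
```

If AF_MAX is bumped without adding the matching entry, the new family's slot would keep whatever name sat there before, which is presumably what the remark is warning about.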
-- Rémi Denis-Courmont Maemo Software, Nokia Devices R&D From remi.denis-courmont at nokia.com Mon Jan 26 23:34:16 2009 From: remi.denis-courmont at nokia.com (=?iso-8859-15?q?R=E9mi?= Denis-Courmont) Date: Tue, 27 Jan 2009 09:34:16 +0200 Subject: [ofa-general] Re: [PATCH 02/21] RDS: Main header file In-Reply-To: <1233022678-9259-3-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-3-git-send-email-andy.grover@oracle.com> Message-ID: <200901270934.16669.remi.denis-courmont@nokia.com> On Tuesday 27 January 2009 04:17:39 ext Andy Grover, you wrote: > +/* > + * XXX randomly chosen, but at least seems to be unused: > + * # 18464-18768 Unassigned > + * We should do better. We want a reserved port to discourage unpriv'ed > + * userspace from listening. > + */ > +#define RDS_PORT 18634 Internet transport protocol port number? IANA has a process for assigning port numbers to proprietary protocols. Not that I'd blame you, as I inherited VLC media player's wide abuse of port 1234 as its current network core maintainer :( > +#ifndef AF_RDS > +#define AF_RDS 28 /* Reliable Datagram Socket */ > +#endif > + > +#ifndef PF_RDS > +#define PF_RDS AF_RDS > +#endif You should probably remove that and put the last patch of your series ahead of this one. > +#ifndef SOL_RDS > +#define SOL_RDS 272 > +#endif This is used by RXRPC nowadays, although I myself don't really understand why socket option levels need to be unique across all families. -- Rémi Denis-Courmont Maemo Software, Nokia Devices R&D
From zbr at ioremap.net Tue Jan 27 04:08:40 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 15:08:40 +0300 Subject: [ofa-general] Re: [PATCH 01/21] RDS: Socket interface In-Reply-To: <1233022678-9259-2-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com> Message-ID: <20090127120840.GC2646@ioremap.net> Hi Andy. On Mon, Jan 26, 2009 at 06:17:38PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > +/* this is just used for stats gathering :/ */ Shouldn't this be some kind of per-cpu data? > +static DEFINE_SPINLOCK(rds_sock_lock); > +static unsigned long rds_sock_count; > +static LIST_HEAD(rds_sock_list); > +DECLARE_WAIT_QUEUE_HEAD(rds_poll_waitq); Global list of all sockets? This does not scale, maybe it should be groupped into hash table or be per-device?
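As a rough illustration of the hash table suggestion, the sketch below splits one global socket list into fixed buckets, each with its own list head and counter, so that insert, remove and lookup touch a single bucket. It is plain userspace C and not RDS code: the demo_* names, the bucket count, the (address, port) key and the mixing function are all assumptions, and the per-bucket lock that kernel code would need is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUCKETS 64		/* assumed size; must be a power of two */

struct demo_sock {
	uint32_t addr;		/* hypothetical key: bound address */
	uint16_t port;		/* hypothetical key: bound port */
	struct demo_sock *next;
};

struct demo_bucket {
	/* a per-bucket lock would live here in kernel code */
	struct demo_sock *head;
	unsigned long count;	/* per-bucket share of the socket count */
};

static struct demo_bucket demo_table[DEMO_BUCKETS];

/* Mix the 64-bit (addr, port) key and reduce it to a bucket index. */
static unsigned int demo_hash(uint32_t addr, uint16_t port)
{
	uint64_t key = ((uint64_t)addr << 32) | port;

	key ^= key >> 33;
	key *= 0xff51afd7ed558ccdULL;
	key ^= key >> 33;
	return (unsigned int)(key & (DEMO_BUCKETS - 1));
}

/* Push a socket onto the front of its bucket. */
static void demo_insert(struct demo_sock *s)
{
	struct demo_bucket *b;

	assert(s != NULL);
	b = &demo_table[demo_hash(s->addr, s->port)];
	s->next = b->head;
	b->head = s;
	b->count++;
}

/* Walk only the bucket the key hashes to. */
static struct demo_sock *demo_lookup(uint32_t addr, uint16_t port)
{
	struct demo_sock *s = demo_table[demo_hash(addr, port)].head;

	while (s && (s->addr != addr || s->port != port))
		s = s->next;
	return s;
}

/* Unlink a socket from its bucket; returns 1 if it was found. */
static int demo_remove(struct demo_sock *s)
{
	struct demo_bucket *b = &demo_table[demo_hash(s->addr, s->port)];
	struct demo_sock **pp;

	for (pp = &b->head; *pp; pp = &(*pp)->next) {
		if (*pp == s) {
			*pp = s->next;
			s->next = NULL;
			b->count--;
			return 1;
		}
	}
	return 0;
}
```

With this layout a stats walk would have to visit every bucket, which is the trade-off against the single global list.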
> +static int rds_release(struct socket *sock) > +{ > + struct sock *sk = sock->sk; > + struct rds_sock *rs; > + unsigned long flags; > + > + if (sk == NULL) > + goto out; > + > + rs = rds_sk_to_rs(sk); > + > + sock_orphan(sk); Why is this needed, given that the socket is about to be freed? > + /* Note - rds_clear_recv_queue grabs rs_recv_lock, so > + * that ensures the recv path has completed messing > + * with the socket. */ > + rds_clear_recv_queue(rs); > + rds_cong_remove_socket(rs); > + rds_remove_bound(rs); > + rds_send_drop_to(rs, NULL); > + rds_rdma_drop_keys(rs); > + rds_notify_queue_get(rs, NULL); > + > + spin_lock_irqsave(&rds_sock_lock, flags); > + list_del_init(&rs->rs_item); > + rds_sock_count--; > + spin_unlock_irqrestore(&rds_sock_lock, flags); Do RDS sockets work with workloads that create and destroy sockets at a high rate? > +static unsigned int rds_poll(struct file *file, struct socket *sock, > + poll_table *wait) > +{ > + struct sock *sk = sock->sk; > + struct rds_sock *rs = rds_sk_to_rs(sk); > + unsigned int mask = 0; > + unsigned long flags; > + > + poll_wait(file, sk->sk_sleep, wait); > + > + poll_wait(file, &rds_poll_waitq, wait); > + Are you absolutely sure that the provided poll_table callback will not do bad things here? It is quite unusual to add several different queues into the same head in the poll callback. And shouldn't rds_poll_waitq be lock protected here? > + read_lock_irqsave(&rs->rs_recv_lock, flags); > + if (!rs->rs_cong_monitor) { > + /* When a congestion map was updated, we signal POLLIN for > + * "historical" reasons. Applications can also poll for > + * WRBAND instead. */ > + if (rds_cong_updated_since(&rs->rs_cong_track)) > + mask |= (POLLIN | POLLRDNORM | POLLWRBAND); > + } else { > + spin_lock(&rs->rs_lock); Could there be a lock interaction problem with the rs_recv_lock read lock above? > +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) This should be dropped in the mainline tree.
> +/* > + * XXX this probably still needs more work.. no INADDR_ANY, and rbtrees aren't > + * particularly zippy. > + * > + * This is now called for every incoming frame so we arguably care much more > + * about it than we used to. > + */ > +static DEFINE_SPINLOCK(rds_bind_lock); > +static struct rb_root rds_bind_tree = RB_ROOT; Hash table with the appropriate size will have faster lookup/access times btw. > +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port, > + struct rds_sock *insert) > +{ > + struct rb_node **p = &rds_bind_tree.rb_node; > + struct rb_node *parent = NULL; > + struct rds_sock *rs; > + u64 cmp; > + u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port); > + > + while (*p) { > + parent = *p; > + rs = rb_entry(parent, struct rds_sock, rs_bound_node); > + > + cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) | > + be16_to_cpu(rs->rs_bound_port); > + > + if (needle < cmp) Should it use wrapping logic if some field overflows? > + rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr), > + ntohs(port)); Iirc there is a new %pi4 or similar format id. -- Evgeniy Polyakov From zbr at ioremap.net Tue Jan 27 05:05:43 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 16:05:43 +0300 Subject: [ofa-general] Re: [PATCH 02/21] RDS: Main header file In-Reply-To: <1233022678-9259-3-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-3-git-send-email-andy.grover@oracle.com> Message-ID: <20090127130543.GD2646@ioremap.net> Hi. On Mon, Jan 26, 2009 at 06:17:39PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > +/* > + * XXX randomly chosen, but at least seems to be unused: > + * # 18464-18768 Unassigned > + * We should do better. We want a reserved port to discourage unpriv'ed > + * userspace from listening. > + */ > +#define RDS_PORT 18634 > + What will happen if some application already uses that port? 
> +#ifndef AF_RDS > +#define AF_RDS 28 /* Reliable Datagram Socket */ > +#endif From zbr at ioremap.net Tue Jan 27 05:10:49 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 16:10:49 +0300 Subject: [ofa-general] Re: [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: <1233022678-9259-4-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-4-git-send-email-andy.grover@oracle.com> Message-ID: <20090127131049.GE2646@ioremap.net> On Mon, Jan 26, 2009 at 06:17:40PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > +/* > + * Yes, a global lock. It's used so infrequently that it's worth keeping it > + * global to simplify the locking. It's only used in the following > + * circumstances: > + * > + * - on connection buildup to associate a conn with its maps Is this a rare condition? Is this protocol only intended for the long-living connections and is not suitable for the cases when lots of them are created and teared down quickly? -- Evgeniy Polyakov From zbr at ioremap.net Tue Jan 27 05:18:08 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 16:18:08 +0300 Subject: [ofa-general] Re: [PATCH 04/21] RDS: Transport code In-Reply-To: <1233022678-9259-5-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-5-git-send-email-andy.grover@oracle.com> Message-ID: <20090127131808.GF2646@ioremap.net> On Mon, Jan 26, 2009 at 06:17:41PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > +static LIST_HEAD(transports); > +static DECLARE_RWSEM(trans_sem); > + RDS_ prefix? > +int rds_trans_register(struct rds_transport *trans) > +{ > + BUG_ON(strlen(trans->t_name) + 1 > > + sizeof(((struct rds_info_connection *)0)->transport)); > + Wow. Why not declare 15 as some constant and put it into rds_transport structure definition? 
> +struct rds_transport *rds_trans_get_preferred(__be32 addr) > +{ > + struct rds_transport *trans; > + struct rds_transport *ret = NULL; > + > + if (IN_LOOPBACK(ntohl(addr))) > + return &rds_loop_transport; > + Tabs have run away. -- Evgeniy Polyakov From zbr at ioremap.net Tue Jan 27 05:28:00 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 16:28:00 +0300 Subject: [ofa-general] Re: [PATCH 05/21] RDS: Info and stats In-Reply-To: <1233022678-9259-6-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-6-git-send-email-andy.grover@oracle.com> Message-ID: <20090127132759.GG2646@ioremap.net> On Mon, Jan 26, 2009 at 06:17:42PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > +void rds_info_register_func(int optname, rds_info_func func) > +{ > + int offset = optname - RDS_INFO_FIRST; > + > + BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST); > + > + spin_lock(&rds_info_lock); > + BUG_ON(rds_info_funcs[offset] != NULL); > + rds_info_funcs[offset] = func; > + spin_unlock(&rds_info_lock); > +} > +EXPORT_SYMBOL_GPL(rds_info_register_func); > + > +void rds_info_deregister_func(int optname, rds_info_func func) > +{ > + int offset = optname - RDS_INFO_FIRST; > + > + BUG_ON(optname < RDS_INFO_FIRST || optname > RDS_INFO_LAST); > + Those bug_ons look quite scary, is there a way to actually have a wrong optname? Plus, those _INFO definitions are declared twice in the code, which makes it harder to update. > +/* > + * Typically we hold an atomic kmap across multiple rds_info_copy() calls > + * because the kmap is so expensive. This must be called before using blocking > + * operations while holding the mapping and as the iterator is torn down. 
> + */ > +void rds_info_iter_unmap(struct rds_info_iterator *iter) > +{ > + if (iter->addr != NULL) { > + kunmap_atomic(iter->addr, KM_USER0); > + iter->addr = NULL; > + } > +} > + This one is used to temporarily map some address, but functions called between map and unmap functions (like rds_info_getsockopt()) may sleep, which is wrong. -- Evgeniy Polyakov From zbr at ioremap.net Tue Jan 27 05:34:19 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 16:34:19 +0300 Subject: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling In-Reply-To: <1233022678-9259-7-git-send-email-andy.grover@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-7-git-send-email-andy.grover@oracle.com> Message-ID: <20090127133418.GH2646@ioremap.net> On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > +static inline int rds_conn_is_sending(struct rds_connection *conn) > +{ > + int ret = 0; > + > + if (!mutex_trylock(&conn->c_send_lock)) > + ret = 1; > + else > + mutex_unlock(&conn->c_send_lock); > + > + return ret; > +} > + This one is eventually invoked under the spin_lock with turned off irqs, which may freeze the machine: rds_for_each_conn_info() -> spin_lock_irqsave(global lock) -> rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending() -> boom. I did not check further though.
-- Evgeniy Polyakov From oliver at neukum.org Tue Jan 27 05:47:27 2009 From: oliver at neukum.org (Oliver Neukum) Date: Tue, 27 Jan 2009 14:47:27 +0100 Subject: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling In-Reply-To: <20090127133418.GH2646@ioremap.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-7-git-send-email-andy.grover@oracle.com> <20090127133418.GH2646@ioremap.net> Message-ID: <200901271447.29377.oliver@neukum.org> On Tuesday 27 January 2009 14:34:19, Evgeniy Polyakov wrote: > On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > > +static inline int rds_conn_is_sending(struct rds_connection *conn) > > +{ > > + int ret = 0; > > + > > + if (!mutex_trylock(&conn->c_send_lock)) > > + ret = 1; > > + else > > + mutex_unlock(&conn->c_send_lock); > > + > > + return ret; > > +} > > + > > This one is eventually invoked under the spin_lock with turned off irqs, > which may freeze the machine: > rds_for_each_conn_info() -> spin_lock_irqsave(global lock) -> > rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending() > -> boom. Why? This is _trylock. It won't block.
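[Editorial sketch, not part of the original message.] The probe quoted above has a simple shape: try to take the send lock without waiting, and report "sending" if someone else already holds it. The following is a minimal userspace model of that shape only, assuming a C11 `atomic_flag` as a stand-in for the kernel mutex. `struct conn`, `conn_trylock()`, `conn_unlock()` and `conn_is_sending()` are hypothetical names, not RDS code. It shows that the acquire attempt itself never waits; it says nothing about which contexts the kernel mutex API may be called from, which is the point under dispute in this thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct rds_connection; only the lock matters. */
struct conn {
    atomic_flag send_lock;   /* set while some sender holds the lock */
};

/* Non-blocking acquire: returns true if the caller now holds the lock. */
bool conn_trylock(struct conn *c)
{
    return !atomic_flag_test_and_set_explicit(&c->send_lock,
                                              memory_order_acquire);
}

/* Release the lock taken by a successful conn_trylock(). */
void conn_unlock(struct conn *c)
{
    atomic_flag_clear_explicit(&c->send_lock, memory_order_release);
}

/* Same shape as the quoted helper: if the lock cannot be taken, a sender
 * is active; if it can, nobody was sending, so drop it again at once. */
int conn_is_sending(struct conn *c)
{
    if (!conn_trylock(c))
        return 1;
    conn_unlock(c);
    return 0;
}
```

The answer is only a snapshot: by the time the caller looks at the return value, another thread may already have taken or released the lock.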
Regards Oliver From zbr at ioremap.net Tue Jan 27 05:51:44 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Tue, 27 Jan 2009 16:51:44 +0300 Subject: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling In-Reply-To: <200901271447.29377.oliver@neukum.org> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-7-git-send-email-andy.grover@oracle.com> <20090127133418.GH2646@ioremap.net> <200901271447.29377.oliver@neukum.org> Message-ID: <20090127135144.GC18119@ioremap.net> On Tue, Jan 27, 2009 at 02:47:27PM +0100, Oliver Neukum (oliver at neukum.org) wrote: > > > +static inline int rds_conn_is_sending(struct rds_connection *conn) > > > +{ > > > + int ret = 0; > > > + > > > + if (!mutex_trylock(&conn->c_send_lock)) > > > + ret = 1; > > > + else > > > + mutex_unlock(&conn->c_send_lock); > > > + > > > + return ret; > > > +} > > > + > > > > This one is eventually invoked under the spin_lock with turned off irqs, > > which may freeze the machine: > > rds_for_each_conn_info() -> spin_lock_irqsave(global lock) -> > > rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending() > > -> boom. > > Why? This is _trylock. It won't block. Unlock may reschedule. -- Evgeniy Polyakov From hal.rosenstock at gmail.com Tue Jan 27 08:10:08 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 27 Jan 2009 11:10:08 -0500 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsysstat: use RMPP for client/server communication In-Reply-To: <20090127121911.GA8534@sashak.voltaire.com> References: <20090126212621.GL5814@sashak.voltaire.com> <20090127121911.GA8534@sashak.voltaire.com> Message-ID: Sasha, On Tue, Jan 27, 2009 at 7:19 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 17:14 Mon 26 Jan , Hal Rosenstock wrote: >> On Mon, Jan 26, 2009 at 4:26 PM, Sasha Khapyorsky wrote: >> > >> > This patch adds support for bigger than (256 - vendor2 data offset) data >> > sending by ibsysstat server using RMPP. 
It fixes bug#1237 - where server >> > output was truncated due to MAD size limitation. >> >> Seems like the class version should be bumped for this change. > > Should it? Class vendor2 permits RMPP and it is defined in the spec as > version 1. I think it was ibsysstat bug/decision to not use/handle it. Just because the class allows RMPP doesn't mean all operations use it so no this wasn't a bug IMO. It was done that way for extensibility (possibility of longer exchanges/using RMPP) which wouldn't have been possible with vendor range 1. >> What's the behavior of old client with new server and new client with >> old server ? > > Basically it works fine together. Old server responds short MAD with > RMPP flags inactive. This is technically a protocol violation even though it may work. > The only interesting case is when old client gets > long RMPP reply from new server - it will grab 256 bytes (so as it was > before the data will be truncated) and server will get timeout from RMPP > layer - and yes, seems we need to handle(drop) this MAD (as well as > possible RMPP unrelated timeout/error on client side). Something like > this: Class version handling could also be used here to handle this. 
-- Hal > diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c > index c20a6f0..a145daf 100644 > --- a/infiniband-diags/src/ibsysstat.c > +++ b/infiniband-diags/src/ibsysstat.c > @@ -169,6 +174,11 @@ static char *ibsystat_serv(void) > DEBUG("starting to serve..."); > > while ((umad = mad_receive(buf, -1))) { > + if (umad_status(buf)) { > + DEBUG("drop mad with status %x: %s", umad_status(buf), > + strerror(umad_status(buf))); > + continue; > + } > > mad = umad_get_mad(umad); > > @@ -235,6 +245,9 @@ static char *ibsystat(ib_portid_t *portid, int attr) > if (umad_recv(fd, buf, &len, timeout) < 0) > IBPANIC("umad_recv failed."); > > + if (umad_status(buf)) > + return strerror(umad_status(buf)); > + > DEBUG("Got sysstat pong.."); > if (attr != IB_PING_ATTR) > puts(data); > > > Sasha > From swise at opengridcomputing.com Tue Jan 27 08:28:48 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 Jan 2009 10:28:48 -0600 Subject: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling In-Reply-To: <200901271447.29377.oliver@neukum.org> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-7-git-send-email-andy.grover@oracle.com> <20090127133418.GH2646@ioremap.net> <200901271447.29377.oliver@neukum.org> Message-ID: <497F3640.6070600@opengridcomputing.com> Oliver Neukum wrote: > Am Tuesday 27 January 2009 14:34:19 schrieb Evgeniy Polyakov: > >> On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover (andy.grover at oracle.com) wrote: >> >>> +static inline int rds_conn_is_sending(struct rds_connection *conn) >>> +{ >>> + int ret = 0; >>> + >>> + if (!mutex_trylock(&conn->c_send_lock)) >>> + ret = 1; >>> + else >>> + mutex_unlock(&conn->c_send_lock); >>> + >>> + return ret; >>> +} >>> + >>> >> This one is eventually invoked under the spin_lock with turned off irqs, >> which may freeze the machine: >> rds_for_each_conn_info() -> spin_lock_irqsave(global lock) -> >> rds_conn_info_visitor() -> 
rds_conn_info_set() -> rds_conn_is_sending() >> -> boom. >> > > Why? This is _trylock. It won't block. > > mutex_trylock() uses spin_lock_mutex() which has this in the debug version: DEBUG_LOCKS_WARN_ON(in_interrupt()); From sashak at voltaire.com Tue Jan 27 09:12:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 27 Jan 2009 19:12:31 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/ibsysstat: use RMPP for client/server communication In-Reply-To: References: <20090126212621.GL5814@sashak.voltaire.com> <20090127121911.GA8534@sashak.voltaire.com> Message-ID: <20090127171231.GD8534@sashak.voltaire.com> Hi Hal, On 11:10 Tue 27 Jan , Hal Rosenstock wrote: > > Just because the class allows RMPP doesn't mean all operations use it Ok, agree. But obviously it doesn't mean that RMPP should not be used. So I think we are fine now. > so no this wasn't a bug IMO. It was done that way for extensibility > (possibility of longer exchanges/using RMPP) which wouldn't have been > possible with vendor range 1. Ok, "bug" I used was too strong word (so I actually used bug/decision :)) > >> What's the behavior of old client with new server and new client with > >> old server ? > > > > Basically it works fine together. Old server responds short MAD with > > RMPP flags inactive. > > This is technically a protocol violation even though it may work. Why? RMPP can be used or not, right? (If not why is flag active/inactive bit needed?) Where do you see a violation? Sasha > > The only interesting case is when old client gets > > long RMPP reply from new server - it will grab 256 bytes (so as it was > > before the data will be truncated) and server will get timeout from RMPP > > layer - and yes, seems we need to handle(drop) this MAD (as well as > > possible RMPP unrelated timeout/error on client side). Something like > > this: > > Class version handling could also be used here to handle this. 
> > -- Hal > > > diff --git a/infiniband-diags/src/ibsysstat.c b/infiniband-diags/src/ibsysstat.c > > index c20a6f0..a145daf 100644 > > --- a/infiniband-diags/src/ibsysstat.c > > +++ b/infiniband-diags/src/ibsysstat.c > > @@ -169,6 +174,11 @@ static char *ibsystat_serv(void) > > DEBUG("starting to serve..."); > > > > while ((umad = mad_receive(buf, -1))) { > > + if (umad_status(buf)) { > > + DEBUG("drop mad with status %x: %s", umad_status(buf), > > + strerror(umad_status(buf))); > > + continue; > > + } > > > > mad = umad_get_mad(umad); > > > > @@ -235,6 +245,9 @@ static char *ibsystat(ib_portid_t *portid, int attr) > > if (umad_recv(fd, buf, &len, timeout) < 0) > > IBPANIC("umad_recv failed."); > > > > + if (umad_status(buf)) > > + return strerror(umad_status(buf)); > > + > > DEBUG("Got sysstat pong.."); > > if (attr != IB_PING_ATTR) > > puts(data); > > > > > > Sasha > > From jmulik at desu.edu Tue Jan 27 10:37:46 2009 From: jmulik at desu.edu (Jaiwant Mulik) Date: Tue, 27 Jan 2009 13:37:46 -0500 Subject: [ofa-general] ***SPAM*** rping with size over 64 bytes. Message-ID: <6E68F14A-69FA-4A6D-9B8B-0542F0C58E4A@desu.edu> Hi all, Here is the config I am running: ------------------------------ OS: Linux iwarp2 2.6.20-1.2320.fc5 #1 SMP Tue Jun 12 18:50:49 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux OFED: 1.3.1 librdmacm: 1.0.7 hw: Chelsio 302X --------------------------- ulimit -l is set to 128 rping does not work with size > 64 bytes. rping -c -v -a 10.0.0.2 -p 9999 -C 1 -S 64 (works, I can see the output) rping -c -v -a 10.0.0.2 -p 9999 -C 1 -S 65 (does not work) Any suggestions? ------------------------------------------------------------------ Assistant Professor Computer and Information Sciences Department Delaware State University, Dover, DE (302) 857-7910/6640, http://netlab.cis.desu.edu ------------------------------------------------------------------ Lekin woh zindagi hi kya jisme koi namumkin sapna na ho? 
From swise at opengridcomputing.com Tue Jan 27 10:48:18 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 Jan 2009 12:48:18 -0600 Subject: [ofa-general] ***SPAM*** rping with size over 64 bytes. In-Reply-To: <6E68F14A-69FA-4A6D-9B8B-0542F0C58E4A@desu.edu> References: <6E68F14A-69FA-4A6D-9B8B-0542F0C58E4A@desu.edu> Message-ID: <497F56F2.6070302@opengridcomputing.com> Are you setting -S 65 on the server side as well? I.e., the parameters must match on the client and server. Jaiwant Mulik wrote: > Hi all, > > Here is the config I am running: > ------------------------------ > OS: Linux iwarp2 2.6.20-1.2320.fc5 #1 SMP Tue Jun 12 18:50:49 EDT 2007 > x86_64 x86_64 x86_64 GNU/Linux > OFED: 1.3.1 > librdmacm: 1.0.7 > hw: Chelsio 302X > --------------------------- > > ulimit -l is set to 128 > > rping does not work with size > 64 bytes. > rping -c -v -a 10.0.0.2 -p 9999 -C 1 -S 64 (works, I can see the output) > rping -c -v -a 10.0.0.2 -p 9999 -C 1 -S 65 (does not work) > > Any suggestions? > > ------------------------------------------------------------------ > Assistant Professor > Computer and Information Sciences Department > Delaware State University, Dover, DE > (302) 857-7910/6640, http://netlab.cis.desu.edu > ------------------------------------------------------------------ > Lekin woh zindagi hi kya jisme koi namumkin sapna na ho?
> > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From mkatiyar at gmail.com Tue Jan 27 10:28:45 2009 From: mkatiyar at gmail.com (Manish Katiyar) Date: Tue, 27 Jan 2009 23:58:45 +0530 Subject: [ofa-general] ***SPAM*** [PATCH] : Define debugging variables only when CONFIG_INFINIBAND_NES_DEBUG is enabled Message-ID: Below patch removes following compilation warnings : drivers/infiniband/hw/nes/nes_cm.c:781: warning: unused variable 'tmp_addr' drivers/infiniband/hw/nes/nes_cm.c:820: warning: unused variable 'tmp_addr' Signed-off-by: Manish Katiyar --- drivers/infiniband/hw/nes/nes_cm.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index a01b448..2b34859 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -778,7 +778,9 @@ static struct nes_cm_node *find_node(struct nes_cm_core *cm_core, unsigned long flags; struct list_head *hte; struct nes_cm_node *cm_node; +#ifdef CONFIG_INFINIBAND_NES_DEBUG __be32 tmp_addr = cpu_to_be32(loc_addr); +#endif /* get a handle on the hte */ hte = &cm_core->connected_nodes; @@ -817,7 +819,9 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; +#ifdef CONFIG_INFINIBAND_NES_DEBUG __be32 tmp_addr = cpu_to_be32(dst_addr); +#endif /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); -- 1.5.4.3 Thanks - Manish From jaiwant at mulik.com Tue Jan 27 10:52:29 2009 From: jaiwant at mulik.com (Jaiwant Mulik) Date: Tue, 27 Jan 2009 13:52:29 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** rping with size over 64 bytes. 
In-Reply-To: <07894AD5756EC14BB4573782604815BD451BA19A8F@MAILBOX.desu.edu> References: <6E68F14A-69FA-4A6D-9B8B-0542F0C58E4A@desu.edu> <07894AD5756EC14BB4573782604815BD451BA19A8F@MAILBOX.desu.edu> Message-ID: <798F5C77-A4AD-403D-B69E-47CF471649F4@mulik.com> Ah ... ok. That worked. I was not setting any parameter on the server. On Jan 27, 2009, at 1:45 PM, Steve Wise wrote: > Are you setting -S 65 on the server side as well? I.e., the > parameters must match on the client and server. > > > Jaiwant Mulik wrote: >> Hi all, >> >> Here is the config I am running: >> ------------------------------ >> OS: Linux iwarp2 2.6.20-1.2320.fc5 #1 SMP Tue Jun 12 18:50:49 EDT >> 2007 >> x86_64 x86_64 x86_64 GNU/Linux >> OFED: 1.3.1 >> librdmacm: 1.0.7 >> hw: Chelsio 302X >> --------------------------- >> >> ulimit -l is set to 128 >> >> rping does not work with size > 64 bytes. >> rping -c -v -a 10.0.0.2 -p 9999 -C 1 -S 64 (works, I can see the >> output) >> rping -c -v -a 10.0.0.2 -p 9999 -C 1 -S 65 (does not work) >> >> Any suggestions? >> >> ------------------------------------------------------------------ >> Assistant Professor >> Computer and Information Sciences Department >> Delaware State University, Dover, DE >> (302) 857-7910/6640, http://netlab.cis.desu.edu >> ------------------------------------------------------------------ >> Lekin woh zindagi hi kya jisme koi namumkin sapna na ho?
>> >> >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > ------------------------------------------------------------------ Assistant Professor Computer and Information Sciences Department Delaware State University, Dover, DE (302) 857-7910/6640, http://netlab.cis.desu.edu ------------------------------------------------------------------ Lekin woh zindagi hi kya jisme koi namumkin sapna na ho? From andy.grover at gmail.com Tue Jan 27 11:10:49 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:10:49 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: <20090127131049.GE2646@ioremap.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-4-git-send-email-andy.grover@oracle.com> <20090127131049.GE2646@ioremap.net> Message-ID: On Tue, Jan 27, 2009 at 5:10 AM, Evgeniy Polyakov wrote: > On Mon, Jan 26, 2009 at 06:17:40PM -0800, Andy Grover (andy.grover at oracle.com) wrote: >> +/* >> + * Yes, a global lock. It's used so infrequently that it's worth keeping it >> + * global to simplify the locking. It's only used in the following >> + * circumstances: >> + * >> + * - on connection buildup to associate a conn with its maps > > Is this a rare condition? Is this protocol only intended for the > long-living connections and is not suitable for the cases when lots of > them are created and teared down quickly? Connections are long-lived. Imagine a cluster. RDS multiplexes all sockets' datagrams between 2 hosts over a single transport-layer connection, so if a node sends ONE datagram to another, an IB connection is set up and sticks around indefinitely. 
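[Editorial sketch, not part of the original message.] The multiplexing described above can be pictured as a lookup keyed by the host pair alone: every socket that talks between the same two hosts ends up on the same connection object, and that object is never released. The code below is a rough userspace model under those assumptions. The fixed-size table and the names `struct conn`, `conn_get()` and `send_datagram()` are hypothetical and are not taken from the RDS patches.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CONNS 16   /* arbitrary table size for the sketch */

/* Hypothetical model of one host-to-host transport connection. */
struct conn {
    uint32_t laddr;          /* local host address */
    uint32_t faddr;          /* remote host address */
    unsigned long datagrams; /* datagrams carried so far */
    int in_use;
};

struct conn conn_table[MAX_CONNS];

/* Find the connection for this host pair, creating it on first use.
 * Returns NULL only when the table is full. Nothing is ever torn down,
 * which models "sticks around indefinitely". */
struct conn *conn_get(uint32_t laddr, uint32_t faddr)
{
    struct conn *free_slot = NULL;

    for (size_t i = 0; i < MAX_CONNS; i++) {
        struct conn *c = &conn_table[i];

        if (c->in_use && c->laddr == laddr && c->faddr == faddr)
            return c;
        if (!c->in_use && free_slot == NULL)
            free_slot = c;
    }
    if (free_slot != NULL) {
        free_slot->laddr = laddr;
        free_slot->faddr = faddr;
        free_slot->datagrams = 0;
        free_slot->in_use = 1;
    }
    return free_slot;
}

/* Ports identify the sending and receiving sockets, not the connection,
 * so they play no part in the lookup. */
struct conn *send_datagram(uint32_t laddr, uint16_t sport,
                           uint32_t faddr, uint16_t dport)
{
    struct conn *c = conn_get(laddr, faddr);

    (void)sport;
    (void)dport;
    if (c != NULL)
        c->datagrams++;
    return c;
}
```

In this model, connection setup happens once per host pair rather than once per socket, which is consistent with the claim that the global lock taken at connection buildup is used rarely.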
Regards -- Andy From andy.grover at gmail.com Tue Jan 27 11:15:05 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:15:05 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: <20090126194820.41cdb7f5@extreme> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-4-git-send-email-andy.grover@oracle.com> <20090126194820.41cdb7f5@extreme> Message-ID: On Mon, Jan 26, 2009 at 7:48 PM, Stephen Hemminger wrote: > So this is starting to look like another "Oracle special" like AIO > and HugeTLB. That has lots of caveat restrictions on the application. Yep it's a datacenter-centric protocol. Regards -- Andy From andy.grover at gmail.com Tue Jan 27 11:23:04 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:23:04 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 02/21] RDS: Main header file In-Reply-To: <20090127130543.GD2646@ioremap.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-3-git-send-email-andy.grover@oracle.com> <20090127130543.GD2646@ioremap.net> Message-ID: On Tue, Jan 27, 2009 at 5:05 AM, Evgeniy Polyakov wrote: >> +#define RDS_PORT 18634 >> + > > What will happen if some application already uses that port? RDS errors out. Yeah we're going to want to get an assigned port at some point, I guess. 
Regards -- Andy From swise at opengridcomputing.com Tue Jan 27 11:24:31 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 27 Jan 2009 13:24:31 -0600 Subject: [ofa-general] ***SPAM*** Re: [PATCH 02/21] RDS: Main header file In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-3-git-send-email-andy.grover@oracle.com> <20090127130543.GD2646@ioremap.net> Message-ID: <497F5F6F.4050307@opengridcomputing.com> Andrew Grover wrote: > On Tue, Jan 27, 2009 at 5:05 AM, Evgeniy Polyakov wrote: > >>> +#define RDS_PORT 18634 >>> + >>> >> What will happen if some application already uses that port? >> > > RDS errors out. > > Yeah we're going to want to get an assigned port at some point, I guess. > > You should start that process now. It took a while to get nfsrdma's port number through... From andy.grover at gmail.com Tue Jan 27 11:27:34 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:27:34 -0800 Subject: [ofa-general] Re: [PATCH 02/21] RDS: Main header file In-Reply-To: <200901270934.16669.remi.denis-courmont@nokia.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-3-git-send-email-andy.grover@oracle.com> <200901270934.16669.remi.denis-courmont@nokia.com> Message-ID: On Mon, Jan 26, 2009 at 11:34 PM, Rémi Denis-Courmont wrote: >> +#ifndef PF_RDS >> +#define PF_RDS AF_RDS >> +#endif > > You should probably remove that and put the last patch of your series ahead of > this one. Yup will do. >> +#ifndef SOL_RDS >> +#define SOL_RDS 272 >> +#endif > > This is used by RXRPC nowadays, although I myself don't really understand why > socket option levels need to be unique across all families. OK, shouldn't be too hard to fix. Thanks. 
Regards -- Andy From andy.grover at gmail.com Tue Jan 27 11:29:39 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:29:39 -0800 Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/21] Reliable Datagram Sockets (RDS) In-Reply-To: <497F29A1.60003@opengridcomputing.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <497F29A1.60003@opengridcomputing.com> Message-ID: On Tue, Jan 27, 2009 at 7:34 AM, Steve Wise wrote: > Hey Andy, > > Why didn't you include the iWARP transport as well? Hi Steve, As I mentioned on IRC, there are some ib/iw coexistence issues and other minor bugs to resolve, and then I will include the iWARP code. Regards -- Andy From andy.grover at gmail.com Tue Jan 27 11:31:26 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:31:26 -0800 Subject: [ofa-general] Re: [PATCH 21/21] RDS: Add AF and PF #defines for RDS sockets In-Reply-To: <200901270927.51855.remi.denis-courmont@nokia.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-22-git-send-email-andy.grover@oracle.com> <200901270927.51855.remi.denis-courmont@nokia.com> Message-ID: On Mon, Jan 26, 2009 at 11:27 PM, Rémi Denis-Courmont wrote: > You also need to add lock class declaration to net/core/sock.c, I believe. Very true, thanks. 
Regards -- Andy From andy.grover at gmail.com Tue Jan 27 11:36:37 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 11:36:37 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 04/21] RDS: Transport code In-Reply-To: <20090127131808.GF2646@ioremap.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-5-git-send-email-andy.grover@oracle.com> <20090127131808.GF2646@ioremap.net> Message-ID: On Tue, Jan 27, 2009 at 5:18 AM, Evgeniy Polyakov wrote: > On Mon, Jan 26, 2009 at 06:17:41PM -0800, Andy Grover (andy.grover at oracle.com) wrote: >> +static LIST_HEAD(transports); >> +static DECLARE_RWSEM(trans_sem); >> + > > RDS_ prefix? Even needed for statics? >> +int rds_trans_register(struct rds_transport *trans) >> +{ >> + BUG_ON(strlen(trans->t_name) + 1 > >> + sizeof(((struct rds_info_connection *)0)->transport)); >> + > > Wow. Why not declare 15 as some constant and put it into rds_transport > structure definition? Makes sense. >> + if (IN_LOOPBACK(ntohl(addr))) >> + return &rds_loop_transport; >> + > > Tabs have run away. Will fix. Thanks. Regards -- Andy From zbr at ioremap.net Tue Jan 27 13:56:58 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Wed, 28 Jan 2009 00:56:58 +0300 Subject: [ofa-general] Re: [PATCH 04/21] RDS: Transport code In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-5-git-send-email-andy.grover@oracle.com> <20090127131808.GF2646@ioremap.net> Message-ID: <20090127215657.GD12431@ioremap.net> On Tue, Jan 27, 2009 at 11:36:37AM -0800, Andrew Grover (andy.grover at gmail.com) wrote: > On Tue, Jan 27, 2009 at 5:18 AM, Evgeniy Polyakov wrote: > > On Mon, Jan 26, 2009 at 06:17:41PM -0800, Andy Grover (andy.grover at oracle.com) wrote: > >> +static LIST_HEAD(transports); > >> +static DECLARE_RWSEM(trans_sem); > >> + > > > > RDS_ prefix? > > Even needed for statics? 
It confuses tags and the like otherwise, and looks more consistent with the rest of the code. Likely it is not a must, but just better look. -- Evgeniy Polyakov From andy.grover at gmail.com Tue Jan 27 14:15:52 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Tue, 27 Jan 2009 14:15:52 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 04/21] RDS: Transport code In-Reply-To: <20090127215657.GD12431@ioremap.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-5-git-send-email-andy.grover@oracle.com> <20090127131808.GF2646@ioremap.net> <20090127215657.GD12431@ioremap.net> Message-ID: On Tue, Jan 27, 2009 at 1:56 PM, Evgeniy Polyakov wrote: >> > RDS_ prefix? >> >> Even needed for statics? > > It confuses tags and the like otherwise, and looks more consistent with > the rest of the code. Likely it is not a must, but just better look. Yup, will do, just was curious. Regards -- Andy From rdreier at cisco.com Tue Jan 27 15:53:16 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 27 Jan 2009 15:53:16 -0800 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <497EF9AC.70104@poczta.onet.pl> (aluno3@poczta.onet.pl's message of "Tue, 27 Jan 2009 13:10:20 +0100") References: <497EF9AC.70104@poczta.onet.pl> Message-ID: Interesting... looks like an unfortunate interaction with unclear locking rules. See below for full explanation. BTW, what workload are you running to hit this? I assume you have CONFIG_HIGHMEM set? 
> WARNING: at kernel/softirq.c:136 local_bh_enable+0x9b/0xa0()

I assume this is

	WARN_ON_ONCE(in_irq() || irqs_disabled());

The interesting parts of the stack trace seem to be (reversing the order
so the story makes sense):

[] nes_netdev_start_xmit+0x815/0x8a0 [iw_nes]

nes_netdev_start_xmit() calls skb_linearize() for nonlinear skbs it
can't handle, which calls __pskb_pull_tail():

[] __pskb_pull_tail+0x5c/0x2e0

__pskb_pull_tail() calls skb_copy_bits():

[] skb_copy_bits+0x155/0x290

At least in some cases, skb_copy_bits() calls kmap_skb_frag() and more
to the point kunmap_skb_frag(), which looks like:

static inline void kunmap_skb_frag(void *vaddr)
{
	kunmap_atomic(vaddr, KM_SKB_DATA_SOFTIRQ);
#ifdef CONFIG_HIGHMEM
	local_bh_enable();
#endif
}

which leads to:

[] local_bh_enable+0x9b/0xa0

which hits the irqs_disabled() warning because iw_nes is using LLTX, and
nes_netdev_start_xmit() does:

	local_irq_save(flags);
	if (!spin_trylock(&nesnic->sq_lock)) {

at the very beginning.

The best solution is probably for iw_nes to stop using LLTX and use the
main netdev lock... but actually I still don't see how it's safe for a
net driver to call skb_linearize() from its transmit routine, since
there's a chance that that will unconditionally enable BHs?

 - R.

From davem at davemloft.net Tue Jan 27 16:07:50 2009
From: davem at davemloft.net (David Miller)
Date: Tue, 27 Jan 2009 16:07:50 -0800 (PST)
Subject: [ofa-general] NetEffect, iw_nes and kernel warning
In-Reply-To:
References: <497EF9AC.70104@poczta.onet.pl>
Message-ID: <20090127.160750.120120703.davem@davemloft.net>

From: Roland Dreier
Date: Tue, 27 Jan 2009 15:53:16 -0800

> but actually I still don't see how it's safe for a net driver to
> call skb_linearize() from its transmit routine, since there's a
> chance that that will unconditionally enable BHs?

It's simply not allowed.

dev_queue_xmit() at a higher level can do __skb_linearize()
because it does so before doing the rcu_read_lock_bh().

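[Editor's illustration, not part of the original thread.] The ordering problem discussed above can be sketched as a small user-space toy model. This is not kernel code: every `model_*` name is hypothetical, the structure only imitates the check Roland guesses at (`WARN_ON_ONCE(in_irq() || irqs_disabled())`), and it assumes that the highmem copy path disables bottom halves before the copy and re-enables them afterwards. It shows why linearizing after `local_irq_save()` trips the warning, while linearizing before interrupts are disabled does not.

```c
/* User-space toy model of the warning; NOT kernel code.
 * All model_* names are hypothetical and exist only for illustration. */
#include <assert.h>
#include <stdbool.h>

struct model_cpu {
	bool irqs_disabled;    /* set while the "driver" holds local_irq_save() */
	int  bh_disable_depth; /* nesting depth of bottom-half disabling */
	int  warnings;         /* times the modeled WARN_ON condition fired */
};

static void model_local_irq_save(struct model_cpu *c)    { c->irqs_disabled = true; }
static void model_local_irq_restore(struct model_cpu *c) { c->irqs_disabled = false; }
static void model_local_bh_disable(struct model_cpu *c)  { c->bh_disable_depth++; }

static void model_local_bh_enable(struct model_cpu *c)
{
	/* Imitates the irqs_disabled() half of the check quoted in the thread. */
	if (c->irqs_disabled)
		c->warnings++;
	assert(c->bh_disable_depth > 0);
	c->bh_disable_depth--;
}

/* Assumed shape of the highmem copy path: BHs off, copy, BHs back on. */
static void model_skb_linearize(struct model_cpu *c)
{
	model_local_bh_disable(c);
	/* ... the fragment copy would happen here ... */
	model_local_bh_enable(c);
}

/* LLTX-style transmit: interrupts are disabled first, then linearize. */
int model_xmit_lltx(struct model_cpu *c)
{
	model_local_irq_save(c);
	model_skb_linearize(c);      /* bh_enable runs with irqs disabled */
	model_local_irq_restore(c);
	return c->warnings;
}

/* Linearize before interrupts are disabled, as a higher layer would. */
int model_xmit_linearize_first(struct model_cpu *c)
{
	model_skb_linearize(c);      /* bh_enable runs with irqs enabled */
	model_local_irq_save(c);
	/* ... the packet would be queued to hardware here ... */
	model_local_irq_restore(c);
	return c->warnings;
}
```

Starting from a zeroed `struct model_cpu`, the LLTX-style path records one warning and the linearize-first path records none. The model says nothing about which fix is right for iw_nes; it only makes the ordering constraint concrete.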
From shemminger at vyatta.com Tue Jan 27 16:17:09 2009
From: shemminger at vyatta.com (Stephen Hemminger)
Date: Tue, 27 Jan 2009 16:17:09 -0800
Subject: [ofa-general] NetEffect, iw_nes and kernel warning
In-Reply-To: <20090127.160750.120120703.davem@davemloft.net>
References: <497EF9AC.70104@poczta.onet.pl> <20090127.160750.120120703.davem@davemloft.net>
Message-ID: <20090127161709.25072f82@extreme>

On Tue, 27 Jan 2009 16:07:50 -0800 (PST) David Miller wrote:
> From: Roland Dreier
> Date: Tue, 27 Jan 2009 15:53:16 -0800
>
> > but actually I still don't see how it's safe for a net driver to
> > call skb_linearize() from its transmit routine, since there's a
> > chance that that will unconditionally enable BHs?
>
> It's simply not allowed.
>
> dev_queue_xmit() at a higher level can do __skb_linearize()
> because it does so before doing the rcu_read_lock_bh().

If the device driver can't handle non-linear SKBs then it should not
set NETIF_F_SG.

From PHF at zurich.ibm.com Wed Jan 28 01:52:27 2009
From: PHF at zurich.ibm.com (Philip Frey1)
Date: Wed, 28 Jan 2009 10:52:27 +0100
Subject: [ofa-general] 0-length RDMA Read
Message-ID:

Hi,

Since iWARP requires that, after MPA connection establishment, the
MPA initiator send the first FPDU, I wanted to do that using a 0-length
RDMA Read. When using the T3 Chelsio RNIC, I end up with a segmentation
fault from libcxgb3 (function t3b_post_send).

I was trying to post a 0-length WR for RDMA Read like this:

struct ibv_send_wr wr;

wr.wr_id = 1;
wr.next = NULL;
wr.sg_list = NULL;
wr.num_sge = 0;
wr.wr.rdma.remote_addr = 0;
wr.wr.rdma.rkey = 0;
wr.opcode = IBV_WR_RDMA_READ;
wr.send_flags = IBV_SEND_SIGNALED;

ibv_post_send(qp, &wr, &bad_wr);

Question1: Are 0-length RDMA Reads supported at all by the T3?
Question2: If they are, how do I have to write a correct send WR?

Many thanks for your advice, Philip -- Philip Frey IBM Zurich Research Laboratory Saumerstrasse 4 | Phone: +41 44 724 8613 CH-8803 Rueschlikon/Switzerland | Email: phf at zurich.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad at lists.openfabrics.org Wed Jan 28 03:14:08 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 28 Jan 2009 03:14:08 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090128-0200 daily build status Message-ID: <20090128111408.A3941E60CB0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on 
ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dledford at redhat.com Wed Jan 28 05:57:36 2009 From: dledford at redhat.com (Doug Ledford) Date: Wed, 28 Jan 2009 08:57:36 -0500 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <4978EE0E.5050209@opengridcomputing.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> Message-ID: <1233151056.5637.2.camel@firewall.xsintricity.com> On Thu, 2009-01-22 at 16:07 -0600, Steve Wise wrote: > I understand the desire to not release new features in a point release, > but at the same time, these features are ready or near ready now. And > prior features have definitely been released in point releases. > (connectX for example). Another key point is that these features do not > need the kernel rebase that will happen with ofed-1.5, which will take > months... > > Just more thoughts. :) I'm a bit late to this discussion, and you may have already talked about this in the ewg teleconference, but I want to throw in my thoughts. As far as new features goes, adding ConnectX support in a point release is a huge difference from switching OpenMPI releases from a stable series to the .0 release of the next series. In the case of ConnectX, it was "just another driver" and its addition should have had almost 0 impact on anyone not using that driver. 
On the other hand, switching OpenMPI versions changes the OpenMPI stack for everyone and has the potential to create wide spread regressions should something go wrong. So the risk factor comparison between these two actions simply isn't valid. One doesn't risk regressions for non-ConnectX users, one risks regressions for everyone using OpenSM. > Steve. > > > Woodruff, Robert J wrote: > > I think that we need to discuss this in the EWG meeting. > > In the past I think that we have agreed to only do bug fixes > > in point release and not add major new features. > > If we do want to include the new MPI, then perhaps we should call > > it 1.5 and pull in the schedule for 1.5. Just a thought. > > > > woody > > > > > > -----Original Message----- > > From: Steve Wise [mailto:swise at opengridcomputing.com] > > Sent: Thursday, January 22, 2009 1:46 PM > > To: John Russo > > Cc: Woodruff, Robert J; general at lists.openfabrics.org; ewg at lists.openfabrics.org > > Subject: Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x > > > > I think releasing OMPI-1.3 with iWARP support is also good justification. > > > > And there are RDS issues with ofed-1.4 even over IB that I think will > > add to justification. > > > > > > John Russo wrote: > > > >> I understand but I think that this is another consideration that should be factored in. Even if there are no "critical" PRs to fix, the introduction of RHEL 5.3 (along with less critical PRs) may be enough justification. > >> > >> I simply want to plant the seed in everyone's mind before our next meeting. 
> >> > >> Thanks > >> > >> -----Original Message----- > >> From: Woodruff, Robert J [mailto:robert.j.woodruff at intel.com] > >> Sent: Thursday, January 22, 2009 3:44 PM > >> To: John Russo; general at lists.openfabrics.org > >> Cc: ewg at lists.openfabrics.org > >> Subject: RE: RHEL 5.3 and OFED 1.4.x > >> > >> In the last EWG meeting, we discussed waiting a month or so and seeing what kind of bugs > >> were reported against 1.4 to determine if a 1.4.1 release was needed. > >> > >> > >> ________________________________ > >> > >> From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of John Russo > >> Sent: Thursday, January 22, 2009 12:37 PM > >> To: general at lists.openfabrics.org > >> Subject: [ofa-general] RHEL 5.3 and OFED 1.4.x > >> > >> > >> > >> Does the release of RHEL 5.3 create any additional justification for a maintenance release of OFED (1.4.1) to be generated? I am already hearing requests for an OFED release that will support it. > >> > >> > >> > >> John Russo > >> > >> QLogic > >> > >> _______________________________________________ > >> ewg mailing list > >> ewg at lists.openfabrics.org > >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > >> > >> > > _______________________________________________ > ewg mailing list > ewg at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL:

From ogerlitz at voltaire.com Wed Jan 28 06:44:29 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 28 Jan 2009 16:44:29 +0200 (IST)
Subject: [ofa-general] problems using smpdump
Message-ID:

Hi Sasha,

I'm having some problems with smpdump when it is used with the Mellanox
IS4 switch. For example, for nodeinfo (0x11), smpquery and smpdump run
against an HCA in this form

$ smpquery nodeinfo LID
$ smpdump LID 0x11

produce the --same-- response MAD, but when run against the IS4, smpdump
doesn't return anything. Invoking both with -ddd, I noticed these two
lines:

ibwarn: [30549] umad_set_addr: umad 0xae0c010 dlid 12 dqp 0 sl 65535, qkey 0
ibwarn: [30549] umad_addr_dump: qpn 0 qkey 0x0 lid 0xc sl 255

so the SLs used by smpdump are weird (255 and 65535). Is that just a
printing error, or could it be a possible explanation for the failure?

I am using the latest mng git and did all the runs from a host; see the
run output and also the IS4 FW/etc. info (mstflint q) below.

Or.

$ /home/ogerlitz/ib-mng/sbin/ibnetdiscover -P 2 # # Topology file: generated on Wed Jan 28 16:22:10 2009 # # Max of 2 hops discovered # Initiated from node 0002c90300026be2 port 0002c90300026be4 vendid=0x2c9 devid=0xbd36 sysimgguid=0x8f100010c0063 switchguid=0x8f100010c0062(8f100010c0062) Switch 36 "S-0008f100010c0062" # "Infiniscale-IV Mellanox Technologies" base port 0 lid 12 lmc 0 [27] "H-0002c90300026be6"[2](2c90300026be8) # "linux-cto-1 HCA-1" lid 7 4xDDR [6] "H-0002c90300026be2"[2](2c90300026be4) # " HCA-1" lid 3 4xDDR vendid=0x2c9 devid=0x6732 sysimgguid=0x2c90300026be9 caguid=0x2c90300026be6 Ca 2 "H-0002c90300026be6" # "linux-cto-1 HCA-1" [2](2c90300026be8) "S-0008f100010c0062"[27] # lid 7 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 12 4xDDR vendid=0x2c9 devid=0x6732 sysimgguid=0x2c90300026be5 caguid=0x2c90300026be2 Ca 2 "H-0002c90300026be2" # " HCA-1" [2](2c90300026be4) "S-0008f100010c0062"[6] # lid 3 lmc 0 "Infiniscale-IV Mellanox Technologies" lid 12 4xDDR $ /home/ogerlitz/ib-mng/sbin/smpquery -P 2 nodeinfo 12 # Node info: Lid 12 BaseVers:........................1 ClassVers:.......................1 NodeType:........................Switch NumPorts:........................36 SystemGuid:......................0x0008f100010c0063 Guid:............................0x0008f100010c0062 PortGuid:........................0x0008f100010c0062 PartCap:.........................8 DevId:...........................0xbd36 Revision:........................0x000000a0 LocalPort:.......................6 VendorId:........................0x0002c9 $ /home/ogerlitz/ib-mng/sbin/smpdump -P 2 12 0x11 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 $ /home/ogerlitz/ib-mng/sbin/smpdump -ddd -P 2 12 0x11 before send: 0101 0101 0000 0000 0000 0000 0000 0123 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibwarn: [30549] umad_init: umad_init ibwarn: [30549] umad_open_port: ca (null) port 2 ibwarn: [30549] umad_get_cas_names: max 20 ibwarn: [30549] umad_get_cas_names: return 1 cas ibwarn: [30549] resolve_ca_name: checking ca 'mlx4_0' ibwarn: [30549] resolve_ca_port: checking ca 'mlx4_0' ibwarn: [30549] umad_get_ca: ca_name mlx4_0 ibwarn: [30549] umad_get_ca: opened mlx4_0 ibwarn: [30549] resolve_ca_name: found ca mlx4_0 with port 2 type 1 ibwarn: [30549] resolve_ca_name: found ca mlx4_0 with active port 2 ibwarn: [30549] umad_open_port: opening mlx4_0 port 2 ibwarn: [30549] dev_to_umad_id: mapped mlx4_0 2 to 1 ibwarn: [30549] umad_open_port: opened /dev/infiniband/umad1 fd 3 portid 1 ibwarn: [30549] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil) ibwarn: [30549] umad_register: fd 3 registered to use agent 0 qp 0 ibwarn: [30549] umad_set_addr: umad 0xae0c010 dlid 12 dqp 0 sl 65535, qkey 0 ibwarn: [30549] umad_send: fd 3 agentid 0 umad 0xae0c010 timeout 1000 ibwarn: [30549] umad_dump: agent id 0 status 0 timeout 1000 ibwarn: [30549] umad_addr_dump: qpn 0 qkey 0x0 lid 0xc sl 255 grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0 Gid 0x00000000000000000000000000000000 ibwarn: [30549] umad_recv: fd 3 umad 0xae0c010 timeout 4294967295 ibwarn: [30549] umad_recv: mad received by agent 0 length 88 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 $ /home/ogerlitz/ib-mng/sbin/smpquery -ddd -P 2 nodeinfo 12 send buf 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 000c 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101 0101 0000 0000 4b6e eb31 6545 c29drcv buf 0101 0181 0000 0000 0000 019d 6545 c29d 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101 0224 0008 f100 010c 0063 0008 f100 010c 0062 0008 f100 010c 0062 0008 bd36 0000 00a0 0600 02c9 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 mad data 0101 0224 0008 f100 010c 0063 0008 f100 010c 0062 0008 f100 010c 0062 0008 bd36 0000 00a0 0600 02c9 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibwarn: [30756] umad_init: umad_init ibwarn: [30756] umad_open_port: ca (null) port 2 ibwarn: [30756] umad_get_cas_names: max 20 ibwarn: [30756] umad_get_cas_names: return 1 cas ibwarn: [30756] resolve_ca_name: checking ca 'mlx4_0' ibwarn: [30756] resolve_ca_port: checking ca 'mlx4_0' ibwarn: [30756] umad_get_ca: ca_name mlx4_0 ibwarn: [30756] umad_get_ca: opened mlx4_0 ibwarn: [30756] resolve_ca_name: found ca mlx4_0 with port 2 type 1 ibwarn: [30756] resolve_ca_name: found ca mlx4_0 with active port 2 ibwarn: [30756] umad_open_port: opening mlx4_0 port 2 ibwarn: [30756] dev_to_umad_id: mapped mlx4_0 2 to 1 ibwarn: [30756] umad_open_port: opened /dev/infiniband/umad1 fd 3 portid 1 ibwarn: [30756] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil) ibwarn: [30756] umad_register: fd 3 registered to use agent 0 qp 0 ibwarn: [30756] umad_register: fd 3 
mgmt_class 129 mgmt_version 1 rmpp_version 0 method_mask (nil) ibwarn: [30756] umad_register: fd 3 registered to use agent 1 qp 0 ibwarn: [30756] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil) ibwarn: [30756] umad_register: fd 3 registered to use agent 2 qp 1 ibwarn: [30756] smp_query_via: attr 0x11 mod 0x0 route Lid 12 ibwarn: [30756] umad_set_addr: umad 0x7fffe43460c0 dlid 12 dqp 0 sl 0, qkey 0 ibwarn: [30756] _do_madrpc: >>> sending: len 256 pktsz 320 ibwarn: [30756] umad_send: fd 3 agentid 0 umad 0x7fffe43460c0 timeout 1000 ibwarn: [30756] umad_dump: agent id 0 status 0 timeout 1000 ibwarn: [30756] umad_addr_dump: qpn 0 qkey 0x0 lid 0xc sl 0 grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0 Gid 0x00000000000000000000000000000000 ibwarn: [30756] umad_recv: fd 3 umad 0x7fffe4345cc0 timeout 1000 ibwarn: [30756] umad_recv: mad received by agent 0 length 320 ibwarn: [30756] _do_madrpc: rcv buf: ibwarn: [30756] mad_rpc: data offs 64 sz 64 # Node info: Lid 12 BaseVers:........................1 ClassVers:.......................1 NodeType:........................Switch NumPorts:........................36 SystemGuid:......................0x0008f100010c0063 Guid:............................0x0008f100010c0062 PortGuid:........................0x0008f100010c0062 PartCap:.........................8 DevId:...........................0xbd36 Revision:........................0x000000a0 LocalPort:.......................6 VendorId:........................0x0002c9 $ /home/ogerlitz/ib-mng/sbin/smpquery -ddd -P 2 nodeinfo 7 send buf 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0007 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101 0101 0000 0000 0c26 8ddd 34d6 34b1 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 rcv buf 0101 0181 0000 0000 0000 01a6 34d6 34b1 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0101 0102 0002 c903 0002 6be9 0002 c903 0002 6be6 0002 c903 0002 6be8 0080 6732 0000 00a0 0200 02c9 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 mad data 0101 0102 0002 c903 0002 6be9 0002 c903 0002 6be6 0002 c903 0002 6be8 0080 6732 0000 00a0 0200 02c9 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibwarn: [32714] umad_init: umad_init ibwarn: [32714] umad_open_port: ca (null) port 2 ibwarn: [32714] umad_get_cas_names: max 20 ibwarn: [32714] umad_get_cas_names: return 1 cas ibwarn: [32714] resolve_ca_name: checking ca 'mlx4_0' ibwarn: [32714] resolve_ca_port: checking ca 'mlx4_0' ibwarn: [32714] umad_get_ca: ca_name mlx4_0 ibwarn: [32714] umad_get_ca: opened mlx4_0 ibwarn: [32714] resolve_ca_name: found ca mlx4_0 with port 2 type 1 ibwarn: [32714] resolve_ca_name: found ca mlx4_0 with active port 2 ibwarn: [32714] umad_open_port: opening mlx4_0 port 2 ibwarn: [32714] dev_to_umad_id: mapped mlx4_0 2 to 1 ibwarn: [32714] umad_open_port: opened /dev/infiniband/umad1 fd 3 portid 1 ibwarn: [32714] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 
method_mask (nil) ibwarn: [32714] umad_register: fd 3 registered to use agent 0 qp 0 ibwarn: [32714] umad_register: fd 3 mgmt_class 129 mgmt_version 1 rmpp_version 0 method_mask (nil) ibwarn: [32714] umad_register: fd 3 registered to use agent 1 qp 0 ibwarn: [32714] umad_register: fd 3 mgmt_class 3 mgmt_version 2 rmpp_version 1 method_mask (nil) ibwarn: [32714] umad_register: fd 3 registered to use agent 2 qp 1 ibwarn: [32714] smp_query_via: attr 0x11 mod 0x0 route Lid 7 ibwarn: [32714] umad_set_addr: umad 0x7fff2ae3abb0 dlid 7 dqp 0 sl 0, qkey 0 ibwarn: [32714] _do_madrpc: >>> sending: len 256 pktsz 320 ibwarn: [32714] umad_send: fd 3 agentid 0 umad 0x7fff2ae3abb0 timeout 1000 ibwarn: [32714] umad_dump: agent id 0 status 0 timeout 1000 ibwarn: [32714] umad_addr_dump: qpn 0 qkey 0x0 lid 0x7 sl 0 grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0 Gid 0x00000000000000000000000000000000 ibwarn: [32714] umad_recv: fd 3 umad 0x7fff2ae3a7b0 timeout 1000 ibwarn: [32714] umad_recv: mad received by agent 0 length 320 ibwarn: [32714] _do_madrpc: rcv buf: ibwarn: [32714] mad_rpc: data offs 64 sz 64 # Node info: Lid 7 BaseVers:........................1 ClassVers:.......................1 NodeType:........................Channel Adapter NumPorts:........................2 SystemGuid:......................0x0002c90300026be9 Guid:............................0x0002c90300026be6 PortGuid:........................0x0002c90300026be8 PartCap:.........................128 DevId:...........................0x6732 Revision:........................0x000000a0 LocalPort:.......................2 VendorId:........................0x0002c9 $ /home/ogerlitz/ib-mng/sbin/smpdump -ddd -P 2 7 0x11 before send: 0101 0101 0000 0000 0000 0000 0000 0123 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ibwarn: [305] umad_init: umad_init ibwarn: [305] umad_open_port: ca (null) port 2 ibwarn: [305] umad_get_cas_names: max 20 ibwarn: [305] umad_get_cas_names: return 1 cas ibwarn: [305] resolve_ca_name: checking ca 'mlx4_0' ibwarn: [305] resolve_ca_port: checking ca 'mlx4_0' ibwarn: [305] umad_get_ca: ca_name mlx4_0 ibwarn: [305] umad_get_ca: opened mlx4_0 ibwarn: [305] resolve_ca_name: found ca mlx4_0 with port 2 type 1 ibwarn: [305] resolve_ca_name: found ca mlx4_0 with active port 2 ibwarn: [305] umad_open_port: opening mlx4_0 port 2 ibwarn: [305] dev_to_umad_id: mapped mlx4_0 2 to 1 ibwarn: [305] umad_open_port: opened /dev/infiniband/umad1 fd 3 portid 1 ibwarn: [305] umad_register: fd 3 mgmt_class 1 mgmt_version 1 rmpp_version 0 method_mask (nil) ibwarn: [305] umad_register: fd 3 registered to use agent 0 qp 0 ibwarn: [305] umad_set_addr: umad 0x16174010 dlid 7 dqp 0 sl 65535, qkey 0 ibwarn: [305] umad_send: fd 3 agentid 0 umad 0x16174010 timeout 1000 ibwarn: [305] umad_dump: agent id 0 status 0 timeout 1000 ibwarn: [305] umad_addr_dump: qpn 0 qkey 0x0 lid 0x7 sl 255 grh_present 0 gid_index 0 hop_limit 0 traffic_class 0 flow_label 0x0 pkey_index 0x0 Gid 0x00000000000000000000000000000000 ibwarn: [305] umad_recv: fd 3 umad 0x16174010 timeout 4294967295 ibwarn: [305] umad_recv: mad received by agent 0 length 320 0101 0102 0002 c903 0002 6be9 0002 c903 0002 6be6 0002 c903 0002 6be8 0080 6732 0000 00a0 0200 02c9 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 --> here's the IS4 info # mstflint -d 81:00.0 q Image type: FS2 FW Version: 7.1.0 Device ID: 48438 Chip Revision: A0 Description: 
Node Sys image
GUIDs: 0008f100010c0062 0008f100010c0063
Board ID: (MT_0C20110003)
VSD:
PSID: MT_0C20110003

From swise at opengridcomputing.com Wed Jan 28 07:33:05 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 28 Jan 2009 09:33:05 -0600
Subject: [ofa-general] Re: 0-length RDMA Read
In-Reply-To:
References:
Message-ID: <49807AB1.3070308@opengridcomputing.com>

This looks like a bug. The lib assumes an SGE entry will be provided.

A workaround for now is to set num_sge to 1 and initialize the sge entry
to:

sge.addr 0
sge.lkey 2
sge.length 0
rkey 2
remote_addr 0

I'll fix this in libcxgb3 to allow num_sge == 0 to mean 0B read.

Also, right now you need to specify non-zero (and yet still valid
possible) lkey and rkey values. I'll fix this too so if length is 0 or
num_sge is 0, then the library will create a valid 0B read request for
you ignoring the other fields.

I opened bug 1496 for this:

https://bugs.openfabrics.org/show_bug.cgi?id=1496

Lemme know if you want a new libcxgb3 tarball with the fix.

By the way, as of ofed-1.3.1 and 2.6.27 kernels, iw_cxgb3 supports a
mode where it handles this client-must-send-first issue for you. There
is a module option called peer2peer. Set it to 1 and all subsequent
connections will handle this by doing a 0B read from the client.

'echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer' will do the
trick...

Steve.

Philip Frey1 wrote:
>
> Hi,
>
> since for iWARP, that after the MPA connection establishment, the
> MPA initiator must send the first FPDU, I wanted to do that using a
> 0-length
> RDMA Read. When using the T3 Chelsio RNIC, I end up with a segmentation
> fault from the libcxgb3 (function t3b_post_send).
> > I was trying to post a 0-lenght WR for RDMA Read like this: > > struct ibv_send_wr wr; > > wr.wr_id = 1; > wr.next = NULL; > wr.sg_list = NULL; > wr.num_sge = 0; > wr.wr.rdma.remote_addr = 0; > wr.wr.rdma.rkey = 0; > wr.opcode = IBV_WR_RDMA_READ; > wr.send_flags = IBV_SEND_SIGNALED; > > ibv_post_send(qp, &wr, &bad_wr); > > Question1: Are 0-length RDMA Reads supported at all by the T3? > Question2: If they are, how do I have to write a correct send WR? > > Many thanks for your advice, > Philip > > > > -- > Philip Frey > IBM Zurich Research Laboratory > Saumerstrasse 4 | Phone: +41 44 > 724 8613 > CH-8803 Rueschlikon/Switzerland | Email: phf at zurich.ibm.com From swise at opengridcomputing.com Wed Jan 28 07:38:12 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 28 Jan 2009 09:38:12 -0600 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <1233151056.5637.2.camel@firewall.xsintricity.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> <1233151056.5637.2.camel@firewall.xsintricity.com> Message-ID: <49807BE4.40305@opengridcomputing.com> Doug Ledford wrote: > On Thu, 2009-01-22 at 16:07 -0600, Steve Wise wrote: > >> I understand the desire to not release new features in a point release, >> but at the same time, these features are ready or near ready now. And >> prior features have definitely been released in point releases. >> (connectX for example). Another key point is that these features do not >> need the kernel rebase that will happen with ofed-1.5, which will take >> months... >> >> Just more thoughts. :) >> > > I'm a bit late to this discussion, and you may have already talked about > this in the ewg teleconference, but I want to throw in my thoughts. 
> > As far as new features goes, adding ConnectX support in a point release > is a huge difference from switching OpenMPI releases from a stable > series to the .0 release of the next series. In the case of ConnectX, > it was "just another driver" and its addition should have had almost 0 > impact on anyone not using that driver. On the other hand, switching > OpenMPI versions changes the OpenMPI stack for everyone and has the > potential to create wide spread regressions should something go wrong. > So the risk factor comparison between these two actions simply isn't > valid. One doesn't risk regressions for non-ConnectX users, one risks > regressions for everyone using OpenSM. > > Good points. One way to alleviate this is to ship both 1.2.8 and 1.3 in ofed-1.4.1 and mark 1.3 as "experimental". Then remove 1.2.8 in ofed-1.5 and make 1.3.x the production version for ofed-1.5. I suggested this in the last conf call but folks didn't like the thought of testing both. But perhaps marking it "experimental" resolves this issue? So the iWARP vendors will test 1.3 and little to no testing is required for 1.2.8 since it has been qualified with ofed-1.4 QA. Steve. 
From dledford at redhat.com Wed Jan 28 07:58:30 2009 From: dledford at redhat.com (Doug Ledford) Date: Wed, 28 Jan 2009 10:58:30 -0500 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <49807BE4.40305@opengridcomputing.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> <1233151056.5637.2.camel@firewall.xsintricity.com> <49807BE4.40305@opengridcomputing.com> Message-ID: <1233158310.5637.44.camel@firewall.xsintricity.com> On Wed, 2009-01-28 at 09:38 -0600, Steve Wise wrote: > Doug Ledford wrote: > > On Thu, 2009-01-22 at 16:07 -0600, Steve Wise wrote: > > > >> I understand the desire to not release new features in a point release, > >> but at the same time, these features are ready or near ready now. And > >> prior features have definitely been released in point releases. > >> (connectX for example). Another key point is that these features do not > >> need the kernel rebase that will happen with ofed-1.5, which will take > >> months... > >> > >> Just more thoughts. :) > >> > > > > I'm a bit late to this discussion, and you may have already talked about > > this in the ewg teleconference, but I want to throw in my thoughts. > > > > As far as new features goes, adding ConnectX support in a point release > > is a huge difference from switching OpenMPI releases from a stable > > series to the .0 release of the next series. In the case of ConnectX, > > it was "just another driver" and its addition should have had almost 0 > > impact on anyone not using that driver. On the other hand, switching > > OpenMPI versions changes the OpenMPI stack for everyone and has the > > potential to create wide spread regressions should something go wrong. > > So the risk factor comparison between these two actions simply isn't > > valid. 
One doesn't risk regressions for non-ConnectX users, one risks > > regressions for everyone using OpenSM. > > > > > Good points. > > One way to alleviate this is to ship both 1.2.8 and 1.3 in ofed-1.4.1 > and mark 1.3 as "experimental". Then remove 1.2.8 in ofed-1.5 and make > 1.3.x the production version for ofed-1.5. That's certainly doable IMO. > I suggested this in the last conf call but folks didn't like the thought > of testing both. But perhaps marking it "experimental" resolves this > issue? So the iWARP vendors will test 1.3 and little to no testing is > required for 1.2.8 since it has been qualified with ofed-1.4 QA. What about adding some automated tests using mpitests? Both automated build tests (which does some amount of testing of the mpicc et. al. wrappers) and run tests (which would require a slightly more sophisticated test harness in that it at least needs to know about machines to run the tests over, etc)? In fact, while I'm at it, let me attach my Makefile patch I use against the mpitests-3.1 package in OFED 1.4. It greatly simplifies the make environment and does something that I think the mpitests package *should* do but currently doesn't without my patch: test the mpicc wrappers. The current Makefiles set all sorts of MPIHOME and CC and other variables...these are all things that mpicc *should* take care of for you and *not* using plain mpicc in the mpitests Makefiles simply ignores one aspect of the testing that is perfectly valid and means you have to validate your mpi build environment separately. I would suggest that this patch, or something like it, be applied to the build environment for mpitests. Is the person responsible for that tarball on these lists? -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: mpitests-2.0-make.patch Type: text/x-patch Size: 5605 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From chien.tin.tung at intel.com Wed Jan 28 08:36:43 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 28 Jan 2009 09:36:43 -0700 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090127161709.25072f82@extreme> References: <497EF9AC.70104@poczta.onet.pl> <20090127.160750.120120703.davem@davemloft.net> <20090127161709.25072f82@extreme> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA38303206977CD@azsmsx501.amr.corp.intel.com> Thank you for all the feedback. We are looking into the issue. Chien From rdreier at cisco.com Wed Jan 28 10:05:29 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 10:05:29 -0800 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090127.160750.120120703.davem@davemloft.net> (David Miller's message of "Tue, 27 Jan 2009 16:07:50 -0800 (PST)") References: <497EF9AC.70104@poczta.onet.pl> <20090127.160750.120120703.davem@davemloft.net> Message-ID: > > but actually I still don't see how it's safe for a net driver to > > call skb_linearize() from its transmit routine, since there's a > > chance that that will unconditionally enable BHs? > > It's simply not allowed. > > dev_queue_xmit() at a higher level can do __skb_linearize() > because it does so before doing the rcu_read_lock_bh(). OK, thanks... what confused me is that several other drivers also do skb_linearize() in their hard_start_xmit method... eg bnx2x, via-velocity, mv643xx_eth. So there are several other lurking bugs to deal with here I guess. - R. 
From sashak at voltaire.com Wed Jan 28 10:19:24 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 28 Jan 2009 20:19:24 +0200
Subject: [ofa-general] Re: problems using smpdump
In-Reply-To:
References:
Message-ID: <20090128181924.GF8534@sashak.voltaire.com>

Hi Or,

On 16:44 Wed 28 Jan , Or Gerlitz wrote:
>
> I'm having some problems with smpdump when used with the Mellanox IS4
> switch, for example for nodeinfo (0x11), both smpquery and smpdump when
> run against HCA in this form
>
> $ smpquery nodeinfo LID
> $ smpdump LID 0x11
>
> produce the --same-- response mad, but when run against IS4 smpdump doesn't
> return anything. Invoking both with -ddd I noticed these two lines:
>
> ibwarn: [30549] umad_set_addr: umad 0xae0c010 dlid 12 dqp 0 sl 65535, qkey 0
> ibwarn: [30549] umad_addr_dump: qpn 0 qkey 0x0 lid 0xc sl 255
>
> so the SLs used by smpdump are weird (255 and 65535), is it just
> print-errors or can suggest a possible explanation for the failure?

It is a bug in smpdump (not just print error), it tries to set sl
explicitly to 0xffff when using LID routed MADs (so I suppose direct
routed MAD will work:

  smpdump -D 0,2 0x11

). And seems this bug was from day "0" of smpdump.
Guess that this should be a patch:

diff --git a/infiniband-diags/src/smpdump.c b/infiniband-diags/src/smpdump.c
index f195690..8618121 100644
--- a/infiniband-diags/src/smpdump.c
+++ b/infiniband-diags/src/smpdump.c
@@ -121,7 +121,7 @@ smp_get_init(void *umad, int lid, int attr, int mod)
 	smp->attr_mod = htonl(mod);
 	smp->tid = htonll(drmad_tid++);
 
-	umad_set_addr(umad, lid, 0, 0xffff, 0);
+	umad_set_addr(umad, lid, 0, 0, 0);
 }
 
 void

Sasha

From chu11 at llnl.gov Wed Jan 28 10:41:49 2009
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 28 Jan 2009 10:41:49 -0800
Subject: [ofa-general] [ipoib][patch] handle pkey input to create_child and delete_child consistently
Message-ID: <1233168109.570.88.camel@auk31.llnl.gov>

(Didn't get a response from an earlier post, this is a repost w/ a
rebased patch)

I noticed that the pkey is handled differently between ipoib's
create_child and delete_child functions. So a user can create an
interface with a pkey, but not delete it with the same pkey. Sort of
makes it confusing for the average person.

# /sys/class/net/ib0 > echo 0x6fff > create_child
# /sys/class/net/ib0 > echo 0x6fff > delete_child
-bash: echo: write error: No such file or directory
# /sys/class/net/ib0 > echo 0xefff > delete_child
# /sys/class/net/ib0 >

The attached patch simply bitwise-ORs the full membership bit into the
pkey given to delete_child, for consistency with create_child. A check
for a valid full-membership bit on the create_child function would be
fine as well, but IMO this is the less confusing option (and is
backwards compatible with any scripts people have already written).

Al

--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Set-the-pkey-full-membership-bit-so-the-input-to-del.patch
Type: application/mbox
Size: 1148 bytes
Desc: not available
URL:

From chu11 at llnl.gov Wed Jan 28 10:41:56 2009
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 28 Jan 2009 10:41:56 -0800
Subject: [ofa-general] [ipoib][patch] support default_pkey module option
Message-ID: <1233168116.570.89.camel@auk31.llnl.gov>

(Didn't get a response from an earlier post, this is a repost w/ a
rebased patch)

Hi all,

As far as I can tell, the only way to create an ipoib interface w/ a
non-default pkey is to:

1) go to /sys/class/net/ibX/
2) echo $MY_NEW_PKEY > create_child
3) then bring up the ibX.MY_NEW_PKEY interface
3a) assuming I don't want the original ib0, bring it down (although
    leaving it up may not harm anything, I guess)

It seems somewhat cumbersome for an administrator to script this all up
if they only want 1 ipoib interface w/ a non-default pkey.

The attached patch adds a module option called "default_pkey" to allow
ipoib to default to a different pkey. If the option is not set, it
still uses the pkey at index 0.

Al

--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-support-ipoib-default_pkey-module-option-to-suppor.patch
Type: application/mbox
Size: 1935 bytes
Desc: not available
URL:

From rdreier at cisco.com Wed Jan 28 10:53:12 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 28 Jan 2009 10:53:12 -0800
Subject: [ofa-general] Re: IPoIB kernel Oops -- possible race condition identified.
In-Reply-To: <200901271107.03074.jackm@dev.mellanox.co.il> (Jack Morgenstein's message of "Tue, 27 Jan 2009 11:07:02 +0200")
References: <200901261741.08824.jackm@dev.mellanox.co.il> <497DEC1A.2030104@Voltaire.COM> <200901271107.03074.jackm@dev.mellanox.co.il>
Message-ID:

> -	priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
> +	spin_lock_irq(&priv->lock);
> +	if (priv->broadcast)
> +		priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
> +	else
> +		priv->mcast_mtu = priv->admin_mtu;

Looks reasonable, except what about the code shortly before that in
ipoib_mcast_join_task():

	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
			ipoib_mcast_join(dev, priv->broadcast, 0);
		return;
	}

is there any reason why that is safe without locking around the use of
priv->broadcast? (Fixing that place looks harder, since
ipoib_mcast_join() currently may sleep)

 - R.

From shemminger at vyatta.com Wed Jan 28 11:05:14 2009
From: shemminger at vyatta.com (Stephen Hemminger)
Date: Wed, 28 Jan 2009 11:05:14 -0800
Subject: [ofa-general] NetEffect, iw_nes and kernel warning
In-Reply-To:
References: <497EF9AC.70104@poczta.onet.pl> <20090127.160750.120120703.davem@davemloft.net>
Message-ID: <20090128110514.2191c7a8@extreme>

On Wed, 28 Jan 2009 10:05:29 -0800 Roland Dreier wrote:
> > > but actually I still don't see how it's safe for a net driver to
> > > call skb_linearize() from its transmit routine, since there's a
> > > chance that that will unconditionally enable BHs?
> >
> > It's simply not allowed.
> >
> > dev_queue_xmit() at a higher level can do __skb_linearize()
> > because it does so before doing the rcu_read_lock_bh().
>
> OK, thanks... what confused me is that several other drivers also do
> skb_linearize() in their hard_start_xmit method... eg bnx2x,
> via-velocity, mv643xx_eth. So there are several other lurking bugs to
> deal with here I guess.
>
> - R.

They all look like lurking (and untested) bug paths. mv643xx is especially
bad since it can leak skb. But it should be possible to call pull_tail
if bh is disabled (as long as irqs are enabled).

From cameron at harr.org Wed Jan 28 11:58:22 2009
From: cameron at harr.org (Cameron Harr)
Date: Wed, 28 Jan 2009 12:58:22 -0700
Subject: [Scst-devel] [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <4970F014.2030101@vlnb.net>
References: <48E386F6.5040502@fusionio.com> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> <490B45ED.3020203@vlnb.net> <4910A622.4050906@harr.org> <4911D827.10705@vlnb.net> <49121715.4040804@harr.org> <4912C684.5000505@vlnb.net> <491307C7.50008@harr.org> <49131A85.2010102@vlnb.net> <49189567.1010804@harr.org> <49258122.6040808@vlnb.net> <496687DA.6010707@harr.org> <496B98DF.4050305@vlnb.net> <496BD8CA.7050503@harr.org> <496C81E3.2050105@vlnb.net> <496CC493.3040207@harr.org> <496CD883.8040906@vlnb.net> <496CDFE0.2030601@harr.org> <4970F014.2030101@vlnb.net>
Message-ID: <4980B8DE.3060806@harr.org>

I've attached a spreadsheet with some of my findings. In the Summary tab,
I have a baseline with no affinity set. For the other 5 tests, see below.

Vladislav Bolkhovitin wrote:
> Try the following variants:
>
> 1. Affine IRQ 82, scsi_tgt0 to CPU0, fct0-worker to CPU2, IRQs 169 and
> 177 to CPU4, scsi_tgt1 to CPU1, fct1-worker to CPU3, scsi_tgt2 to
> CPU5, fct2-worker to CPU7
>
> 2. Affine IRQ 82 to CPU0, fct0-worker to CPU2, IRQs 169 and 177 to
> CPU4, fct1-worker to CPU3, fct2-worker to CPU7, no affinity for other
> processes.
>
> 3. Affine IRQ 82 to CPU0, IRQs 169 and 177 to CPU4, fct1-worker's to
> all CPUs, except CPU0 and CPU4, no affinity for other processes.

These are tests 1, 2 and 3, respectively.

>
> Or other similar variants you'd like (even CPUs relate to physical
> CPU0, odd CPUs relate to physical CPU1). For instance, you can try to
> affine IRQs 169 and 177 to CPU1.

I did two other tests (Tests 4 and 5), which have the mlx4_core (comp)
IRQ (formerly known as IRQ 82) pinned to CPU0, the two ioDrive IRQs
(169, 177) pinned to CPU4, fct0 and scsi_tgt0 on CPUs 2&3, and fct1 and
scsi_tgt1 on either CPUs 4&6 (Test 4) or CPUs 5&6 (Test 5).

>
> No points to run for srptthread=1, for it just produce a baseline with
> no affinity at all.

I ran with these anyway to look at differences among the tests. Having
this thread enabled always results in better performance.

>
> Please do each run several times and write down an average result
> between runs and approximate variation between them in %%. Otherwise
> we can't make any reliable conclusions.

I ran each test 3 times and took the averages. To get a quick look at
performance per run, I added a column in the summary that sums the IOPS
for each test, first with the SRPT thread enabled and then with it
disabled. Test 4 seems to give the best results. Here's a brief summary
of that summary with just SRPT thread=0:

Baseline: 356226.39
Test 1:   371217.6533
Test 2:   370553.78
Test 3:   373295.2033
Test 4:   399385.2233
Test 5:   393204.5833
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SRP-affinity-tests.xls
Type: application/vnd.ms-excel
Size: 39936 bytes
Desc: not available
URL:

From or.gerlitz at gmail.com Wed Jan 28 13:48:02 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 28 Jan 2009 23:48:02 +0200
Subject: [ofa-general] Re: problems using smpdump
In-Reply-To: <20090128181924.GF8534@sashak.voltaire.com>
References: <20090128181924.GF8534@sashak.voltaire.com>
Message-ID: <15ddcffd0901281348o5ef29b46r15c22aaef2f44e12@mail.gmail.com>

Sasha Khapyorsky wrote:
> It is a bug in smpdump (not just print error), it tries to set sl explicitly to 0xffff
> when using LID routed MADs (so I suppose direct routed MAD will work
> ). And seems this bug was from day "0" of smpdump.

Yes, it works fine with direct route, thanks for the quick resolution!

> Guess that this should be a patch:

I don't have the spec in front of me, but looking at the LRH structure
in include/rdma/ib_pack.h the SL field spans eight bits, so I think
there may be another bug somewhere in the umad code which lets the
consumer set a value larger than 0xff.

Or.

From rdreier at cisco.com Wed Jan 28 13:52:15 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 28 Jan 2009 13:52:15 -0800
Subject: [ofa-general] NetEffect, iw_nes and kernel warning
In-Reply-To: <20090128110514.2191c7a8@extreme> (Stephen Hemminger's message of "Wed, 28 Jan 2009 11:05:14 -0800")
References: <497EF9AC.70104@poczta.onet.pl> <20090127.160750.120120703.davem@davemloft.net> <20090128110514.2191c7a8@extreme>
Message-ID:

> > OK, thanks... what confused me is that several other drivers also do
> > skb_linearize() in their hard_start_xmit method... eg bnx2x,
> > via-velocity, mv643xx_eth. So there are several other lurking bugs to
> > deal with here I guess.

> They all look like lurking (and untested) bug paths. mv643xx is especially
> bad since it can leak skb. But it should be possible to call pull_tail
> if bh is disabled (as long as irqs are enabled).

Yes. The only obvious problem with __pskb_pull_tail() with BHs disabled
is that with CONFIG_HIGHMEM set, it goes into kmap_skb_frag(), which
then unconditionally does local_bh_disable()/local_bh_enable(). There's
no reason in principle that kmap_skb_frag() couldn't do
local_save_flags()/local_restore_flags() instead.

Just grepping around I see other potential issues related to this, for
example the (unused but exported) function fcoe_fc_crc() does
kmap_atomic(KM_SKB_DATA_SOFTIRQ) without any particular BH disabling,
which might run into trouble if used in the wrong context...

 - R.

From sashak at voltaire.com Wed Jan 28 14:08:59 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 29 Jan 2009 00:08:59 +0200
Subject: [ofa-general] Re: problems using smpdump
In-Reply-To: <15ddcffd0901281348o5ef29b46r15c22aaef2f44e12@mail.gmail.com>
References: <20090128181924.GF8534@sashak.voltaire.com> <15ddcffd0901281348o5ef29b46r15c22aaef2f44e12@mail.gmail.com>
Message-ID: <20090128220851.GG8534@sashak.voltaire.com>

On 23:48 Wed 28 Jan , Or Gerlitz wrote:
>
> I don't have the spec in front of me, but looking at the LRH structure
> in include/rdma/ib_pack.h the SL field spans eight bits, so I think
> there may be another bug somewhere in the umad code which lets the
> consumer set a value larger than 0xff.

umad_set_addr() gets sl as 'int' (historically :( ) - the value is
truncated later.

Sasha

From brian at sun.com Wed Jan 28 14:10:48 2009
From: brian at sun.com (Brian J. Murrell)
Date: Wed, 28 Jan 2009 17:10:48 -0500
Subject: [ofa-general] OFED 1.4's autoconf.h conflicting with kernel
In-Reply-To: <497A141D.90008@nasa.gov>
References: <1232736152.7440.127.camel@pc.interlinx.bc.ca> <497A141D.90008@nasa.gov>
Message-ID: <1233180648.9119.32.camel@pc.interlinx.bc.ca>

On Fri, 2009-01-23 at 11:01 -0800, Jeff Becker wrote:
> Hi Brian

Hi (again) Jeff (and everyone else, especially whoever maintains the
packaging of /usr/src/ofa_kernel),

> Brian J. Murrell wrote:
> >
> > Some research has led me to a message
> > (http://www.mail-archive.com/general at lists.openfabrics.org/msg18161.html) from Jeff Becker back on Thu, 10 Jul 2008 15:58:53 -0700 in which he submitted a patch to integrate NFSRDMA into OFED 1.4 which is what appears to have brought these changes into OFED 1.4.

The more I look at this, the more I'm convinced there is either an angle
I am completely missing or this is just plainly not the way to do this.

It just cannot work to have two "linux/autoconf.h" files for a
third-party module build (where the first two parties are OFED and the
kernel). There is no guarantee that the third-party module won't need to
query various CONFIG_ definitions of both the kernel and the OFED stack.

The only way I can think of making this work is to "somehow" "unionize"
these two files (i.e. so there is a single "superset" of them both).
Perhaps it's doable with some kind of #include_next magic, perhaps not.

> I usually build my kernel first (usually with NFS). Then I build OFED.

Right. This is simple enough. It's when you want to build another kernel
module that wants the OFED stack that things get sticky.

I realize this is probably not really your area of responsibility and
this dual autoconf.h problem pre-existed your patch, but your patch has
really exacerbated the issue by directly conflicting with some of the
non-IB kernel defines (CONFIG_SYSCTL is the particular example I have on
hand at the moment).

I'd love to engage whoever is directly responsible for this area of the
stack but nobody seems to be responding to my queries or the bug which I
filed yesterday (which is admittedly a short time ago). I'd try just
posting a patch to fix it, but I think this needs some discussion on how
to really achieve the end goal.

Cheers,
b.
From rdreier at cisco.com Wed Jan 28 14:37:25 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 14:37:25 -0800 Subject: [ofa-general] Re: [PATCH 0/21] Reliable Datagram Sockets (RDS) In-Reply-To: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> (Andy Grover's message of "Mon, 26 Jan 2009 18:17:37 -0800") References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: > This patchset adds support for RDS as an Infiniband ULP. RDS is an > Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably, > and is used currently in Oracle RAC and Exadata products. It's lived > in OFED for 2+ years and I think it's time to get it upstream -- most > likely into your -next tree for .30, but if it snuck into .29 via the > "new code merge-window exception" then even better. I'll read this over and comment, but to be honest I agree with Dave: this is a new socket family, and as such it belongs under net/ and probably should go through Dave's tree (just as the NFS/RDMA changes went through the NFS trees and the 9p/RDMA changes went through the 9p tree); even though it heavily uses RDMA I think the upper layer interface to sockets/networking is more relevant. - R. From rdreier at cisco.com Wed Jan 28 14:57:01 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 14:57:01 -0800 Subject: [ofa-general] Re: [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: <1233022678-9259-4-git-send-email-andy.grover@oracle.com> (Andy Grover's message of "Mon, 26 Jan 2009 18:17:40 -0800") References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-4-git-send-email-andy.grover@oracle.com> Message-ID: > +EXPORT_SYMBOL_GPL(rds_cong_map_updated); What is this being exported to? AFAICT you are only building a single RDS module, right? - R. 
From rdreier at cisco.com Wed Jan 28 14:59:45 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 14:59:45 -0800 Subject: [ofa-general] Re: [PATCH 20/21] RDS: Kconfig and Makefile In-Reply-To: <1233022678-9259-21-git-send-email-andy.grover@oracle.com> (Andy Grover's message of "Mon, 26 Jan 2009 18:17:57 -0800") References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-21-git-send-email-andy.grover@oracle.com> Message-ID: > +obj-$(CONFIG_INFINIBAND_ISER) += ulp/rds/ Typo for ..._RDS > +config INFINIBAND_RDS_DEBUG > + bool "Debugging messages" > + depends on INFINIBAND_RDS > + default n No way to enable this? Disabled by default? You really want debugging messages to be built by default and controlled at runtime ... otherwise debugging end-user installations is a pain (they just install what the distro gives them, and it's very hard for them to rebuild just to enable debugging). > +ib_rds-y := af_rds.o bind.o cong.o connection.o info.o message.o \ > + recv.o send.o stats.o sysctl.o threads.o transport.o \ > + loop.o page.o rdma.o > + > +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ > + ib_sysctl.o ib_rdma.o a very strange way to write an assignment statement... - R. From rdreier at cisco.com Wed Jan 28 15:16:06 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 15:16:06 -0800 Subject: [ofa-general] Re: [PATCH] ib_mthca: Fix dispatch of IB_EVENT_LID_CHANGE In-Reply-To: <49462767.1020809@Voltaire.COM> (Moni Shoua's message of "Mon, 15 Dec 2008 11:46:15 +0200") References: <49462562.9050201@Voltaire.COM> <49462767.1020809@Voltaire.COM> Message-ID: Thanks, applied this and the mlx4 change. Comments on the changelog > This patch dispatches an event if the client_reregister bit is set. > In addition, the patch compares the LID in the MAD to the current LID. > If and only if they are not identical then a LID_CHANGE event is dispatched. 
>
> From: Moni Shoua

"From:" lines go *before* the changelog, not after, otherwise they get
stuck into the git log rather than used as the patch author.

> Signed-off-by: Moni Shoua
> Signed-off-by: Jack Morgenstein
> Signed-off-by: Yossi Etigin
>
> --

Need three dashes like "---", not just two for git to treat it as the
end of the changelog. "--" gets put into the git log rather than being
stripped.

 - R.

From rdreier at cisco.com Wed Jan 28 15:20:08 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 28 Jan 2009 15:20:08 -0800
Subject: [ofa-general] Re: potential device removal deadlock
In-Reply-To: <497E0A33.2050308@opengridcomputing.com> (Steve Wise's message of "Mon, 26 Jan 2009 13:08:35 -0600")
References: <497DF632.7060609@opengridcomputing.com> <497E0A33.2050308@opengridcomputing.com>
Message-ID:

> How could we fix this in the kernel? Perhaps ib_uverbs should post an
> async error analogous to RDMA_CM_EVENT_DEVICE_REMOVAL?
>
> Maybe IB_EVENT_DEVICE_FATAL?
>
> In the case of EEH support of iw_cxgb3, I guess the driver could post
> this event. That would at least kick all the user apps...

Having the low-level driver generate the fatal event is in fact what
mthca and mlx4 do right now... there's a certain asymmetry between IB
drivers (where RDMA CM is optional) and iWARP drivers (where RDMA CM is
mandatory), but the IB async event is the only thing that IB LLDs can
do. I guess it would make sense for cxgb3 to do the same thing.

 - R.
From rdreier at cisco.com Wed Jan 28 16:05:03 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 16:05:03 -0800 Subject: [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB In-Reply-To: <1233022678-9259-18-git-send-email-andy.grover@oracle.com> (Andy Grover's message of "Mon, 26 Jan 2009 18:17:54 -0800") References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-18-git-send-email-andy.grover@oracle.com> Message-ID: > +static int rds_ib_recv_refill_one(struct rds_connection *conn, > + struct rds_ib_recv_work *recv, > + gfp_t kptr_gfp, gfp_t page_gfp) > +{ > + struct rds_ib_connection *ic = conn->c_transport_data; > + dma_addr_t dma_addr; > + struct ib_sge *sge; > + int ret = -ENOMEM; > + > + if (recv->r_ibinc == NULL) { > + if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) { > + rds_ib_stats_inc(s_ib_rx_alloc_limit); > + goto out; > + } > + recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab, > + kptr_gfp); > + if (recv->r_ibinc == NULL) > + goto out; > + atomic_inc(&rds_ib_allocation); This is racy. You check if you're at the limit, do the allocation, and then increment the atomic rds_ib_allocation count. So many threads can pass the atomic_read() test and then take you over the limit. If you want to make it safe then you could do atomic_inc_return() and check if that took you over the limit. - R. From robert.j.woodruff at intel.com Wed Jan 28 13:23:41 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 28 Jan 2009 13:23:41 -0800 Subject: [ofa-general] RE: [ewg] [ANNOUNCE] RHEL5.3 support added to OFED-1.4 (latest daily build) In-Reply-To: <497F1B28.4010607@dev.mellanox.co.il> References: <497F1B28.4010607@dev.mellanox.co.il> Message-ID: <382A478CAD40FA4FB46605CF81FE39F41CC12339@orsmsx507.amr.corp.intel.com> I tried this on my systems, It seems to install and work fine on X86_64 platforms, but I get this compiler error on Itanium. 
woody ckport/2.6.18-EL5.3/include/linux/inetdevice.h:4, from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:37: /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18-EL5.3/include/asm/checksum.h:9: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'backport_csum_tcpudp_nofold' make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.o] Error 1 make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core] Error 2 -----Original Message----- From: ewg-bounces at lists.openfabrics.org [mailto:ewg-bounces at lists.openfabrics.org] On Behalf Of Vladimir Sokolovsky Sent: Tuesday, January 27, 2009 6:33 AM To: OF-EWG Cc: OpenFabrics General Subject: [ewg] [ANNOUNCE] RHEL5.3 support added to OFED-1.4 (latest daily build) Hi, RHEL5.3 support was added to OFED-1.4 daily builds, starting from OFED-1.4-20090127-0600. OFED-1.4 daily builds are available under: http://www.openfabrics.org/downloads/OFED/ofed-1.4-daily/ Regards, Vladimir _______________________________________________ ewg mailing list ewg at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From andy.grover at oracle.com Wed Jan 28 17:29:40 2009 From: andy.grover at oracle.com (Andy Grover) Date: Wed, 28 Jan 2009 17:29:40 -0800 Subject: [ofa-general] Re: [PATCH 0/21] Reliable Datagram Sockets (RDS) In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> Message-ID: <49810684.3010305@oracle.com> Roland Dreier wrote: > > This patchset adds support for RDS as an Infiniband ULP. RDS is an > > Oracle-originated protocol used to send IPC datagrams (up to 1MB) reliably, > > and is used currently in Oracle RAC and Exadata products. It's lived > > in OFED for 2+ years and I think it's time to get it upstream -- most > > likely into your -next tree for .30, but if it snuck into .29 via the > > "new code merge-window exception" then even better. 
> > I'll read this over and comment, but to be honest I agree with Dave: > this is a new socket family, and as such it belongs under net/ and > probably should go through Dave's tree (just as the NFS/RDMA changes > went through the NFS trees and the 9p/RDMA changes went through the 9p > tree); even though it heavily uses RDMA I think the upper layer > interface to sockets/networking is more relevant. OK no prob. I'll probably be ready early next week with an updated patchset. Thanks -- Regards -- Andy From andy.grover at oracle.com Wed Jan 28 18:19:42 2009 From: andy.grover at oracle.com (Andy Grover) Date: Wed, 28 Jan 2009 18:19:42 -0800 Subject: [ofa-general] Re: [PATCH 20/21] RDS: Kconfig and Makefile In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-21-git-send-email-andy.grover@oracle.com> Message-ID: <4981123E.2060406@oracle.com> Roland Dreier wrote: > > +obj-$(CONFIG_INFINIBAND_ISER) += ulp/rds/ > > Typo for ..._RDS Whups. :) > > +config INFINIBAND_RDS_DEBUG > > + bool "Debugging messages" > > + depends on INFINIBAND_RDS > > + default n > > No way to enable this? Disabled by default? > > You really want debugging messages to be built by default and controlled > at runtime ... otherwise debugging end-user installations is a pain > (they just install what the distro gives them, and it's very hard for > them to rebuild just to enable debugging). So the solution is just to base debug message output on a variable, instead of a config option? RDS actually does do this a little already, so converting totally isn't hard. I hadn't seen mention this was preferable -- indeed, tons of drivers and subsystems have options for compile-time debug statements, should these be converted? 
> > +ib_rds-y := af_rds.o bind.o cong.o connection.o info.o message.o \ > > + recv.o send.o stats.o sysctl.o threads.o transport.o \ > > + loop.o page.o rdma.o > > + > > +ib_rds-y += ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \ > > + ib_sysctl.o ib_rdma.o > > a very strange way to write an assignment statement... RDS is implemented as a core sockets layer and then a transport layer. IB is currently the only transport so I thought it made sense to just compile them together, but once there are >1 then RDS's IB support could be broken out into its own module. Regards -- Andy From andy.grover at oracle.com Wed Jan 28 18:20:40 2009 From: andy.grover at oracle.com (Andy Grover) Date: Wed, 28 Jan 2009 18:20:40 -0800 Subject: [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-18-git-send-email-andy.grover@oracle.com> Message-ID: <49811278.3050806@oracle.com> Roland Dreier wrote: > > +static int rds_ib_recv_refill_one(struct rds_connection *conn, > > + struct rds_ib_recv_work *recv, > > + gfp_t kptr_gfp, gfp_t page_gfp) > > +{ > > + struct rds_ib_connection *ic = conn->c_transport_data; > > + dma_addr_t dma_addr; > > + struct ib_sge *sge; > > + int ret = -ENOMEM; > > + > > + if (recv->r_ibinc == NULL) { > > + if (atomic_read(&rds_ib_allocation) >= rds_ib_sysctl_max_recv_allocation) { > > + rds_ib_stats_inc(s_ib_rx_alloc_limit); > > + goto out; > > + } > > + recv->r_ibinc = kmem_cache_alloc(rds_ib_incoming_slab, > > + kptr_gfp); > > + if (recv->r_ibinc == NULL) > > + goto out; > > + atomic_inc(&rds_ib_allocation); > > This is racy. You check if you're at the limit, do the allocation, and > then increment the atomic rds_ib_allocation count. So many threads can > pass the atomic_read() test and then take you over the limit. If you > want to make it safe then you could do atomic_inc_return() and check if > that took you over the limit. Woah, yup, thanks. 
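The increment-then-check pattern suggested above can be sketched with C11 atomics. This is a userspace illustration, not the RDS code: `demo_allocation`, `demo_max_allocation`, `demo_reserve_slot()` and `demo_release_slot()` are hypothetical stand-ins, and `atomic_fetch_add(...) + 1` is used here to play the role of the kernel's `atomic_inc_return()`.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int demo_allocation;         /* stand-in for rds_ib_allocation */
static const int demo_max_allocation = 2;  /* stand-in for the sysctl limit */

/*
 * Reserve one allocation slot. The counter is bumped first and the
 * resulting value is checked, so two threads can never both conclude
 * that they fit under the limit when only one of them does.
 */
static bool demo_reserve_slot(void)
{
    /* atomic_fetch_add returns the old value; +1 gives the new count. */
    int now = atomic_fetch_add(&demo_allocation, 1) + 1;

    if (now > demo_max_allocation) {
        atomic_fetch_sub(&demo_allocation, 1);  /* over the limit: undo */
        return false;
    }
    return true;
}

/* Give a slot back, e.g. when the object is freed or its allocation failed. */
static void demo_release_slot(void)
{
    atomic_fetch_sub(&demo_allocation, 1);
}
```

In this scheme the actual allocation would happen after a successful reservation, with `demo_release_slot()` called if the allocation itself fails. The counter may briefly read one above the limit while a losing thread undoes its increment, but it never stays there.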
-- Andy From andy.grover at oracle.com Wed Jan 28 18:39:31 2009 From: andy.grover at oracle.com (Andy Grover) Date: Wed, 28 Jan 2009 18:39:31 -0800 Subject: [ofa-general] Re: [PATCH 03/21] RDS: Congestion-handling code In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-4-git-send-email-andy.grover@oracle.com> Message-ID: <498116E3.3060506@oracle.com> Roland Dreier wrote: > > +EXPORT_SYMBOL_GPL(rds_cong_map_updated); > > What is this being exported to? AFAICT you are only building a single > RDS module, right? In the current RDS development repo, transports are modularizable. For the initial upstream submission I just wanted to include the IB transport so I changed the build to compile rds-core and rds-ib together, but didn't pull out the exports. Steve Wise is working on having the iwarp transport debugged in the near future. Once that is added and we make transports modularizable then those exports are needed. Take them out for now, then? 
Thanks -- Regards -- Andy From andy.grover at gmail.com Wed Jan 28 19:03:49 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Wed, 28 Jan 2009 19:03:49 -0800 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling In-Reply-To: <497F3640.6070600@opengridcomputing.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-7-git-send-email-andy.grover@oracle.com> <20090127133418.GH2646@ioremap.net> <200901271447.29377.oliver@neukum.org> <497F3640.6070600@opengridcomputing.com> Message-ID: On Tue, Jan 27, 2009 at 8:28 AM, Steve Wise wrote: > Oliver Neukum wrote: >> >> Am Tuesday 27 January 2009 14:34:19 schrieb Evgeniy Polyakov: >> >>> >>> On Mon, Jan 26, 2009 at 06:17:43PM -0800, Andy Grover >>> (andy.grover at oracle.com) wrote: >>> >>>> >>>> +static inline int rds_conn_is_sending(struct rds_connection *conn) >>>> +{ >>>> + int ret = 0; >>>> + >>>> + if (!mutex_trylock(&conn->c_send_lock)) >>>> + ret = 1; >>>> + else >>>> + mutex_unlock(&conn->c_send_lock); >>>> + >>>> + return ret; >>>> +} >>>> + >>>> >>> >>> This one is eventually invoked under the spin_lock with turned off irqs, >>> which may freeze the machine: >>> rds_for_each_conn_info() -> spin_lock_irqsave(global lock) -> >>> rds_conn_info_visitor() -> rds_conn_info_set() -> rds_conn_is_sending() >>> -> boom. >>> >> >> Why? This is _trylock. It won't block. >> >> > > mutex_trylock() uses spin_lock_mutex() which has this in the debug version: > > DEBUG_LOCKS_WARN_ON(in_interrupt()); What's the best way to fix this? This is all so rds-info can print out a nice list of connections, and if they're sending or not. I don't see an easy way to fix this. A _trylock-like function that didn't grab it would be nice? I can always just not report this particular bit of info, that actually might be easiest. 
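One way to report the sending state without touching the mutex at all is to mirror it in an atomic flag, which is the direction a later reply in this thread points to. The following is a minimal userspace sketch under that assumption; `struct demo_conn` and the `demo_*` helpers are hypothetical, and the places where the real code would take and drop `c_send_lock` are only marked by comments.

```c
#include <stdatomic.h>

struct demo_conn {
    atomic_int sending;  /* 1 while a sender is inside the send path */
};

/* Called by the sender right after it has acquired the send lock. */
static void demo_send_begin(struct demo_conn *c)
{
    /* the real code would hold c_send_lock at this point */
    atomic_store(&c->sending, 1);
}

/* Called by the sender right before it releases the send lock. */
static void demo_send_end(struct demo_conn *c)
{
    atomic_store(&c->sending, 0);
    /* the real code would drop c_send_lock after this */
}

/* Safe from any context: a plain atomic read, no lock is touched. */
static int demo_conn_is_sending(struct demo_conn *c)
{
    return atomic_load(&c->sending);
}
```

The value is only a snapshot and may be stale by the time it is printed, which seems acceptable for an informational listing.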
Regards -- Andy From andy.grover at gmail.com Wed Jan 28 19:17:05 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Wed, 28 Jan 2009 19:17:05 -0800 Subject: [ofa-general] Re: [PATCH 01/21] RDS: Socket interface In-Reply-To: <20090126194619.02c9557e@extreme> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com> <20090126194619.02c9557e@extreme> Message-ID: On Mon, Jan 26, 2009 at 7:46 PM, Stephen Hemminger wrote: >> +/* this is just used for stats gathering :/ */ > > Then I would think a high speed protocol would use per-cpu > and/or rcu. For a spinlock guarding a socket list? I wouldn't think it would be worth the complexity. [other comments snipped, will fix per your advice, thanks!] Regards -- Andy From andy.grover at gmail.com Wed Jan 28 20:02:49 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Wed, 28 Jan 2009 20:02:49 -0800 Subject: [ofa-general] Re: [PATCH 01/21] RDS: Socket interface In-Reply-To: <20090127120840.GC2646@ioremap.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com> <20090127120840.GC2646@ioremap.net> Message-ID: On Tue, Jan 27, 2009 at 4:08 AM, Evgeniy Polyakov wrote: > Hi Andy. Hi Evgeniy thanks for your time in reviewing. >> +/* this is just used for stats gathering :/ */ > > Shouldn't this be some kind of per-cpu data? > Global list of all sockets? This does not scale, maybe it should be > groupped into hash table or be per-device? sch mentioned this too... is socket creation often a bottleneck? If so we can certainly improve scalability here. In any case, this is in the code to support a listing of RDS sockets via the rds-info utility. Instead of having our own custom program to list rds sockets we probably want to export an interface so netstat will list them. 
Unfortunately netstat seems to be hardcoded to look for particular entries in /proc/net, so both rds and netstat would need to be updated before this would work, and RDS's custom socket-listing interface dropped. >> +static int rds_release(struct socket *sock) >> +{ >> + struct sock *sk = sock->sk; >> + struct rds_sock *rs; >> + unsigned long flags; >> + >> + if (sk == NULL) >> + goto out; >> + >> + rs = rds_sk_to_rs(sk); >> + >> + sock_orphan(sk); > > Why is it needed getting socket is about to be freed? from the comments above that code: * We have to be careful about racing with the incoming path. sock_orphan() * sets SOCK_DEAD and we use that as an indicator to the rx path that new * messages shouldn't be queued. Is that an appropriate usage of sock_orphan()? > Does RDS sockets work with high number of creation/destruction > workloads? I'd guess from your comments that performance probably wouldn't be great :) >> +static unsigned int rds_poll(struct file *file, struct socket *sock, >> + poll_table *wait) >> +{ >> + struct sock *sk = sock->sk; >> + struct rds_sock *rs = rds_sk_to_rs(sk); >> + unsigned int mask = 0; >> + unsigned long flags; >> + >> + poll_wait(file, sk->sk_sleep, wait); >> + >> + poll_wait(file, &rds_poll_waitq, wait); >> + > > Are you absolutely sure that provided poll_table callback > will not do the bad things here? It is quite unusual to add several > different queues into the same head in the poll callback. > And shouldn't rds_poll_waitq be lock protected here? I don't know. I looked into the poll_wait code a little and it appeared to be designed to allow multiple. I'm not very strong in this area and would love some more expert input here. >> + read_lock_irqsave(&rs->rs_recv_lock, flags); >> + if (!rs->rs_cong_monitor) { >> + /* When a congestion map was updated, we signal POLLIN for >> + * "historical" reasons. Applications can also poll for >> + * WRBAND instead. 
*/ >> + if (rds_cong_updated_since(&rs->rs_cong_track)) >> + mask |= (POLLIN | POLLRDNORM | POLLWRBAND); >> + } else { >> + spin_lock(&rs->rs_lock); > > Is there a possibility to have lock iteraction problem with above > rs_recv_lock read lock? I didn't see anywhere where they were being acquired in reverse order, or simultaneously. This is the kind of thing that lockdep would find immediately, right? I think I've got that turned on but I'll double check. >> +#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) > > This should be dropped in the mainline tree. yup. > Hash table with the appropriate size will have faster lookup/access > times btw. No doubt. Definitely want to make this improvement at some point. >> +static struct rds_sock *rds_bind_tree_walk(__be32 addr, __be16 port, >> + struct rds_sock *insert) >> +{ >> + struct rb_node **p = &rds_bind_tree.rb_node; >> + struct rb_node *parent = NULL; >> + struct rds_sock *rs; >> + u64 cmp; >> + u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port); >> + >> + while (*p) { >> + parent = *p; >> + rs = rb_entry(parent, struct rds_sock, rs_bound_node); >> + >> + cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) | >> + be16_to_cpu(rs->rs_bound_port); >> + >> + if (needle < cmp) > > Should it use wrapping logic if some field overflows? Sorry, please explain? >> + rdsdebug("returning rs %p for %u.%u.%u.%u:%u\n", rs, NIPQUAD(addr), >> + ntohs(port)); > > Iirc there is a new %pi4 or similar format id. Yup, will do. Thanks again. 
Regards -- Andy From rdreier at cisco.com Wed Jan 28 21:14:27 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 28 Jan 2009 21:14:27 -0800 Subject: [ofa-general] Re: [PATCH 20/21] RDS: Kconfig and Makefile In-Reply-To: <4981123E.2060406@oracle.com> (Andy Grover's message of "Wed, 28 Jan 2009 18:19:42 -0800") References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-21-git-send-email-andy.grover@oracle.com> <4981123E.2060406@oracle.com> Message-ID: > So the solution is just to base debug message output on a variable, > instead of a config option? RDS actually does do this a little already, > so converting totally isn't hard. I hadn't seen mention this was > preferable -- indeed, tons of drivers and subsystems have options for > compile-time debug statements, should these be converted? My experience is definitely that compile-time switches are a big pain when you actually have to debug something that can only be reproduced on someone else's setup (which will happen once users start using your stuff). You probably can use the dynamic_printk stuff that went in recently to make this all very clean and standard. - R. From jackm at dev.mellanox.co.il Wed Jan 28 23:36:51 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Thu, 29 Jan 2009 09:36:51 +0200 Subject: [ofa-general] Re: IPoIB kernel Oops -- possible race condition identified. 
In-Reply-To: References: <200901261741.08824.jackm@dev.mellanox.co.il> <200901271107.03074.jackm@dev.mellanox.co.il> Message-ID: <200901290936.52011.jackm@dev.mellanox.co.il> On Wednesday 28 January 2009 20:53, Roland Dreier wrote: > > - priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > > + spin_lock_irq(&priv->lock); > > + if (priv->broadcast) > > + priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu)); > > + else > > + priv->mcast_mtu = priv->admin_mtu; > > Looks reasonable, except what about the code shortly before that in > ipoib_mcast_join_task(): > > if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) { > if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) > ipoib_mcast_join(dev, priv->broadcast, 0); > return; > } > > is there any reason why that is safe without locking around the using > priv->broadcast? (Fixing that place looks harder, since > ipoib_mcast_join() currently may sleep) > You're right! (I just did not notice this -- was focussed on our Oops). I also do not see any reasonable solution as yet -- as soon as we release the spinlock, priv->broadcast may become NULL if flush is going on. Furthermore, even if we copied the pointer, passing priv->broadcast as a parameter in ipoib_mcast_join is also problematic, since the pointer may become obsolete while ipoib_mcast_join is running. Seems to me that we may need a reference counter on priv->broadcast so that it does not get freed too soon (in ipoib_mcast_free). This means that all multicast objects need reference counting. Ouch!
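The reference-counting idea above can be sketched in userspace C11. This is an illustration only, not the IPoIB code: `struct demo_mcast` and the `demo_mcast_*` helpers are hypothetical, and the sketch assumes that a reference is always taken while holding the lock that protects the shared pointer (`priv->lock` in the discussion), so the object cannot be freed between reading the pointer and bumping the count.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_mcast {
    atomic_int refcnt;
    int mtu;
};

/* Allocate an object holding one reference (owned by the creator). */
static struct demo_mcast *demo_mcast_alloc(int mtu)
{
    struct demo_mcast *m = malloc(sizeof(*m));

    if (!m)
        return NULL;
    atomic_init(&m->refcnt, 1);
    m->mtu = mtu;
    return m;
}

/* Take an extra reference; the caller must hold the lock guarding the pointer. */
static struct demo_mcast *demo_mcast_get(struct demo_mcast *m)
{
    atomic_fetch_add(&m->refcnt, 1);
    return m;
}

/* Drop a reference; returns 1 if this call freed the object, 0 otherwise. */
static int demo_mcast_put(struct demo_mcast *m)
{
    if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
        free(m);
        return 1;
    }
    return 0;
}
```

With this in place, a flush that clears the shared pointer would only drop the owner's reference, and a join that is still running would keep the object alive until its own `demo_mcast_put()`.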
From zbr at ioremap.net Thu Jan 29 00:03:30 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Thu, 29 Jan 2009 11:03:30 +0300 Subject: [ofa-general] Re: [PATCH 06/21] RDS: Connection handling In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-7-git-send-email-andy.grover@oracle.com> <20090127133418.GH2646@ioremap.net> <200901271447.29377.oliver@neukum.org> <497F3640.6070600@opengridcomputing.com> Message-ID: <20090129080330.GB12197@ioremap.net> On Wed, Jan 28, 2009 at 07:03:49PM -0800, Andrew Grover (andy.grover at gmail.com) wrote: > What's the best way to fix this? > > This is all so rds-info can print out a nice list of connections, and > if they're sending or not. I don't see an easy way to fix this. A > _trylock-like function that didn't grab it would be nice? I can always > just not report this particular bit of info, that actually might be > easiest. You use atomic variables for the other cases, add another one here to mark locked connection. Looks ugly but does not crash at least. -- Evgeniy Polyakov From tziporet at dev.mellanox.co.il Thu Jan 29 01:33:45 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 29 Jan 2009 11:33:45 +0200 Subject: [ofa-general] Re: [ewg] [ANNOUNCE] RHEL5.3 support added to OFED-1.4 (latest daily build) In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F41CC12339@orsmsx507.amr.corp.intel.com> References: <497F1B28.4010607@dev.mellanox.co.il> <382A478CAD40FA4FB46605CF81FE39F41CC12339@orsmsx507.amr.corp.intel.com> Message-ID: <498177F9.60107@mellanox.co.il> Woodruff, Robert J wrote: > I tried this on my systems, It seems to install and work fine on > X86_64 platforms, but I get this compiler error on Itanium. 
> > woody > > ckport/2.6.18-EL5.3/include/linux/inetdevice.h:4, > from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:37: > /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18-EL5.3/include/asm/checksum.h:9: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'backport_csum_tcpudp_nofold' > make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.o] Error 1 > make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core] Error 2 > We don't have an Itanium system with this OS here. Can someone from your team solve this and send us the relevant backport patches? Tziporet From tziporet at dev.mellanox.co.il Thu Jan 29 01:37:01 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 29 Jan 2009 11:37:01 +0200 Subject: [ofa-general] Re: [ewg] RE: RHEL 5.3 and OFED 1.4.x In-Reply-To: <49807BE4.40305@opengridcomputing.com> References: <382A478CAD40FA4FB46605CF81FE39F41C6FB9E9@orsmsx507.amr.corp.intel.com> <4978E8FB.5040909@opengridcomputing.com> <382A478CAD40FA4FB46605CF81FE39F41C6FBB78@orsmsx507.amr.corp.intel.com> <4978EE0E.5050209@opengridcomputing.com> <1233151056.5637.2.camel@firewall.xsintricity.com> <49807BE4.40305@opengridcomputing.com> Message-ID: <498178BD.6000009@mellanox.co.il> Steve Wise wrote: > > One way to alleviate this is to ship both 1.2.8 and 1.3 in ofed-1.4.1 > and mark 1.3 as "experimental". Then remove 1.2.8 in ofed-1.5 and > make 1.3.x the production version for ofed-1.5. > > I suggested this in the last conf call but folks didn't like the > thought of testing both. But perhaps marking it "experimental" > resolves this issue? So the iWARP vendors will test 1.3 and little to > no testing is required for 1.2.8 since it has been qualified with > ofed-1.4 QA. > We do not wish to have two MPI revisions in the same OFED package. We talked about it several times and declined this option.
Tziporet From vlad at lists.openfabrics.org Thu Jan 29 03:13:49 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 29 Jan 2009 03:13:49 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090129-0200 daily build status Message-ID: <20090129111350.3229AE61182@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Thu Jan 29 03:43:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 13:43:21 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: fix port encoding in PortInfoRecord Message-ID: <20090129114321.GH8534@sashak.voltaire.com> PortNum field in PortInfoRecord has length 8 bit so initial value should not be converted to network byte order. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 9dd3bdb..0ba2d7f 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1121,7 +1121,7 @@ static int query_portinfo_records(const struct query_cmd *q, comp_mask |= IB_PIR_COMPMASK_LID; } if (port >= 0) { - pir.port_num = cl_hton16(port); + pir.port_num = port; comp_mask |= IB_PIR_COMPMASK_PORTNUM; } -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Thu Jan 29 03:52:56 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 13:52:56 +0200 Subject: [ofa-general] [PATCH] infiniband-diags: fix lid setup in NodeRecord Message-ID: <20090129115256.GI8534@sashak.voltaire.com> Lid value should be initialized to prevent garbage encoding in NodeRecord query. 
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index c62bf42..84f3c91 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1059,7 +1059,7 @@ static int query_node_records(const struct query_cmd *q, { ib_node_record_t nr; ib_net64_t comp_mask = 0; - int lid; + int lid = 0; ib_api_status_t status; if (argc > 0) -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Thu Jan 29 05:04:07 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 15:04:07 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: fix encoding of SA queries Message-ID: <20090129130407.GK8534@sashak.voltaire.com> Various queries encoding fixes: - PortInfoRecord: PortNum has length 8 bit and should not be converted to network byte order. - NodeRecord: initialize lid value to prevent garbage in a query. - PkeyTableRecord: BlockNumber has 16 bit length and should be encoded in network byte order. Signed-off-by: Sasha Khapyorsky --- I found one more encoding bug and decided to merge this with previously posted similar patches.
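Why the 16-bit conversion corrupts an 8-bit PortNum field can be shown with a small worked example. This is a sketch under an explicit assumption: on a little-endian host a 16-bit host-to-network conversion amounts to a byte swap, which is modeled here by the hypothetical helper `demo_swap16()` so that the result does not depend on the machine running it. The `demo_*` names are illustrative and are not part of saquery.

```c
#include <stdint.h>

/* Models a 16-bit host-to-network conversion on a little-endian host. */
static uint16_t demo_swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * Buggy encoding: the swapped 16-bit value is truncated into an 8-bit
 * field, so only the low byte of the swapped value survives. For any
 * port below 256 that low byte is zero.
 */
static uint8_t demo_port_num_buggy(int port)
{
    return (uint8_t)demo_swap16((uint16_t)port);
}

/* Fixed encoding: a single byte has no byte order, so store it as is. */
static uint8_t demo_port_num_fixed(int port)
{
    return (uint8_t)port;
}
```

Under that assumption, port 1 becomes 0x0100 after the swap and is then truncated to 0, so the query would ask for port 0 instead of port 1.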
infiniband-diags/src/saquery.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 9dd3bdb..e6a8d52 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -1078,7 +1078,7 @@ static int query_node_records(const struct query_cmd *q, { ib_node_record_t nr; ib_net64_t comp_mask = 0; - int lid; + int lid = 0; ib_api_status_t status; if (argc > 0) @@ -1121,7 +1121,7 @@ static int query_portinfo_records(const struct query_cmd *q, comp_mask |= IB_PIR_COMPMASK_LID; } if (port >= 0) { - pir.port_num = cl_hton16(port); + pir.port_num = port; comp_mask |= IB_PIR_COMPMASK_PORTNUM; } @@ -1316,7 +1316,7 @@ static int query_pkey_tbl_records(const struct query_cmd *q, comp_mask |= IB_PKEY_COMPMASK_PORT; } if (block >= 0) { - pktr.block_num = block; + pktr.block_num = cl_hton16(block); comp_mask |= IB_PKEY_COMPMASK_BLOCK; } -- 1.6.0.4.766.g6fc4a From PHF at zurich.ibm.com Thu Jan 29 05:47:39 2009 From: PHF at zurich.ibm.com (Philip Frey1) Date: Thu, 29 Jan 2009 14:47:39 +0100 Subject: [ofa-general] Re: 0-length RDMA Read In-Reply-To: <49807AB1.3070308@opengridcomputing.com> References: <49807AB1.3070308@opengridcomputing.com> Message-ID: Steve, many thanks for your quick response! I read about this P2P feature but did not know how to activate it. In the meantime I decided to do a 0-length RDMA Write from the initiator - that worked right away and has the same effect. Cheers, Philip Steve Wise wrote on 01/28/2009 04:33:05 PM: > [image removed] > > Re: 0-length RDMA Read > > Steve Wise > > to: > > Philip Frey1 > > 01/28/2009 04:34 PM > > Cc: > > general, Felix Marti > > This looks like a bug. The lib assumes an SGE entry will be provided. > > A workaround for now is to set num_sge to 1 and initialize the sge entry to: > > sge.addr 0 > sge.lkey 2 > sge.length 0 > rkey 2 > remote_addr 0 > > I'll fix this in libcxgb3 to allow num_sge == 0 to mean 0B read. 
Also, > right now you need to specify non-zero (and yet still valid possible) > lkey and rkey values. I'll fix this too so if length is 0 or num_sge is > 0, then the library will create a valid 0B read request for you ignoring > the other fields. I opened bug 1496 for this: > https://bugs.openfabrics.org/show_bug.cgi?id=1496 > > Lemme know if you want a new libcxgb3 tarball with the fix. > > By the way, as of ofed-1.3.1 and 2.6.27 kernels, iw_cxgb3 supports a > mode where it handles this client-must-send-first issue for you. There > is a module option called peer2peer. Set it to 1 and all subsequent > connections will handle this by doing a 0B read from the client. > > 'echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer' will do the trick... > > Steve. > > > Philip Frey1 wrote: > > > > Hi, > > > > since for iWARP, that after the MPA connection establishment, the > > MPA initiator must send the first FPDU, I wanted to do that using a > > 0-length > > RDMA Read. When using the T3 Chelsio RNIC, I end up with a segmentation > > fault from the libcxgb3 (function t3b_post_send). > > > > I was trying to post a 0-lenght WR for RDMA Read like this: > > > > struct ibv_send_wr wr; > > > > wr.wr_id = 1; > > wr.next = NULL; > > wr.sg_list = NULL; > > wr.num_sge = 0; > > wr.wr.rdma.remote_addr = 0; > > wr.wr.rdma.rkey = 0; > > wr.opcode = IBV_WR_RDMA_READ; > > wr.send_flags = IBV_SEND_SIGNALED; > > > > ibv_post_send(qp, &wr, &bad_wr); > > > > Question1: Are 0-length RDMA Reads supported at all by the T3? > > Question2: If they are, how do I have to write a correct send WR? > > > > Many thanks for your advice, > > Philip > > > > > > > > -- > > Philip Frey > > IBM Zurich Research Laboratory > > Saumerstrasse 4 | Phone: +41 44 > > 724 8613 > > CH-8803 Rueschlikon/Switzerland | Email: phf at zurich.ibm.com > -------------- next part -------------- An HTML attachment was scrubbed... 
From eli at dev.mellanox.co.il Thu Jan 29 07:24:25 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 29 Jan 2009 17:24:25 +0200 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: References: <20090125094506.GA19444@mtls03> <20090126182840.GA22015@mtls03> Message-ID: <20090129152425.GA30933@mtls03> On Mon, Jan 26, 2009 at 10:26:25AM -0800, Roland Dreier wrote: > > It has to be saved either at the low level driver's mr object, > > e.g. struct mlx4_ib_mr, or at a common place like struct ib_umem. Do > > you prefer that it will be saved in struct mlx4_ib_mr? > > I don't see why it has to be saved anywhere? The only place you use > umem->address is in handle_hugetlb_usermr(), and you could just as > easily pass in start directly as a parameter (since > mlx4_ib_reg_user_mr() has that value in a parameter anyway). > Sorry for the delayed response. I see... you're right - no need to stick the address into struct ib_umem. Following this email is a new patch for mlx4_ib only. I excluded support for both powerpc and ia64 since I could not find a way to get HPAGE_SIZE (or HPAGE_SHIFT) for them. From eli at mellanox.co.il Thu Jan 29 07:27:25 2009 From: eli at mellanox.co.il (Eli Cohen) Date: Thu, 29 Jan 2009 17:27:25 +0200 Subject: [ofa-general] [PATCH v3] mlx4_ib: Optimize hugetlb pages support Message-ID: <20090129152725.GA26284@mtls03> Since Linux may not merge adjacent pages into a single scatter entry through calls to dma_map_sg(), we check the special case of hugetlb pages which are likely to be mapped to contiguous dma addresses and if they are, take advantage of this. This will result in a significantly lower number of MTT segments used for registering hugetlb memory regions.
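The coalescing step described above can be sketched in userspace C. This is a simplified model of the idea rather than the driver code: `struct demo_seg` and `demo_coalesce()` are hypothetical, segments are plain (address, length) pairs instead of scatterlist entries, and a trailing run shorter than a huge page is simply not emitted.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_seg {
    uint64_t addr;  /* DMA address of the segment */
    uint64_t len;   /* length of the segment in bytes */
};

/*
 * Merge consecutive segments into runs of exactly hpage_size bytes and
 * store the base address of each run in out[]. Returns the number of
 * huge pages found, or -1 if the segments are not contiguous, a run
 * overshoots a huge page boundary, or out[] is too small.
 */
static int demo_coalesce(const struct demo_seg *segs, size_t nsegs,
                         uint64_t hpage_size, uint64_t *out, size_t out_cap)
{
    uint64_t cur_addr = 0, cur_size = 0;
    size_t i, j = 0;

    for (i = 0; i < nsegs; i++) {
        if (cur_size == 0) {
            /* start a new run */
            cur_addr = segs[i].addr;
            cur_size = segs[i].len;
        } else if (cur_addr + cur_size != segs[i].addr) {
            return -1;  /* hole inside a huge page */
        } else {
            cur_size += segs[i].len;
        }

        if (cur_size > hpage_size)
            return -1;  /* run spills over the huge page */
        if (cur_size == hpage_size) {
            if (j == out_cap)
                return -1;
            out[j++] = cur_addr;  /* one entry per huge page */
            cur_size = 0;
        }
    }
    return (int)j;
}
```

For example, with a 2 MB huge page size, two adjacent 1 MB segments followed by one 2 MB segment collapse into two entries, which is where the saving in translation entries comes from. On failure a caller can fall back to the ordinary per-page path.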
Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx4/mr.c | 81 ++++++++++++++++++++++++++++++++++---- 1 files changed, 72 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c index 8e4d26d..6e898a9 100644 --- a/drivers/infiniband/hw/mlx4/mr.c +++ b/drivers/infiniband/hw/mlx4/mr.c @@ -119,6 +119,66 @@ out: return err; } +static int handle_hugetlb_user_mr(struct ib_pd *pd, struct mlx4_ib_mr *mr, + u64 start, u64 virt_addr, int access_flags) +{ +#if defined(CONFIG_HUGETLB_PAGE) && !defined(__powerpc__) && !defined(__ia64__) + struct mlx4_ib_dev *dev = to_mdev(pd->device); + struct ib_umem_chunk *chunk; + unsigned dsize; + dma_addr_t daddr; + unsigned cur_size = 0; + dma_addr_t uninitialized_var(cur_addr); + int n; + struct ib_umem *umem = mr->umem; + u64 *arr; + int err = 0; + int i; + int j = 0; + int off = start & (HPAGE_SIZE - 1); + + n = DIV_ROUND_UP(off + umem->length, HPAGE_SIZE); + arr = kmalloc(n * sizeof *arr, GFP_KERNEL); + if (!arr) + return -ENOMEM; + + list_for_each_entry(chunk, &umem->chunk_list, list) + for (i = 0; i < chunk->nmap; ++i) { + daddr = sg_dma_address(&chunk->page_list[i]); + dsize = sg_dma_len(&chunk->page_list[i]); + if (!cur_size) { + cur_addr = daddr; + cur_size = dsize; + } else if (cur_addr + cur_size != daddr) { + err = -EINVAL; + goto out; + } else + cur_size += dsize; + + if (cur_size > HPAGE_SIZE) { + err = -EINVAL; + goto out; + } else if (cur_size == HPAGE_SIZE) { + cur_size = 0; + arr[j++] = cur_addr; + } + } + + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, umem->length, + convert_access(access_flags), n, HPAGE_SHIFT, &mr->mmr); + if (err) + goto out; + + err = mlx4_write_mtt(dev->dev, &mr->mmr.mtt, 0, n, arr); + +out: + kfree(arr); + return err; +#else + return -ENOSYS; +#endif +} + struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata) @@ -140,17 +200,20 @@ struct ib_mr 
*mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, goto err_free; } - n = ib_umem_page_count(mr->umem); - shift = ilog2(mr->umem->page_size); + if (!mr->umem->hugetlb || + handle_hugetlb_user_mr(pd, mr, start, virt_addr, access_flags)) { + n = ib_umem_page_count(mr->umem); + shift = ilog2(mr->umem->page_size); - err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, - convert_access(access_flags), n, shift, &mr->mmr); - if (err) - goto err_umem; + err = mlx4_mr_alloc(dev->dev, to_mpd(pd)->pdn, virt_addr, length, + convert_access(access_flags), n, shift, &mr->mmr); + if (err) + goto err_umem; - err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); - if (err) - goto err_mr; + err = mlx4_ib_umem_write_mtt(dev, &mr->mmr.mtt, mr->umem); + if (err) + goto err_mr; + } err = mlx4_mr_enable(dev->dev, &mr->mmr); if (err) -- 1.6.1 From kliteyn at dev.mellanox.co.il Thu Jan 29 07:26:37 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 29 Jan 2009 17:26:37 +0200 Subject: [ofa-general] [PATCH RFC] libibumad: give up cpu time slice if write() failed Message-ID: <4981CAAD.1090109@dev.mellanox.co.il> Hi Sasha, While running opensm on a single-core CPU I've noticed the following problem: when SA is stressed with many SA queries (such as when you run "osmtest -f f" on a multi-core CPU machine), sometimes opensm fails to send responses. It appears that the send buffer gets full, and write() fails. Since the CPU has a single core, when the sender thread retries to send the same packet, it fails again and again, because the driver didn't have the chance to transmit something from the send buffer. Then I added a 20 msec sleep after write() failure to make it give up the cpu time slice, and I saw some improvement. When I added a sleep of several seconds, the problem disappeared completely, but then of course, the client's transaction will fail with timeout unless you specifically increase the transaction timeout on the client side.
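The trade-off described above, sleeping long enough for the driver to drain the send buffer but not so long that the client times out, can be sketched as a bounded retry loop. This is a variant for illustration, not what the patch below does (the patch sleeps once and still returns an error): `demo_send_with_backoff()` and `demo_flaky_send()` are hypothetical, and the send operation is abstracted behind a callback so the sketch stays self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* A send attempt: returns 0 on success, nonzero on failure. */
typedef int (*demo_send_fn)(void *ctx);

/*
 * Try to send up to max_tries times, sleeping delay_ms milliseconds
 * after each failure so that other work (e.g. the driver draining its
 * send buffer) gets a chance to run. Returns 0 on success, -EIO if
 * every attempt failed.
 */
static int demo_send_with_backoff(demo_send_fn send, void *ctx,
                                  int max_tries, long delay_ms)
{
    struct timespec ts;
    int i;

    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < max_tries; i++) {
        if (send(ctx) == 0)
            return 0;
        nanosleep(&ts, NULL);  /* give up the CPU before retrying */
    }
    return -EIO;
}

/* Test helper: fails as many times as the int pointed to by ctx says. */
static int demo_flaky_send(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

The total worst-case delay is `max_tries * delay_ms`, which would have to stay below the client's transaction timeout for the retries to be useful.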
So, what do you think about the following patch? Signed-off-by: Yevgeny Kliteynik --- libibumad/src/umad.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c index 78b956d..a03a018 100644 --- a/libibumad/src/umad.c +++ b/libibumad/src/umad.c @@ -814,6 +814,8 @@ umad_send(int fd, int agentid, void *umad, int length, DEBUG("write returned %d != sizeof umad %zu + length %d (%m)", n, umad_size(), length); + + usleep(20000); if (!errno) errno = EIO; return -EIO; -- 1.5.1.4 From rdreier at cisco.com Thu Jan 29 07:33:22 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 Jan 2009 07:33:22 -0800 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: <20090129152425.GA30933@mtls03> (Eli Cohen's message of "Thu, 29 Jan 2009 17:24:25 +0200") References: <20090125094506.GA19444@mtls03> <20090126182840.GA22015@mtls03> <20090129152425.GA30933@mtls03> Message-ID: > I see... you're right - no need to stick the address into struct > ib_umem. Following this email is a new patch for mlx4_ib only. I > excluded support for both powerpc and ia64 since I could not find a > way to get HPAGE_SIZE (or HPAGE_SHIFT) for them. #include ? - R. From kliteyn at dev.mellanox.co.il Thu Jan 29 07:35:42 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 29 Jan 2009 17:35:42 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: fix full topology dump Message-ID: <4981CCCE.7030104@dev.mellanox.co.il> Sasha, Full topology dump was missing leaf switches - fixing. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index ebe6612..68900d8 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1103,7 +1103,7 @@ static void __osm_ftree_fabric_dump(ftree_fabric_t * p_ftree) __osm_ftree_hca_dump(p_ftree, p_hca); } - for (i = 0; i < p_ftree->max_switch_rank; i++) { + for (i = 0; i <= p_ftree->max_switch_rank; i++) { OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "-- Rank %u switches\n", i); for (p_sw = (ftree_sw_t *) cl_qmap_head(&p_ftree->sw_tbl); -- 1.5.1.4 From eli at dev.mellanox.co.il Thu Jan 29 07:46:42 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 29 Jan 2009 17:46:42 +0200 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: References: <20090125094506.GA19444@mtls03> <20090126182840.GA22015@mtls03> <20090129152425.GA30933@mtls03> Message-ID: <20090129154642.GA26302@mtls03> On Thu, Jan 29, 2009 at 07:33:22AM -0800, Roland Dreier wrote: > > I see... you're right - no need to stick the address into struct > > ib_umem. Following this email is a new patch for mlx4_ib only. I > > excluded support for both powerpc and ia64 since I could not find a > > way to get HPAGE_SIZE (or HPAGE_SHIFT) for them. > > #include ? > It does not help. The problem with powerpc is that HPAGE_SHIFT is an unexported variable and for ia64 it's hpage_shift. From zbr at ioremap.net Thu Jan 29 08:24:25 2009 From: zbr at ioremap.net (Evgeniy Polyakov) Date: Thu, 29 Jan 2009 19:24:25 +0300 Subject: [ofa-general] Re: [PATCH 01/21] RDS: Socket interface In-Reply-To: References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com> <20090127120840.GC2646@ioremap.net> Message-ID: <20090129162425.GA31947@ioremap.net> Hi Andy. 
On Wed, Jan 28, 2009 at 08:02:49PM -0800, Andrew Grover (andy.grover at gmail.com) wrote: > Hi Evgeniy thanks for your time in reviewing. No problem :) > >> +/* this is just used for stats gathering :/ */ > > > > Shouldn't this be some kind of per-cpu data? > > Global list of all sockets? This does not scale, maybe it should be > > groupped into hash table or be per-device? > > sch mentioned this too... is socket creation often a bottleneck? If so > we can certainly improve scalability here. It depends on the workload, but it becomes a noticeable portion of the overhead for multi-client short-lived connections. Likely these sockets will not be used for web-server-like load, but something similar will clearly show a bottleneck with this list. > In any case, this is in the code to support a listing of RDS sockets > via the rds-info utility. Instead of having our own custom program to > list rds sockets we probably want to export an interface so netstat > will list them. Unfortunately netstat seems to be hardcoded to look > for particular entries in /proc/net, so both rds and netstat would > need to be updated before this would work, and RDS's custom > socket-listing interface dropped. Other sockets use a similar technique, but they are grouped into a hash table, so if you think that the number of sockets will be noticeably large or that they will be frequently created and removed, it may be worth pushing them into a hash table. > Is that an appropriate usage of sock_orphan()? It is, I missed the process-context socket detach; things are correct. > > Does RDS sockets work with high number of creation/destruction > > workloads?
> > I'd guess from your comments that performance probably wouldn't be great :) Something tells me the same :) > >> +static unsigned int rds_poll(struct file *file, struct socket *sock, > >> + poll_table *wait) > >> +{ > >> + struct sock *sk = sock->sk; > >> + struct rds_sock *rs = rds_sk_to_rs(sk); > >> + unsigned int mask = 0; > >> + unsigned long flags; > >> + > >> + poll_wait(file, sk->sk_sleep, wait); > >> + > >> + poll_wait(file, &rds_poll_waitq, wait); > >> + > > > > Are you absolutely sure that provided poll_table callback > > will not do the bad things here? It is quite unusual to add several > > different queues into the same head in the poll callback. > > And shouldn't rds_poll_waitq be lock protected here? > > I don't know. I looked into the poll_wait code a little and it > appeared to be designed to allow multiple. > > I'm not very strong in this area and would love some more expert input here. It depends on how poll_table was initialized and how its callback (invoked from poll_wait()) operates on the given queue and head. If you introduce your own polling, some care has to be taken with the ordering of the wait queues and with what their callbacks return when a polling event is found. For example, with your own initialization it is possible that when multiple queues are registered in the same table, only one of them will be awakened (its callback invoked). If you just hook into the existing machinery things should be OK though, so this is just something to pay attention to and does not indicate a bug. > >> + read_lock_irqsave(&rs->rs_recv_lock, flags); > >> + if (!rs->rs_cong_monitor) { > >> + /* When a congestion map was updated, we signal POLLIN for > >> + * "historical" reasons. Applications can also poll for > >> + * WRBAND instead.
*/ > >> + if (rds_cong_updated_since(&rs->rs_cong_track)) > >> + mask |= (POLLIN | POLLRDNORM | POLLWRBAND); > >> + } else { > >> + spin_lock(&rs->rs_lock); > > > > Is there a possibility to have lock iteraction problem with above > > rs_recv_lock read lock? > > I didn't see anywhere where they were being acquired in reverse order, > or simultaneously. This is the kind of thing that lockdep would find > immediately, right? I think I've got that turned on but I'll double > check. If lockdep hits the bad race, then yes, it will flag it. I was just wondering, since we take the spin lock under the read lock, whether some bad interaction with the write lock may happen. > >> + cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) | > >> + be16_to_cpu(rs->rs_bound_port); > >> + > >> + if (needle < cmp) > > > > Should it use wrapping logic if some field overflows? > > Sorry, please explain? I meant that if there is an unsigned overflow this will suddenly become a small number, so network timestamping comparison logic could be used, but apparently neither the address nor the port changes during the lifetime, so nothing special is needed. -- Evgeniy Polyakov From nicolas.morey-chaisemartin at ext.bull.net Thu Jan 29 08:40:56 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Thu, 29 Jan 2009 17:40:56 +0100 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c : Fixed bug on index port order incrementation Message-ID: <4981DC18.9030400@ext.bull.net> Hello, While doing some routing analysis on a fat tree using ibsim we found a "bug" in the fat-tree algorithm. The problem happens with a 4-level fat tree as below:

                        L3 L3
     ___________________|__|____________________
    /            /                 \            \       <= All the L2 are connected on 2 L3 switches
  L2-1         L2-2               L2-1         L2-2
   /            /                   \            \      <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
  L1           L1                   L1           L1
 /|\          /|\                  /|\          /|\
==Fully mixed to L1==            ==Fully mixed to L1==  <=== We have multiple sets. In each set, all L0 lead to all L1 of their set.
  L0           L0                   L0           L0
 /  \         /  \                 /  \         /  \
CN  CN  ..  CN  CN     ....      CN  CN  ..  CN  CN

To detail: We have a bunch of sets. Each set contains compute nodes, L0 and L1 switches, plus a common top of L2 and L3 switches. In each set, there are groups of compute nodes. Each group is connected to a single L0 switch. In a given set, all L0 are connected to all L1. The Nth L1 of a set is connected to the Nth L2 and only to this one (so through an L2, the Nth L1 can only see the Nth L1 of the other sets). All the L2 are connected to a couple of L3. If we don't add the L3 switches, we have a perfectly balanced fat tree and well-balanced routes. But when we add the L3, it introduces a huge difference. As it is not necessary, no route goes through L3 (which is fine). However, 1/4 of the L2->L1 routes are not used at all, 1/2 are half used and 1/4 are overused by a factor of two (compared to the balanced state). This comes from down_port_groups_idx, which is incremented each time the algorithm goes down through a node, whether it creates routes to HCA (port != switch) or not. As a route coming up from an L1 reaches only one L2, the algorithm goes through all the other L2 while going down, incrementing their index. Our case here is a bit specific, but in any case where your L1 doesn't have full connectivity to all your L2, and there is another switch rank above, the problem may appear. To avoid this problem, I've changed the __osm_ftree_fabric_route_upgoing_by_going_down function so it returns a value to indicate whether routes to HCA (in fact to leaf switch) were created. With this information, we only increase the index when the algorithm has created routes to HCA. After applying this patch and measuring the link usage, we are at perfect equilibrium (L2<->L3 links are still not used but that is to be expected).
Signed-off-by: Nicolas Morey-Chaisemartin --- opensm/opensm/osm_ucast_ftree.c | 23 ++++++++++++++--------- 1 files changed, 14 insertions(+), 9 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: 503dc4d590f8545d1846b2daa4d39c80fc5ac018.diff Type: text/x-patch Size: 2849 bytes Desc: not available URL: From sean.hefty at intel.com Thu Jan 29 09:32:27 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 29 Jan 2009 09:32:27 -0800 Subject: [ofa-general] [PATCH RFC] libibumad: give up cpu time slice if write() falied In-Reply-To: <4981CAAD.1090109@dev.mellanox.co.il> References: <4981CAAD.1090109@dev.mellanox.co.il> Message-ID: <53D9D3CDA6834FF6B6C8500810D20A85@amr.corp.intel.com> >Signed-off-by: Yevgeny Kliteynik >--- > libibumad/src/umad.c | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > >diff --git a/libibumad/src/umad.c b/libibumad/src/umad.c >index 78b956d..a03a018 100644 >--- a/libibumad/src/umad.c >+++ b/libibumad/src/umad.c >@@ -814,6 +814,8 @@ umad_send(int fd, int agentid, void *umad, int length, > > DEBUG("write returned %d != sizeof umad %zu + length %d (%m)", > n, umad_size(), length); >+ >+ usleep(20000); > if (!errno) > errno = EIO; > return -EIO; I think this should be handled by the application, and not placed into the lowest level library. - Sean From jgunthorpe at obsidianresearch.com Thu Jan 29 09:53:54 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 29 Jan 2009 10:53:54 -0700 Subject: [ofa-general] [PATCH RFC] libibumad: give up cpu time slice if write() falied In-Reply-To: <4981CAAD.1090109@dev.mellanox.co.il> References: <4981CAAD.1090109@dev.mellanox.co.il> Message-ID: <20090129175354.GZ7618@obsidianresearch.com> On Thu, Jan 29, 2009 at 05:26:37PM +0200, Yevgeny Kliteynik wrote: > Then I added 20 msec sleep after write() failure to make it give up the cpu time > slice, and I saw some improvement. 
When I added sleep of several seconds, the problem > disappeared completely, but then of course, the client's transaction will fail with > timeout unless you specifically increase the transaction timeout on the client side. Poll or a blocking fd would be a better choice than a sleep... -- Jason Gunthorpe (780)4406067x832 Chief Technology Officer, Obsidian Research Corp Edmonton, Canada From sashak at voltaire.com Thu Jan 29 10:17:47 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 20:17:47 +0200 Subject: [ofa-general] Re: [PATCH RFC] libibumad: give up cpu time slice if write() falied In-Reply-To: <4981CAAD.1090109@dev.mellanox.co.il> References: <4981CAAD.1090109@dev.mellanox.co.il> Message-ID: <20090129181747.GM8534@sashak.voltaire.com> Hi Yevgeny, On 17:26 Thu 29 Jan , Yevgeny Kliteynik wrote: > > While running opensm on a single-core CPU I've noticed the following problem: > when SA is stressed with many SA queries (such as when you run "osmtest -f f" > on a multi-core CPU machine), sometimes opensm fails to send responses. > It appears that send buffer gets full, and write() fails. What is an errno? > Since the CPU has a single core, when the sender thread retries to send the > same packet, it fails again and again, because the driver didn't have the chance > to transmit something from the send buffer. Why not? DMA doesn't require CPU cycles. And even if it is not DMA then an user space running thread will be rescheduled some days. > Then I added 20 msec sleep after write() failure to make it give up the cpu time > slice, and I saw some improvement. When I added sleep of several seconds, the problem > disappeared completely, but then of course, the client's transaction will fail with > timeout unless you specifically increase the transaction timeout on the client side. > > So, what do you think about the following patch? 
If errno is EAGAIN (and I would expect that it is) then you can handle this in an upper layer application (if it is relevant at all), and limiting the library is not a good idea IMO. For example, how would I run a stress test or a benchmark then? And what is the meaning of this "magic" number 20 ms? Sasha From sashak at voltaire.com Thu Jan 29 10:22:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 20:22:24 +0200 Subject: [ofa-general] [PATCH RFC] libibumad: give up cpu time slice if write() falied In-Reply-To: <20090129175354.GZ7618@obsidianresearch.com> References: <4981CAAD.1090109@dev.mellanox.co.il> <20090129175354.GZ7618@obsidianresearch.com> Message-ID: <20090129182217.GN8534@sashak.voltaire.com> On 10:53 Thu 29 Jan , Jason Gunthorpe wrote: > On Thu, Jan 29, 2009 at 05:26:37PM +0200, Yevgeny Kliteynik wrote: > > > Then I added 20 msec sleep after write() failure to make it give up the cpu time > > slice, and I saw some improvement. When I added sleep of several seconds, the problem > > disappeared completely, but then of course, the client's transaction will fail with > > timeout unless you specifically increase the transaction timeout on the client side. > > Poll or a blocking fd would be a better choice than a sleep... True. (However, in libibumad the fd is opened with the O_NONBLOCK flag - I don't yet have a good understanding of why it was done that way (the change is from 2005).) Sasha From sashak at voltaire.com Thu Jan 29 10:29:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 20:29:27 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c: fix full topology dump In-Reply-To: <4981CCCE.7030104@dev.mellanox.co.il> References: <4981CCCE.7030104@dev.mellanox.co.il> Message-ID: <20090129182927.GO8534@sashak.voltaire.com> On 17:35 Thu 29 Jan , Yevgeny Kliteynik wrote: > Sasha, > > Full topology dump was missing leaf switches - fixing. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks.
Sasha From sumeet.lahorani at oracle.com Thu Jan 29 10:53:55 2009 From: sumeet.lahorani at oracle.com (Sumeet Lahorani) Date: Thu, 29 Jan 2009 10:53:55 -0800 Subject: [ofa-general] Affinitization of HCAs Message-ID: <4981FB43.30607@oracle.com> Hi, On NUMA systems with multiple HCAs, it is possible that each HCA has affinity to a different NUMA node. Is the affinity of the HCA exposed in any manner to user land? Also, do IB drivers in OFED 1.3.1 (or latest OFED) take advantage the HCA affinity? - Sumeet From sean.hefty at intel.com Thu Jan 29 11:35:31 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 29 Jan 2009 11:35:31 -0800 Subject: [ofa-general] [ib-mgmt] duplicate definitions Message-ID: Sasha, Stan came across an issue while porting saquery to Windows resulting from duplicate definitions for IB_MAD_METHOD_*. These are defined as enums in mad.h, but as #define values in ib_types.h. Do you have some preferred way to remove the duplicate defines? Can ib_types.h include mad.h and just remove the duplicates? (There may be other duplicate definition problems, but these are the only ones we've hit so far.) - Sean From andy.grover at gmail.com Thu Jan 29 12:22:49 2009 From: andy.grover at gmail.com (Andrew Grover) Date: Thu, 29 Jan 2009 12:22:49 -0800 Subject: [ofa-general] ***SPAM*** Re: [PATCH 01/21] RDS: Socket interface In-Reply-To: <20090126.201104.33429150.davem@davemloft.net> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <1233022678-9259-2-git-send-email-andy.grover@oracle.com> <20090126.201104.33429150.davem@davemloft.net> Message-ID: On Mon, Jan 26, 2009 at 8:11 PM, David Miller wrote: > Socket family implementations do not belong under the > infiniband subdirectory. > > Put it under net/ instead. > > I don't care what the interdependencies happen to be. Roland, so you're ok with infiniband code under net/ ? (or should I split it up??) 
Thanks -- Regards -- Andy From sashak at voltaire.com Thu Jan 29 12:29:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 29 Jan 2009 22:29:54 +0200 Subject: [ofa-general] Re: [ib-mgmt] duplicate definitions In-Reply-To: References: Message-ID: <20090129202954.GQ8534@sashak.voltaire.com> Hi Sean, On 11:35 Thu 29 Jan , Sean Hefty wrote: > > Stan came across an issue while porting saquery to Windows resulting from > duplicate definitions for IB_MAD_METHOD_*. These are defined as enums in mad.h, > but as #define values in ib_types.h. Why should it be a problem? ib_types.h is included in saquery.c after mad.h and as far as I understand the "defines" will work instead of the "enums". The opposite inclusion order could cause a problem, of course. > Do you have some preferred way to remove > the duplicate defines? Can ib_types.h include mad.h and just remove the > duplicates? It is not easy - libibmad and opensm are not dependent packages. Both have a rich history ( :( ) and external usage. Introducing a dependency here would be a noisy thing. And it is not really needed for any other tool (I think saquery.c is the only exception where mad.h and ib_types.h are used together). Sasha From kliteyn at dev.mellanox.co.il Thu Jan 29 13:04:43 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 29 Jan 2009 23:04:43 +0200 Subject: [ofa-general] Re: [PATCH RFC] libibumad: give up cpu time slice if write() falied In-Reply-To: <20090129181747.GM8534@sashak.voltaire.com> References: <4981CAAD.1090109@dev.mellanox.co.il> <20090129181747.GM8534@sashak.voltaire.com> Message-ID: <498219EB.2090001@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 17:26 Thu 29 Jan , Yevgeny Kliteynik wrote: >> While running opensm on a single-core CPU I've noticed the following problem: >> when SA is stressed with many SA queries (such as when you run "osmtest -f f" >> on a multi-core CPU machine), sometimes opensm fails to send responses.
>> It appears that send buffer gets full, and write() fails. > > What is an errno? write() returns -1, errno is EINVAL >> Since the CPU has a single core, when the sender thread retries to send the >> same packet, it fails again and again, because the driver didn't have the chance >> to transmit something from the send buffer. > > Why not? DMA doesn't require CPU cycles. And even if it is not DMA then > an user space running thread will be rescheduled some days. > >> Then I added 20 msec sleep after write() failure to make it give up the cpu time >> slice, and I saw some improvement. When I added sleep of several seconds, the problem >> disappeared completely, but then of course, the client's transaction will fail with >> timeout unless you specifically increase the transaction timeout on the client side. >> >> So, what do you think about the following patch? > > If errno is EAGAIN (and I would expect that it is) then you can handle > this in an upper layer application (if it is relevant at all), Since the errno is EINVAL, the application has no way knowing what really happened. But probably the errno in this specific case should be fixed - EAGAIN definitely makes more sense. > and limiting library is not a good idea IMO. OK, agree -- Yevgeny > For example how I would make > stress-test then or benchmark? And what is the meaning of this "magic" > number 20 ms? 
> > Sasha > From kliteyn at dev.mellanox.co.il Thu Jan 29 13:07:11 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 29 Jan 2009 23:07:11 +0200 Subject: [ofa-general] [PATCH RFC] libibumad: give up cpu time slice if write() falied In-Reply-To: <53D9D3CDA6834FF6B6C8500810D20A85@amr.corp.intel.com> References: <4981CAAD.1090109@dev.mellanox.co.il> <53D9D3CDA6834FF6B6C8500810D20A85@amr.corp.intel.com> Message-ID: <49821A7F.1060706@dev.mellanox.co.il> Sean Hefty wrote: >> Signed-off-by: Yevgeny Kliteynik >> + >> + usleep(20000); >> if (!errno) >> errno = EIO; >> return -EIO; > > I think this should be handled by the application, and not placed into the > lowest level library. Agree -- Yevgeny > - Sean > > From rdreier at cisco.com Thu Jan 29 13:46:04 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 Jan 2009 13:46:04 -0800 Subject: [ofa-general] Affinitization of HCAs In-Reply-To: <4981FB43.30607@oracle.com> (Sumeet Lahorani's message of "Thu, 29 Jan 2009 10:53:55 -0800") References: <4981FB43.30607@oracle.com> Message-ID: > On NUMA systems with multiple HCAs, it is possible that each HCA has > affinity to a different NUMA node. Is the affinity of the HCA exposed > in any manner to user land? /sys/devices/.../numa_node gives the affinity for every PCI device in the system. > Also, do IB drivers in OFED 1.3.1 (or latest OFED) take advantage the > HCA affinity? Not really... of course userspace could select the "closest" HCA but I don't think anyone has coded that. - R. 
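[Editorial sketch, not part of the original message] The numa_node attribute mentioned in the reply above is a small text file holding one integer, so userspace can read it with plain stdio. The following is a minimal sketch under stated assumptions: the helper name read_numa_node() is hypothetical, and the example path /sys/class/infiniband/<hca>/device/numa_node is an assumption about how the HCA's PCI device is reached on a given system; the message itself only says /sys/devices/.../numa_node.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read the NUMA node of a device from a sysfs
 * attribute file, e.g. /sys/class/infiniband/<hca>/device/numa_node
 * (that path layout is an assumption, not taken from the thread).
 *
 * Returns 0 on success and stores the value in *node. The kernel
 * reports -1 in this file when the device has no NUMA affinity, so
 * the node value is returned separately from the error status.
 * Returns -1 if the file cannot be opened or parsed.
 */
static int read_numa_node(const char *attr_path, int *node)
{
	FILE *f = fopen(attr_path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%d", node) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

An application that wanted the "closest" HCA could call this once per device and compare the result with the node its threads run on; as noted above, nothing in the stack is said to do that automatically.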
From rdreier at cisco.com Thu Jan 29 13:47:03 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 Jan 2009 13:47:03 -0800 Subject: [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB In-Reply-To: <200901292202.28625.okir@suse.de> (Olaf Kirch's message of "Thu, 29 Jan 2009 22:02:28 +0100") References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <49811278.3050806@oracle.com> <200901292202.28625.okir@suse.de> Message-ID: > > > This is racy. You check if you're at the limit, do the allocation, and > > > then increment the atomic rds_ib_allocation count. So many threads can > > > pass the atomic_read() test and then take you over the limit. If you > > > want to make it safe then you could do atomic_inc_return() and check if > > > that took you over the limit. > > > > Woah, yup, thanks. > > The refill code used to be single-threaded; and I think it still is. So > this can't race I think So you don't need the atomic op at all? - R. From rdreier at cisco.com Thu Jan 29 14:01:02 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 Jan 2009 14:01:02 -0800 Subject: [ofa-general] Re: [ewg] [PATCH] ib_core: save process's virtual address in struct ib_umem In-Reply-To: <20090129154642.GA26302@mtls03> (Eli Cohen's message of "Thu, 29 Jan 2009 17:46:42 +0200") References: <20090125094506.GA19444@mtls03> <20090126182840.GA22015@mtls03> <20090129152425.GA30933@mtls03> <20090129154642.GA26302@mtls03> Message-ID: > It does not help. The problem with powerpc is that HPAGE_SHIFT is an > unexported variable and for ia64 it's hpage_shift. I see. hpage_shift is exported on ia64, so that should be OK. And I guess for powerpc it is just a matter of adding an export so we can use it in a module. - R. 
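[Editorial sketch, not part of the original messages] The review comment quoted in the RDS/IB receive thread above describes a check-then-increment race: several threads can pass a plain read of the counter and then all increment it past the limit. The suggested fix is to increment first and test the value that the increment itself returned. Below is a minimal userspace analogue using C11 atomics, not the RDS code: allocation_count, allocation_limit, reserve_slot() and release_slot() are hypothetical names, and atomic_fetch_add(...) + 1 stands in for the kernel's atomic_inc_return().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the rds_ib_allocation counter and its cap. */
static atomic_int allocation_count;
static const int allocation_limit = 4;

/*
 * Reserve one slot. The limit check uses the value produced by the
 * atomic increment itself, so concurrent callers cannot all pass a
 * stale read and then overshoot together.
 */
static bool reserve_slot(void)
{
	/* fetch_add returns the old value; old + 1 is the new count. */
	int now = atomic_fetch_add(&allocation_count, 1) + 1;

	if (now > allocation_limit) {
		/* Over the limit: undo our increment and report failure. */
		atomic_fetch_sub(&allocation_count, 1);
		return false;
	}
	return true;
}

/* Give a previously reserved slot back. */
static void release_slot(void)
{
	atomic_fetch_sub(&allocation_count, 1);
}
```

One design caveat: between the increment and the undo the counter can briefly exceed the limit, so a concurrent caller may be refused even though a slot is about to free up. If the refill path really is single-threaded, as argued in the thread, none of this is needed and a plain integer would do.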
From richard.frank at oracle.com Thu Jan 29 14:27:13 2009 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 29 Jan 2009 17:27:13 -0500 Subject: [ofa-general] Affinitization of HCAs In-Reply-To: References: <4981FB43.30607@oracle.com> Message-ID: <49822D41.7040108@oracle.com> What are the implications for registering memory with HCAs on NUMA systems ....if any ? Roland Dreier wrote: > > On NUMA systems with multiple HCAs, it is possible that each HCA has > > affinity to a different NUMA node. Is the affinity of the HCA exposed > > in any manner to user land? > > /sys/devices/.../numa_node gives the affinity for every PCI device in > the system. > > > Also, do IB drivers in OFED 1.3.1 (or latest OFED) take advantage the > > HCA affinity? > > Not really... of course userspace could select the "closest" HCA but I > don't think anyone has coded that. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Thu Jan 29 14:49:22 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 Jan 2009 14:49:22 -0800 Subject: [ofa-general] Affinitization of HCAs In-Reply-To: <49822D41.7040108@oracle.com> (Richard Frank's message of "Thu, 29 Jan 2009 17:27:13 -0500") References: <4981FB43.30607@oracle.com> <49822D41.7040108@oracle.com> Message-ID: > What are the implications for registering memory with HCAs on NUMA > systems ....if any ? None that I can think of. - R. 
From richard.frank at oracle.com Thu Jan 29 15:29:15 2009 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 29 Jan 2009 18:29:15 -0500 Subject: [ofa-general] Affinitization of HCAs In-Reply-To: References: <4981FB43.30607@oracle.com> <49822D41.7040108@oracle.com> Message-ID: <49823BCB.2090301@oracle.com> A process on cpu A can register a buffer in memory local to cpu B with NIC that has affinity for cpu c ? Roland Dreier wrote: > > What are the implications for registering memory with HCAs on NUMA > > systems ....if any ? > > None that I can think of. > > - R. > From okir at suse.de Thu Jan 29 13:02:28 2009 From: okir at suse.de (Olaf Kirch) Date: Thu, 29 Jan 2009 22:02:28 +0100 Subject: [ofa-general] Re: [PATCH 17/21] RDS/IB: Receive datagrams via IB In-Reply-To: <49811278.3050806@oracle.com> References: <1233022678-9259-1-git-send-email-andy.grover@oracle.com> <49811278.3050806@oracle.com> Message-ID: <200901292202.28625.okir@suse.de> On Thursday 29 January 2009 03:20:40 Andy Grover wrote: > > This is racy. You check if you're at the limit, do the allocation, and > > then increment the atomic rds_ib_allocation count. So many threads can > > pass the atomic_read() test and then take you over the limit. If you > > want to make it safe then you could do atomic_inc_return() and check if > > that took you over the limit. > > Woah, yup, thanks. The refill code used to be single-threaded; and I think it still is. So this can't race I think Olaf From rdreier at cisco.com Thu Jan 29 16:17:57 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 29 Jan 2009 16:17:57 -0800 Subject: [ofa-general] Affinitization of HCAs In-Reply-To: <49823BCB.2090301@oracle.com> (Richard Frank's message of "Thu, 29 Jan 2009 18:29:15 -0500") References: <4981FB43.30607@oracle.com> <49822D41.7040108@oracle.com> <49823BCB.2090301@oracle.com> Message-ID: > A process on cpu A can register a buffer in memory local to cpu B with > NIC that has affinity for cpu c ? Sure, why not? - R. 
From jmulik at desu.edu Thu Jan 29 17:06:06 2009 From: jmulik at desu.edu (Jaiwant Mulik) Date: Thu, 29 Jan 2009 20:06:06 -0500 Subject: [ofa-general] ***SPAM*** loopback in cxgb3? Message-ID: Hi all, Looking at the list archives I found some discussion on (the lack of) hw loopback support for cxgb3. Does cxgb3 support loopback? I cannot seem to get it to work with rping. Regards, Jaiwant. ------------------------------------------------------------------ Assistant Professor Computer and Information Sciences Department Delaware State University, Dover, DE (302) 857-7910/6640, http://netlab.cis.desu.edu ------------------------------------------------------------------ Lekin woh zindagi hi kya jisme koi namumkin sapna na ho? From swise at opengridcomputing.com Thu Jan 29 17:52:33 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 29 Jan 2009 19:52:33 -0600 Subject: [ofa-general] ***SPAM*** loopback in cxgb3? In-Reply-To: References: Message-ID: <49825D61.6040507@opengridcomputing.com> Jaiwant Mulik wrote: > Hi all, > > Looking at the list archives I found some discussion on (the lack of) > hw loopback support for cxgb3. > > Does cxgb3 support loopback? I cannot seem to get it to work with rping. > cxgb3 does not support rdma loopback. It should fail the rdma_connect() call... Steve. From jaiwant at mulik.com Thu Jan 29 18:03:31 2009 From: jaiwant at mulik.com (Jaiwant Mulik) Date: Thu, 29 Jan 2009 21:03:31 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** loopback in cxgb3? In-Reply-To: <07894AD5756EC14BB4573782604815BD451C70A1AF@MAILBOX.desu.edu> References: <07894AD5756EC14BB4573782604815BD451C70A1AF@MAILBOX.desu.edu> Message-ID: <2862F0DE-D6D1-44E5-9970-29F8C0EAC96F@mulik.com> and indeed it does ... just wanted to make sure that I was not doing anything incorrect. 
On Jan 29, 2009, at 8:49 PM, Steve Wise wrote: > Jaiwant Mulik wrote: >> Hi all, >> >> Looking at the list archives I found some discussion on (the lack of) >> hw loopback support for cxgb3. >> >> Does cxgb3 support loopback? I cannot seem to get it to work with >> rping. >> > cxgb3 does not support rdma loopback. It should fail the > rdma_connect() > call... > > Steve. > > ------------------------------------------------------------------ Assistant Professor Computer and Information Sciences Department Delaware State University, Dover, DE (302) 857-7910/6640, http://netlab.cis.desu.edu ------------------------------------------------------------------ Lekin woh zindagi hi kya jisme koi namumkin sapna na ho? From vlad at lists.openfabrics.org Fri Jan 30 03:12:52 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 30 Jan 2009 03:12:52 -0800 (PST) Subject: [ofa-general] ofa_1_4_kernel 20090130-0200 daily build status Message-ID: <20090130111253.35A09E6110A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with 
linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From ofedrnicuser at yahoo.com Fri Jan 30 05:03:40 2009 From: ofedrnicuser at yahoo.com (Ofed User) Date: Fri, 30 Jan 2009 05:03:40 -0800 (PST) Subject: [ofa-general] ***SPAM*** vendor, device files not created under infiniband_verbs/uverbs0 Message-ID: <215301.57111.qm@web111209.mail.gq1.yahoo.com> Hi, I have registered a pseudo RNIC device with the stack. Stack doesn't create the (a) device/vendor (b) device/device files under infinband_verbs/uverbs0 directory. It does create rest of the files. Can someone tell me, which properties should I populate so that these files get created properly? #ls /sys/class/infiniband_verbs/uverbs0/ shows following files. abi_version  dev  ibdev  subsystem  uevent Bill -------------- next part -------------- An HTML attachment was scrubbed... 
From sashak at voltaire.com Fri Jan 30 06:11:43 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 30 Jan 2009 16:11:43 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: CHECK_AND_SET_VAL() macro Message-ID: <20090130141143.GR8534@sashak.voltaire.com> The CHECK_AND_SET_VAL() macro is an SA query encoding helper that removes a lot of code duplication. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 124 ++++++++++------------------------------ 1 files changed, 31 insertions(+), 93 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index c091c49..7562e6d 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -790,6 +790,13 @@ static int parse_lid_and_ports(osm_bind_handle_t h, return 0; } +#define cl_hton8(x) (x) +#define CHECK_AND_SET_VAL(val, size, comp_with, target, name, mask) \ + if (val > comp_with) { \ + target = cl_hton##size(val); \ + comp_mask |= IB_##name##_COMPMASK_##mask; \ + } + /* * Get the portinfo records available with IsSM or IsSMdisabled CapabilityMask bit on. 
*/ @@ -1066,11 +1073,7 @@ static int query_node_records(const struct query_cmd *q, parse_lid_and_ports(h, argv[0], &lid, NULL, NULL); memset(&nr, 0, sizeof(nr)); - - if (lid > 0) { - nr.lid = cl_hton16(lid); - comp_mask |= IB_NR_COMPMASK_LID; - } + CHECK_AND_SET_VAL(lid, 16, 0, nr.lid, NR, LID); status = get_any_records(h, IB_MAD_ATTR_NODE_RECORD, 0, comp_mask, &nr, ib_get_attr_offset(sizeof(nr)), 0); @@ -1095,15 +1098,8 @@ static int query_portinfo_records(const struct query_cmd *q, parse_lid_and_ports(h, argv[0], &lid, &port, NULL); memset(&pir, 0, sizeof(pir)); - - if (lid > 0) { - pir.lid = cl_hton16(lid); - comp_mask |= IB_PIR_COMPMASK_LID; - } - if (port >= 0) { - pir.port_num = port; - comp_mask |= IB_PIR_COMPMASK_PORTNUM; - } + CHECK_AND_SET_VAL(lid, 16, 0, pir.lid, PIR, LID); + CHECK_AND_SET_VAL(port, 8, -1, pir.port_num, PIR, PORTNUM); status = get_any_records(h, IB_MAD_ATTR_PORTINFO_RECORD, 0, comp_mask, &pir, ib_get_attr_offset(sizeof(pir)), 0); @@ -1170,23 +1166,10 @@ static int query_link_records(const struct query_cmd *q, parse_lid_and_ports(h, argv[1], &to_lid, &to_port, NULL); memset(&lr, 0, sizeof(lr)); - - if (from_lid > 0) { - lr.from_lid = cl_hton16(from_lid); - comp_mask |= IB_LR_COMPMASK_FROM_LID; - } - if (from_port >= 0) { - lr.from_port_num = from_port; - comp_mask |= IB_LR_COMPMASK_FROM_PORT; - } - if (to_lid > 0) { - lr.to_lid = cl_hton16(to_lid); - comp_mask |= IB_LR_COMPMASK_TO_LID; - } - if (to_port >= 0) { - lr.to_port_num = to_port; - comp_mask |= IB_LR_COMPMASK_TO_PORT; - } + CHECK_AND_SET_VAL(from_lid, 16, 0, lr.from_lid, LR, FROM_LID); + CHECK_AND_SET_VAL(from_port, 8, -1, lr.from_port_num, LR, FROM_PORT); + CHECK_AND_SET_VAL(to_lid, 16, 0, lr.to_lid, LR, TO_LID); + CHECK_AND_SET_VAL(to_port, 8, -1, lr.to_port_num, LR, TO_PORT); status = get_any_records(h, IB_MAD_ATTR_LINK_RECORD, 0, comp_mask, &lr, ib_get_attr_offset(sizeof(lr)), 0); @@ -1210,19 +1193,9 @@ static int query_sl2vl_records(const struct query_cmd *q, 
parse_lid_and_ports(h, argv[0], &lid, &in_port, &out_port); memset(&slvl, 0, sizeof(slvl)); - - if (lid > 0) { - slvl.lid = cl_hton16(lid); - comp_mask |= IB_SLVL_COMPMASK_LID; - } - if (in_port >= 0) { - slvl.in_port_num = in_port; - comp_mask |= IB_SLVL_COMPMASK_IN_PORT; - } - if (out_port >= 0) { - slvl.out_port_num = out_port; - comp_mask |= IB_SLVL_COMPMASK_OUT_PORT; - } + CHECK_AND_SET_VAL(lid, 16, 0, slvl.lid, SLVL, LID); + CHECK_AND_SET_VAL(in_port, 8, -1, slvl.in_port_num, SLVL, IN_PORT); + CHECK_AND_SET_VAL(out_port, 8, -1, slvl.out_port_num, SLVL, OUT_PORT); status = get_any_records(h, IB_MAD_ATTR_SLVL_RECORD, 0, comp_mask, &slvl, ib_get_attr_offset(sizeof(slvl)), 0); @@ -1246,19 +1219,9 @@ static int query_vlarb_records(const struct query_cmd *q, parse_lid_and_ports(h, argv[0], &lid, &port, &block); memset(&vlarb, 0, sizeof(vlarb)); - - if (lid > 0) { - vlarb.lid = cl_hton16(lid); - comp_mask |= IB_VLA_COMPMASK_LID; - } - if (port >= 0) { - vlarb.port_num = port; - comp_mask |= IB_VLA_COMPMASK_OUT_PORT; - } - if (block >= 0) { - vlarb.block_num = block; - comp_mask |= IB_VLA_COMPMASK_BLOCK; - } + CHECK_AND_SET_VAL(lid, 16, 0, vlarb.lid, VLA, LID); + CHECK_AND_SET_VAL(port, 8, -1, vlarb.port_num, VLA, OUT_PORT); + CHECK_AND_SET_VAL(block, 8, -1, vlarb.block_num, VLA, BLOCK); status = get_any_records(h, IB_MAD_ATTR_VLARB_RECORD, 0, comp_mask, &vlarb, ib_get_attr_offset(sizeof(vlarb)), 0); @@ -1282,19 +1245,9 @@ static int query_pkey_tbl_records(const struct query_cmd *q, parse_lid_and_ports(h, argv[0], &lid, &port, &block); memset(&pktr, 0, sizeof(pktr)); - - if (lid > 0) { - pktr.lid = cl_hton16(lid); - comp_mask |= IB_PKEY_COMPMASK_LID; - } - if (port >= 0) { - pktr.port_num = port; - comp_mask |= IB_PKEY_COMPMASK_PORT; - } - if (block >= 0) { - pktr.block_num = cl_hton16(block); - comp_mask |= IB_PKEY_COMPMASK_BLOCK; - } + CHECK_AND_SET_VAL(lid, 16, 0, pktr.lid, PKEY, LID); + CHECK_AND_SET_VAL(port, 8, -1, pktr.port_num, PKEY, PORT); + 
CHECK_AND_SET_VAL(block, 16, -1, pktr.block_num, PKEY, BLOCK); status = get_any_records(h, IB_MAD_ATTR_PKEY_TBL_RECORD, 0, comp_mask, &pktr, ib_get_attr_offset(sizeof(pktr)), smkey); @@ -1318,15 +1271,8 @@ static int query_lft_records(const struct query_cmd *q, parse_lid_and_ports(h, argv[0], &lid, &block, NULL); memset(&lftr, 0, sizeof(lftr)); - - if (lid > 0) { - lftr.lid = cl_hton16(lid); - comp_mask |= IB_LFTR_COMPMASK_LID; - } - if (block >= 0) { - lftr.block_num = cl_hton16(block); - comp_mask |= IB_LFTR_COMPMASK_BLOCK; - } + CHECK_AND_SET_VAL(lid, 16, 0, lftr.lid, LFTR, LID); + CHECK_AND_SET_VAL(block, 16, -1, lftr.block_num, LFTR, BLOCK); status = get_any_records(h, IB_MAD_ATTR_LFT_RECORD, 0, comp_mask, &lftr, ib_get_attr_offset(sizeof(lftr)), 0); @@ -1344,26 +1290,18 @@ static int query_mft_records(const struct query_cmd *q, ib_mft_record_t mftr; ib_net64_t comp_mask = 0; int lid = 0, block = -1, position = -1; + uint16_t pos = 0; ib_api_status_t status; if (argc > 0) parse_lid_and_ports(h, argv[0], &lid, &position, &block); memset(&mftr, 0, sizeof(mftr)); - - if (lid > 0) { - mftr.lid = cl_hton16(lid); - comp_mask |= IB_MFTR_COMPMASK_LID; - } - if (position >= 0) { - mftr.position_block_num = cl_hton16(position << 12); - comp_mask |= IB_MFTR_COMPMASK_POSITION; - } - if (block >= 0) { - mftr.position_block_num |= - cl_hton16(block & IB_MCAST_BLOCK_ID_MASK_HO); - comp_mask |= IB_MFTR_COMPMASK_BLOCK; - } + CHECK_AND_SET_VAL(lid, 16, 0, mftr.lid, MFTR, LID); + CHECK_AND_SET_VAL(block, 16, -1, mftr.position_block_num, MFTR, BLOCK); + mftr.position_block_num &= cl_hton16(IB_MCAST_BLOCK_ID_MASK_HO); + CHECK_AND_SET_VAL(position, 8, -1, pos, MFTR, POSITION); + mftr.position_block_num |= cl_hton16(pos << 12); status = get_any_records(h, IB_MAD_ATTR_MFT_RECORD, 0, comp_mask, &mftr, ib_get_attr_offset(sizeof(mftr)), 0); -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Fri Jan 30 06:14:16 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 30 Jan 2009 
16:14:16 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: adding query params In-Reply-To: <20090130141143.GR8534@sashak.voltaire.com> References: <20090130141143.GR8534@sashak.voltaire.com> Message-ID: <20090130141416.GS8534@sashak.voltaire.com> This adds query params structure - currently it has only slid, dlid, sgid and dgid and used by PathRecord query. Will be extended by other parameters and can be used in other queries. Also using option allows wildcards now ( 'saquery --src-to-dst 2:' ). Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 212 ++++++++++++++++++++++------------------ 1 files changed, 116 insertions(+), 96 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 7562e6d..414f0e8 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -56,12 +56,17 @@ #include "ibdiag_common.h" +struct query_params { + ib_gid_t sgid, dgid; + uint16_t slid, dlid; +}; + struct query_cmd { const char *name, *alias; ib_net16_t query_type; const char *usage; int (*handler) (const struct query_cmd * q, osm_bind_handle_t h, - int argc, char *argv[]); + struct query_params *p, int argc, char *argv[]); }; static char *node_name_map_file = NULL; @@ -96,6 +101,12 @@ int requested_lid_flag = 0; ib_net64_t requested_guid = 0; int requested_guid_flag = 0; +static unsigned valid_gid(ib_gid_t *gid) +{ + ib_gid_t zero_gid = { }; + return memcmp(&zero_gid, gid, sizeof(*gid)); +} + static void format_buf(char *in, char *out, unsigned size) { unsigned i; @@ -797,6 +808,12 @@ static int parse_lid_and_ports(osm_bind_handle_t h, comp_mask |= IB_##name##_COMPMASK_##mask; \ } +#define CHECK_AND_SET_GID(val, target, name, mask) \ + if (valid_gid(&(val))) { \ + memcpy(&(target), &(val), sizeof(val)); \ + comp_mask |= IB_##name##_COMPMASK_##mask; \ + } + /* * Get the portinfo records available with IsSM or IsSMdisabled CapabilityMask bit on. 
*/ @@ -861,15 +878,14 @@ static ib_api_status_t print_node_records(osm_bind_handle_t h) } static ib_api_status_t -get_print_path_rec_lid(osm_bind_handle_t h, - ib_net16_t src_lid, ib_net16_t dst_lid) +get_print_path_rec_lid(osm_bind_handle_t h, struct query_params *p) { osmv_query_req_t req; osmv_lid_pair_t lid_pair; ib_api_status_t status; - lid_pair.src_lid = cl_hton16(src_lid); - lid_pair.dest_lid = cl_hton16(dst_lid); + lid_pair.src_lid = cl_hton16(p->slid); + lid_pair.dest_lid = cl_hton16(p->dlid); memset(&req, 0, sizeof(req)); @@ -893,21 +909,21 @@ get_print_path_rec_lid(osm_bind_handle_t h, return (result.status); } status = result.status; + printf("Path record for %u -> %u\n", p->slid, p->dlid); dump_results(&result, dump_path_record); return_mad(); return (status); } static ib_api_status_t -get_print_path_rec_gid(osm_bind_handle_t h, - const ib_gid_t * src_gid, const ib_gid_t * dst_gid) +get_print_path_rec_gid(osm_bind_handle_t h, struct query_params *p) { osmv_query_req_t req; osmv_gid_pair_t gid_pair; ib_api_status_t status; - gid_pair.src_gid = *src_gid; - gid_pair.dest_gid = *dst_gid; + gid_pair.src_gid = p->sgid; + gid_pair.dest_gid = p->dgid; memset(&req, 0, sizeof(req)); @@ -968,13 +984,21 @@ static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h) return (status); } -static int query_path_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { - ib_net16_t attr_offset = ib_get_attr_offset(sizeof(ib_path_rec_t)); + ib_path_rec_t pr; + ib_net64_t comp_mask = 0; ib_api_status_t status; - status = get_all_records(h, IB_MAD_ATTR_PATH_RECORD, attr_offset, 0); + memset(&pr, 0, sizeof(pr)); + CHECK_AND_SET_GID(p->sgid, pr.sgid, PR, SGID); + CHECK_AND_SET_GID(p->dgid, pr.dgid, PR, DGID); + CHECK_AND_SET_VAL(p->slid, 16, 0, pr.slid, PR, SLID); + CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, 
DLID); + + status = get_any_records(h, IB_MAD_ATTR_PATH_RECORD, 0, comp_mask, + &pr, ib_get_attr_offset(sizeof(pr)), 0); if (status != IB_SUCCESS) return (status); @@ -1055,14 +1079,14 @@ static ib_api_status_t print_multicast_group_records(osm_bind_handle_t h) return (status); } -static int query_class_port_info(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_class_port_info(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { return get_print_class_port_info(h); } -static int query_node_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_node_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { ib_node_record_t nr; ib_net64_t comp_mask = 0; @@ -1087,7 +1111,8 @@ static int query_node_records(const struct query_cmd *q, } static int query_portinfo_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) + osm_bind_handle_t h, struct query_params *p, + int argc, char *argv[]) { ib_portinfo_record_t pir; ib_net64_t comp_mask = 0; @@ -1112,14 +1137,14 @@ static int query_portinfo_records(const struct query_cmd *q, return 0; } -static int query_mcmember_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_mcmember_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { return print_multicast_member_records(h); } -static int query_service_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_service_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { ib_net16_t attr_offset = ib_get_attr_offset(sizeof(ib_service_record_t)); @@ -1135,7 +1160,8 @@ static int query_service_records(const struct query_cmd *q, } static int query_informinfo_records(const struct 
query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) + osm_bind_handle_t h, struct query_params *p, + int argc, char *argv[]) { ib_net16_t attr_offset = ib_get_attr_offset(sizeof(ib_inform_info_record_t)); @@ -1151,8 +1177,8 @@ static int query_informinfo_records(const struct query_cmd *q, return (status); } -static int query_link_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_link_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { ib_link_record_t lr; ib_net64_t comp_mask = 0; @@ -1181,8 +1207,8 @@ static int query_link_records(const struct query_cmd *q, return status; } -static int query_sl2vl_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_sl2vl_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { ib_slvl_table_record_t slvl; ib_net64_t comp_mask = 0; @@ -1207,8 +1233,8 @@ static int query_sl2vl_records(const struct query_cmd *q, return status; } -static int query_vlarb_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_vlarb_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { ib_vl_arb_table_record_t vlarb; ib_net64_t comp_mask = 0; @@ -1234,7 +1260,8 @@ static int query_vlarb_records(const struct query_cmd *q, } static int query_pkey_tbl_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) + osm_bind_handle_t h, struct query_params *p, + int argc, char *argv[]) { ib_pkey_table_record_t pktr; ib_net64_t comp_mask = 0; @@ -1259,8 +1286,8 @@ static int query_pkey_tbl_records(const struct query_cmd *q, return status; } -static int query_lft_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_lft_records(const struct query_cmd *q, osm_bind_handle_t h, + struct 
query_params *p, int argc, char *argv[]) { ib_lft_record_t lftr; ib_net64_t comp_mask = 0; @@ -1284,8 +1311,8 @@ static int query_lft_records(const struct query_cmd *q, return status; } -static int query_mft_records(const struct query_cmd *q, - osm_bind_handle_t h, int argc, char *argv[]) +static int query_mft_records(const struct query_cmd *q, osm_bind_handle_t h, + struct query_params *p, int argc, char *argv[]) { ib_mft_record_t mftr; ib_net64_t comp_mask = 0; @@ -1455,47 +1482,38 @@ enum saquery_command { static enum saquery_command command = SAQUERY_CMD_QUERY; static ib_net16_t query_type; -static char *src, *dst, *sgid, *dgid; +static char *src_lid, *dst_lid; static int process_opt(void *context, int ch, char *optarg) { + struct query_params *p = context; + switch (ch) { case 1: { - char *opt = strdup(optarg); - char *ch = strchr(opt, ':'); - if (!ch) { - fprintf(stderr, - "ERROR: --src-to-dst :\n"); + src_lid = strdup(optarg); + dst_lid = strchr(src_lid, ':'); + if (!dst_lid) ibdiag_show_usage(); - } - *ch++ = '\0'; - if (*opt) - src = strdup(opt); - if (*ch) - dst = strdup(ch); - free(opt); - command = SAQUERY_CMD_PATH_RECORD; - break; + *dst_lid++ = '\0'; } + command = SAQUERY_CMD_PATH_RECORD; + break; case 2: { - char *opt = strdup(optarg); - char *tok1 = strtok(opt, "-"); - char *tok2 = strtok(NULL, "\0"); - - if (tok1 && tok2) { - sgid = strdup(tok1); - dgid = strdup(tok2); - } else { - fprintf(stderr, - "ERROR: --sgid-to-dgid -\n"); + char *src_addr = strdup(optarg); + char *dst_addr = strchr(src_addr, '-'); + if (!dst_addr) ibdiag_show_usage(); - } - free(opt); - command = SAQUERY_CMD_PATH_RECORD; - break; + *dst_addr++ = '\0'; + if (inet_pton(AF_INET6, src_addr, &p->sgid) <= 0) + ibdiag_show_usage(); + if (inet_pton(AF_INET6, dst_addr, &p->dgid) <= 0) + ibdiag_show_usage(); + free(src_addr); } + command = SAQUERY_CMD_PATH_RECORD; + break; case 3: node_name_map_file = strdup(optarg); break; @@ -1508,7 +1526,7 @@ static int process_opt(void *context, 
int ch, char *optarg) smkey = cl_hton64(strtoull(optarg, NULL, 0)); break; case 'p': - command = SAQUERY_CMD_PATH_RECORD; + query_type = IB_MAD_ATTR_PATH_RECORD; break; case 'D': node_print_desc = ALL_DESC; @@ -1557,6 +1575,20 @@ static int process_opt(void *context, int ch, char *optarg) case 'x': query_type = IB_MAD_ATTR_LINK_RECORD; break; + case 5: + p->slid = strtoul(optarg, NULL, 0); + break; + case 6: + p->dlid = strtoul(optarg, NULL, 0); + break; + case 14: + if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0) + ibdiag_show_usage(); + break; + case 15: + if (inet_pton(AF_INET6, optarg, &p->dgid) <= 0) + ibdiag_show_usage(); + break; default: return -1; } @@ -1567,8 +1599,8 @@ int main(int argc, char **argv) { char usage_args[1024]; osm_bind_handle_t h; + struct query_params params = { }; const struct query_cmd *q; - ib_net16_t src_lid, dst_lid; ib_api_status_t status; int n; @@ -1599,6 +1631,10 @@ int main(int argc, char **argv) {"smkey", 4, 1, "", "SA SM_Key value for the query." " If non-numeric value (like 'x') is specified then" " saquery will prompt for a value"}, + { "slid", 5, 1, "", "Source LID (PathRecord)" }, + { "dlid", 6, 1, "", "Destination LID (PathRecord)" }, + { "sgid", 14, 1, "", "Source GID (IPv6 format) (PathRecord)" }, + { "dgid", 15, 1, "", "Destination GID (IPv6 format) (PathRecord)" }, {} }; @@ -1618,7 +1654,7 @@ int main(int argc, char **argv) q = NULL; ibd_timeout = DEFAULT_SA_TIMEOUT_MS; - ibdiag_process_opts(argc, argv, NULL, "DLGs", opts, process_opt, + ibdiag_process_opts(argc, argv, &params, "DLGs", opts, process_opt, usage_args, NULL); argc -= optind; @@ -1671,36 +1707,22 @@ int main(int argc, char **argv) h = get_bind_handle(); node_name_map = open_node_name_map(node_name_map_file); + if (src_lid && *src_lid) + params.slid = get_lid(h, src_lid); + if (dst_lid && *dst_lid) + params.dlid = get_lid(h, dst_lid); + switch (command) { case SAQUERY_CMD_NODE_RECORD: status = print_node_records(h); break; case SAQUERY_CMD_PATH_RECORD: - if 
(src && dst) { - src_lid = get_lid(h, src); - dst_lid = get_lid(h, dst); - printf("Path record for %s -> %s\n", src, dst); - if (src_lid == 0 || dst_lid == 0) - status = IB_UNKNOWN_ERROR; - else - status = - get_print_path_rec_lid(h, src_lid, dst_lid); - } else if (sgid && dgid) { - struct in6_addr src_addr, dst_addr; - - if (inet_pton(AF_INET6, sgid, &src_addr) <= 0) { - fprintf(stderr, "invalid src gid: %s\n", sgid); - exit(-1); - } - if (inet_pton(AF_INET6, dgid, &dst_addr) <= 0) { - fprintf(stderr, "invalid dst gid: %s\n", dgid); - exit(-1); - } - status = get_print_path_rec_gid(h, - (ib_gid_t *) & src_addr.s6_addr, - (ib_gid_t *) & dst_addr.s6_addr); - } else - status = query_path_records(q, h, 0, NULL); + if (params.slid && params.dlid) + status = get_print_path_rec_lid(h, &params); + else if (valid_gid(&params.sgid) && valid_gid(&params.dgid)) + status = get_print_path_rec_gid(h, &params); + else + status = query_path_records(q, h, &params, 0, NULL); break; case SAQUERY_CMD_CLASS_PORT_INFO: status = get_print_class_port_info(h); @@ -1721,14 +1743,12 @@ int main(int argc, char **argv) ntohs(query_type)); status = IB_UNKNOWN_ERROR; } else - status = q->handler(q, h, argc, argv); + status = q->handler(q, h, &params, argc, argv); break; } - if (src) - free(src); - if (dst) - free(dst); + if (src_lid) + free(src_lid); clean_up(); close_node_name_map(node_name_map); return (status); -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Fri Jan 30 06:15:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 30 Jan 2009 16:15:10 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: more params for Path and MCMember Records In-Reply-To: <20090130141416.GS8534@sashak.voltaire.com> References: <20090130141143.GR8534@sashak.voltaire.com> <20090130141416.GS8534@sashak.voltaire.com> Message-ID: <20090130141510.GT8534@sashak.voltaire.com> This adds many parameters for PathRecord and MCMemberRecord queries (such as mtu, rate, etc.). It also extends the MCMemberRecord dumper. 
Backward compatibility with existing options (-m, -g, --src-to-dst, --sgid-to-dgid) is preserved. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 192 ++++++++++++++++++++++++++++++++++++++-- 1 files changed, 186 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 414f0e8..ecb4480 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -57,8 +57,19 @@ #include "ibdiag_common.h" struct query_params { - ib_gid_t sgid, dgid; - uint16_t slid, dlid; + ib_gid_t sgid, dgid, gid, mgid; + uint16_t slid, dlid, mlid; + uint32_t flow_label; + int hop_limit; + uint8_t tclass; + int reversible, numb_path; + uint16_t pkey; + int qos_class, sl; + uint8_t mtu, rate, pkt_life; + uint32_t qkey; + uint8_t scope; + uint8_t join_state; + int proxy_join; }; struct query_cmd { @@ -304,6 +315,37 @@ static void dump_one_portinfo_record(void *data) cl_ntoh16(pir->lid), pir->port_num, pir->resv, buf2); } +static void dump_one_mcmember_record(void *data) +{ + char mgid[INET6_ADDRSTRLEN], gid[INET6_ADDRSTRLEN]; + ib_member_rec_t *mr = data; + uint32_t flow; + uint8_t sl, hop, scope, join; + ib_member_get_sl_flow_hop(mr->sl_flow_hop, &sl, &flow, &hop); + ib_member_get_scope_state(mr->scope_state, &scope, &join); + printf("MCMember Record dump:\n" + "\t\tMGID....................%s\n" + "\t\tPortGid.................%s\n" + "\t\tqkey....................0x%x\n" + "\t\tmlid....................0x%x\n" + "\t\tmtu.....................0x%x\n" + "\t\tTClass..................0x%x\n" + "\t\tpkey....................0x%x\n" + "\t\trate....................0x%x\n" + "\t\tpkt_life................0x%x\n" + "\t\tSL......................0x%x\n" + "\t\tFlowLabel...............0x%x\n" + "\t\tHopLimit................0x%x\n" + "\t\tScope...................0x%x\n" + "\t\tJoinState...............0x%x\n" + "\t\tProxyJoin...............0x%x\n", + inet_ntop(AF_INET6, mr->mgid.raw, mgid, sizeof(mgid)), + 
inet_ntop(AF_INET6, mr->port_gid.raw, gid, sizeof(gid)), + cl_ntoh32(mr->qkey), cl_ntoh16(mr->mlid), mr->mtu, mr->tclass, + cl_ntoh16(mr->pkey), mr->rate, mr->pkt_life, sl, + cl_ntoh32(flow), hop, scope, join, mr->proxy_join); +} + static void dump_multicast_group_record(void *data) { char gid_str[INET6_ADDRSTRLEN]; @@ -814,6 +856,13 @@ static int parse_lid_and_ports(osm_bind_handle_t h, comp_mask |= IB_##name##_COMPMASK_##mask; \ } +#define CHECK_AND_SET_VAL_AND_SEL(val, target, name, mask, sel) \ + if (val) { \ + target = val; \ + comp_mask |= IB_##name##_COMPMASK_##mask##sel; \ + comp_mask |= IB_##name##_COMPMASK_##mask; \ + } + /* * Get the portinfo records available with IsSM or IsSMdisabled CapabilityMask bit on. */ @@ -990,12 +1039,29 @@ static int query_path_records(const struct query_cmd *q, osm_bind_handle_t h, ib_path_rec_t pr; ib_net64_t comp_mask = 0; ib_api_status_t status; + uint32_t flow = 0; + uint16_t qos_class = 0; + uint8_t reversible = 0; memset(&pr, 0, sizeof(pr)); CHECK_AND_SET_GID(p->sgid, pr.sgid, PR, SGID); CHECK_AND_SET_GID(p->dgid, pr.dgid, PR, DGID); CHECK_AND_SET_VAL(p->slid, 16, 0, pr.slid, PR, SLID); CHECK_AND_SET_VAL(p->dlid, 16, 0, pr.dlid, PR, DLID); + CHECK_AND_SET_VAL(p->hop_limit, 32, -1, pr.hop_flow_raw, PR, HOPLIMIT); + CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, PR, FLOWLABEL); + pr.hop_flow_raw |= cl_hton32(flow << 8); + CHECK_AND_SET_VAL(p->tclass, 8, 0, pr.tclass, PR, TCLASS); + CHECK_AND_SET_VAL(p->reversible, 8, -1, reversible, PR, REVERSIBLE); + CHECK_AND_SET_VAL(p->numb_path, 8, -1, pr.num_path, PR, NUMBPATH); + pr.num_path |= reversible << 7; + CHECK_AND_SET_VAL(p->pkey, 16, 0, pr.pkey, PR, PKEY); + CHECK_AND_SET_VAL(p->sl, 16, -1, pr.qos_class_sl, PR, SL); + CHECK_AND_SET_VAL(p->qos_class, 16, -1, qos_class, PR, QOS_CLASS); + ib_path_rec_set_qos_class(&pr, qos_class); + CHECK_AND_SET_VAL_AND_SEL(p->mtu, pr.mtu, PR, MTU, SELEC); + CHECK_AND_SET_VAL_AND_SEL(p->rate, pr.rate, PR, RATE, SELEC); + 
CHECK_AND_SET_VAL_AND_SEL(p->pkt_life, pr.pkt_life, PR, PKTLIFETIME, SELEC); status = get_any_records(h, IB_MAD_ATTR_PATH_RECORD, 0, comp_mask, &pr, ib_get_attr_offset(sizeof(pr)), 0); @@ -1137,10 +1203,43 @@ static int query_portinfo_records(const struct query_cmd *q, return 0; } -static int query_mcmember_records(const struct query_cmd *q, osm_bind_handle_t h, - struct query_params *p, int argc, char *argv[]) +static int query_mcmember_records(const struct query_cmd *q, + osm_bind_handle_t h, struct query_params *p, + int argc, char *argv[]) { - return print_multicast_member_records(h); + ib_member_rec_t mr; + ib_net64_t comp_mask = 0; + ib_api_status_t status; + uint32_t flow = 0; + uint8_t sl = 0, hop = 0, scope = 0; + + memset(&mr, 0, sizeof(mr)); + CHECK_AND_SET_GID(p->mgid, mr.mgid, MCR, MGID); + CHECK_AND_SET_GID(p->gid, mr.port_gid, MCR, PORT_GID); + CHECK_AND_SET_VAL(p->mlid, 16, 0, mr.mlid, MCR, MLID); + CHECK_AND_SET_VAL(p->qkey, 32, 0, mr.qkey, MCR, QKEY); + CHECK_AND_SET_VAL_AND_SEL(p->mtu, mr.mtu, MCR, MTU, _SEL); + CHECK_AND_SET_VAL_AND_SEL(p->rate, mr.rate, MCR, RATE, _SEL); + CHECK_AND_SET_VAL_AND_SEL(p->pkt_life, mr.pkt_life, MCR, LIFE, _SEL); + CHECK_AND_SET_VAL(p->tclass, 8, 0, mr.tclass, MCR, TCLASS); + CHECK_AND_SET_VAL(p->pkey, 16, 0, mr.pkey, MCR, PKEY); + CHECK_AND_SET_VAL(p->sl, 8, -1, sl, MCR, SL); + CHECK_AND_SET_VAL(p->flow_label, 8, 0, flow, MCR, FLOW); + CHECK_AND_SET_VAL(p->hop_limit, 8, -1, hop, MCR, HOP); + mr.sl_flow_hop = ib_member_set_sl_flow_hop(sl, flow, hop); + CHECK_AND_SET_VAL(p->scope, 8, 0, scope, MCR, SCOPE); + CHECK_AND_SET_VAL(p->join_state, 8, 0, mr.scope_state, MCR, JOIN_STATE); + mr.scope_state |= scope << 4; + CHECK_AND_SET_VAL(p->proxy_join, 8, -1, mr.proxy_join, MCR, PROXY); + + status = get_any_records(h, IB_MAD_ATTR_MCMEMBER_RECORD, 0, comp_mask, + &mr, ib_get_attr_offset(sizeof(mr)), smkey); + if (status != IB_SUCCESS) + return status; + + dump_results(&result, dump_one_mcmember_record); + return_mad(); + 
return status; } static int query_service_records(const struct query_cmd *q, osm_bind_handle_t h, @@ -1581,6 +1680,9 @@ static int process_opt(void *context, int ch, char *optarg) case 6: p->dlid = strtoul(optarg, NULL, 0); break; + case 7: + p->mlid = strtoul(optarg, NULL, 0); + break; case 14: if (inet_pton(AF_INET6, optarg, &p->sgid) <= 0) ibdiag_show_usage(); @@ -1589,6 +1691,59 @@ static int process_opt(void *context, int ch, char *optarg) if (inet_pton(AF_INET6, optarg, &p->dgid) <= 0) ibdiag_show_usage(); break; + case 16: + if (inet_pton(AF_INET6, optarg, &p->gid) <= 0) + ibdiag_show_usage(); + break; + case 17: + if (inet_pton(AF_INET6, optarg, &p->mgid) <= 0) + ibdiag_show_usage(); + break; + case 'r': + p->reversible = strtoul(optarg, NULL, 0); + break; + case 'n': + p->numb_path = strtoul(optarg, NULL, 0); + break; + case 18: + p->pkey = strtoul(optarg, NULL, 0); + break; + case 'Q': + p->qos_class = strtoul(optarg, NULL, 0); + break; + case 19: + p->sl = strtoul(optarg, NULL, 0); + break; + case 'M': + p->mtu = strtoul(optarg, NULL, 0); + break; + case 'R': + p->rate = strtoul(optarg, NULL, 0); + break; + case 20: + p->pkt_life = strtoul(optarg, NULL, 0); + break; + case 'q': + p->qkey = strtoul(optarg, NULL, 0); + break; + case 'T': + p->tclass = strtoul(optarg, NULL, 0); + break; + case 'F': + p->flow_label = strtoul(optarg, NULL, 0); + break; + case 'H': + p->hop_limit = strtoul(optarg, NULL, 0); + break; + case 21: + p->scope = strtoul(optarg, NULL, 0); + break; + case 'J': + p->join_state = strtoul(optarg, NULL, 0); + break; + case 'X': + p->proxy_join = strtoul(optarg, NULL, 0); + break; default: return -1; } @@ -1599,7 +1754,14 @@ int main(int argc, char **argv) { char usage_args[1024]; osm_bind_handle_t h; - struct query_params params = { }; + struct query_params params = { + .hop_limit = -1, + .reversible = -1, + .numb_path = -1, + .qos_class = -1, + .sl = -1, + .proxy_join = -1, + }; const struct query_cmd *q; ib_api_status_t status; int n; 
@@ -1633,8 +1795,26 @@ int main(int argc, char **argv) " saquery will prompt for a value"}, { "slid", 5, 1, "", "Source LID (PathRecord)" }, { "dlid", 6, 1, "", "Destination LID (PathRecord)" }, + { "mlid", 7, 1, "", "Multicast LID (MCMemberRecord)" }, { "sgid", 14, 1, "", "Source GID (IPv6 format) (PathRecord)" }, { "dgid", 15, 1, "", "Destination GID (IPv6 format) (PathRecord)" }, + { "gid", 16, 1, "", "Port GID (MCMemberRecord)" }, + { "mgid", 17, 1, "", "Multicast GID (MCMemberRecord)" }, + { "reversible", 'r', 1, NULL, "Reversible path (PathRecord)" }, + { "numb_path", 'n', 1, NULL, "Number of paths (PathRecord)" }, + { "pkey", 18, 1, NULL, "P_Key (PathRecord, MCMemberRecord)" }, + { "qos_class", 'Q', 1, NULL, "QoS Class (PathRecord)"}, + { "sl", 19, 1, NULL, "Service level (PathRecord, MCMemberRecord)" }, + { "mtu", 'M', 1, NULL, "MTU and selector (PathRecord, MCMemberRecord)" }, + { "rate", 'R', 1, NULL, "Rate and selector (PathRecord, MCMemberRecord)" }, + { "pkt_lifetime", 20, 1, NULL, "Packet lifetime and selector (PathRecord, MCMemberRecord)" }, + { "qkey", 'q', 1, NULL, "Q_Key (MCMemberRecord)" }, + { "tclass", 'T', 1, NULL, "Traffic Class (PathRecord, MCMemberRecord)" }, + { "flow_label", 'F', 1, NULL, "Flow Label (PathRecord, MCMemberRecord)" }, + { "hop_limit", 'H', 1, NULL, "Hop limit (PathRecord, MCMemberRecord)" }, + { "scope", 21, 1, NULL, "Scope (MCMemberRecord)" }, + { "join_state", 'J', 1, NULL, "Join state (MCMemberRecord)" }, + { "proxy_join", 'X', 1, NULL, "Proxy join (MCMemberRecord)" }, {} }; -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Fri Jan 30 06:15:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 30 Jan 2009 16:15:55 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: merge PathRecord query functions In-Reply-To: <20090130141510.GT8534@sashak.voltaire.com> References: <20090130141143.GR8534@sashak.voltaire.com> <20090130141416.GS8534@sashak.voltaire.com> <20090130141510.GT8534@sashak.voltaire.com> 
Message-ID: <20090130141555.GU8534@sashak.voltaire.com> To prevent code duplication and let the user use all of the newly introduced query parameters, merge all functions that query PathRecord into a single query_path_records(). Backward compatibility is preserved (it actually sets NumbPath when the --src-to-dst or --sgid-to-dgid options are used). Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 90 ++-------------------------------------- 1 files changed, 4 insertions(+), 86 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index ecb4480..5361184 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -926,81 +926,6 @@ static ib_api_status_t print_node_records(osm_bind_handle_t h) return (status); } -static ib_api_status_t -get_print_path_rec_lid(osm_bind_handle_t h, struct query_params *p) -{ - osmv_query_req_t req; - osmv_lid_pair_t lid_pair; - ib_api_status_t status; - - lid_pair.src_lid = cl_hton16(p->slid); - lid_pair.dest_lid = cl_hton16(p->dlid); - - memset(&req, 0, sizeof(req)); - - req.query_type = OSMV_QUERY_PATH_REC_BY_LIDS; - req.timeout_ms = ibd_timeout; - req.retry_cnt = 1; - req.flags = OSM_SA_FLAGS_SYNC; - req.query_context = NULL; - req.pfn_query_cb = query_res_cb; - req.p_query_input = (void *)&lid_pair; - req.sm_key = 0; - - if ((status = osmv_query_sa(h, &req)) != IB_SUCCESS) { - fprintf(stderr, "ERROR: Query SA failed: %s\n", - ib_get_err_str(status)); - return (status); - } - if (result.status != IB_SUCCESS) { - fprintf(stderr, "ERROR: Query result returned: %s\n", - ib_get_err_str(result.status)); - return (result.status); - } - status = result.status; - printf("Path record for %u -> %u\n", p->slid, p->dlid); - dump_results(&result, dump_path_record); - return_mad(); - return (status); -} - -static ib_api_status_t -get_print_path_rec_gid(osm_bind_handle_t h, struct query_params *p) -{ - osmv_query_req_t req; - osmv_gid_pair_t gid_pair; - ib_api_status_t 
status; - - gid_pair.src_gid = p->sgid; - gid_pair.dest_gid = p->dgid; - - memset(&req, 0, sizeof(req)); - - req.query_type = OSMV_QUERY_PATH_REC_BY_GIDS; - req.timeout_ms = ibd_timeout; - req.retry_cnt = 1; - req.flags = OSM_SA_FLAGS_SYNC; - req.query_context = NULL; - req.pfn_query_cb = query_res_cb; - req.p_query_input = (void *)&gid_pair; - req.sm_key = 0; - - if ((status = osmv_query_sa(h, &req)) != IB_SUCCESS) { - fprintf(stderr, "ERROR: Query SA failed: %s\n", - ib_get_err_str(status)); - return (status); - } - if (result.status != IB_SUCCESS) { - fprintf(stderr, "ERROR: Query result returned: %s\n", - ib_get_err_str(result.status)); - return (result.status); - } - status = result.status; - dump_results(&result, dump_path_record); - return_mad(); - return (status); -} - static ib_api_status_t get_print_class_port_info(osm_bind_handle_t h) { osmv_query_req_t req; @@ -1572,7 +1497,6 @@ static const struct query_cmd *find_query_by_type(ib_net16_t type) enum saquery_command { SAQUERY_CMD_QUERY, SAQUERY_CMD_NODE_RECORD, - SAQUERY_CMD_PATH_RECORD, SAQUERY_CMD_CLASS_PORT_INFO, SAQUERY_CMD_ISSM, SAQUERY_CMD_MCGROUPS, @@ -1596,7 +1520,8 @@ static int process_opt(void *context, int ch, char *optarg) ibdiag_show_usage(); *dst_lid++ = '\0'; } - command = SAQUERY_CMD_PATH_RECORD; + p->numb_path = 0x7f; + query_type = IB_MAD_ATTR_PATH_RECORD; break; case 2: { @@ -1611,7 +1536,8 @@ static int process_opt(void *context, int ch, char *optarg) ibdiag_show_usage(); free(src_addr); } - command = SAQUERY_CMD_PATH_RECORD; + p->numb_path = 0x7f; + query_type = IB_MAD_ATTR_PATH_RECORD; break; case 3: node_name_map_file = strdup(optarg); @@ -1896,14 +1822,6 @@ int main(int argc, char **argv) case SAQUERY_CMD_NODE_RECORD: status = print_node_records(h); break; - case SAQUERY_CMD_PATH_RECORD: - if (params.slid && params.dlid) - status = get_print_path_rec_lid(h, ¶ms); - else if (valid_gid(¶ms.sgid) && valid_gid(¶ms.dgid)) - status = get_print_path_rec_gid(h, ¶ms); - else - status = 
query_path_records(q, h, &params, 0, NULL); - break; case SAQUERY_CMD_CLASS_PORT_INFO: status = get_print_class_port_info(h); break; -- 1.6.0.4.766.g6fc4a From sashak at voltaire.com Fri Jan 30 06:28:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 30 Jan 2009 16:28:24 +0200 Subject: [ofa-general] Re: [ib-mgmt] duplicate definitions In-Reply-To: <20090129202954.GQ8534@sashak.voltaire.com> References: <20090129202954.GQ8534@sashak.voltaire.com> Message-ID: <20090130142817.GV8534@sashak.voltaire.com> Hi Sean, On 22:29 Thu 29 Jan , Sasha Khapyorsky wrote: > On 11:35 Thu 29 Jan , Sean Hefty wrote: > > > > Stan came across an issue while porting saquery to Windows resulting from > > duplicate definitions for IB_MAD_METHOD_*. These are defined as enums in mad.h, > > but as #define values in ib_types.h. > > Why it should be a problem? ib_types.h is included in saquery.c after > mad.h and as far as I understand "defines" will work instead of "enums". > Opposite inclusion order could cause the problem of course. BTW is it true that WinOF uses its own version of ib_types.h? A long time ago I posted a patch (below) which removes the '#ifdef *WIN*' stuff there. I would be interested to know whether it is still relevant. Sasha commit 467537604b56df11d8eec6cbf2fcb8d3963a6afc Author: Sasha Khapyorsky Date: Fri Nov 30 06:20:13 2007 +0200 opensm/ib_types.h: remove ifdef WIN conditions It was stated a couple of times that on Windows another instance of the ib_types.h file is used. If so we don't need to keep those 'ifdef WIN' conditions here. Also this removes empty __ptr64 macro. 
Signed-off-by: Sasha Khapyorsky diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 09ec257..821c4c9 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -49,20 +49,9 @@ #endif /* __cplusplus */ BEGIN_C_DECLS -#if defined( WIN32 ) || defined( _WIN64 ) -#if defined( EXPORT_AL_SYMBOLS ) -#define OSM_EXPORT __declspec(dllexport) -#else -#define OSM_EXPORT __declspec(dllimport) -#endif -#define OSM_API __stdcall -#define OSM_CDECL __cdecl -#else #define OSM_EXPORT extern #define OSM_API #define OSM_CDECL -#define __ptr64 -#endif /****h* IBA Base/Constants * NAME * Constants @@ -8255,22 +8244,21 @@ typedef struct _ib_ioc_info { /* * The following definitions are shared between the Access Layer and VPD */ -typedef struct _ib_ca *__ptr64 ib_ca_handle_t; -typedef struct _ib_pd *__ptr64 ib_pd_handle_t; -typedef struct _ib_rdd *__ptr64 ib_rdd_handle_t; -typedef struct _ib_mr *__ptr64 ib_mr_handle_t; -typedef struct _ib_mw *__ptr64 ib_mw_handle_t; -typedef struct _ib_qp *__ptr64 ib_qp_handle_t; -typedef struct _ib_eec *__ptr64 ib_eec_handle_t; -typedef struct _ib_cq *__ptr64 ib_cq_handle_t; -typedef struct _ib_av *__ptr64 ib_av_handle_t; -typedef struct _ib_mcast *__ptr64 ib_mcast_handle_t; +typedef struct _ib_ca * ib_ca_handle_t; +typedef struct _ib_pd * ib_pd_handle_t; +typedef struct _ib_rdd * ib_rdd_handle_t; +typedef struct _ib_mr * ib_mr_handle_t; +typedef struct _ib_mw * ib_mw_handle_t; +typedef struct _ib_qp * ib_qp_handle_t; +typedef struct _ib_eec * ib_eec_handle_t; +typedef struct _ib_cq * ib_cq_handle_t; +typedef struct _ib_av * ib_av_handle_t; +typedef struct _ib_mcast * ib_mcast_handle_t; /* Currently for windows branch, use the extended version of ib special verbs struct in order to be compliant with Infinicon ib_types; later we'll change it to support OpenSM ib_types.h */ -#ifndef WIN32 /****d* Access Layer/ib_api_status_t * NAME * ib_api_status_t @@ -10712,8 +10700,4 @@ typedef struct _ib_ci_op { 
*****/ END_C_DECLS -#endif /* ndef WIN32 */ -#if defined( __WIN__ ) -#include -#endif #endif /* __IB_TYPES_H__ */ From shemminger at vyatta.com Fri Jan 30 08:25:23 2009 From: shemminger at vyatta.com (Stephen Hemminger) Date: Fri, 30 Jan 2009 08:25:23 -0800 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090130065721.GA4886@gondor.apana.org.au> References: <20090130065721.GA4886@gondor.apana.org.au> Message-ID: <20090130082523.0fb8fb60@extreme> On Fri, 30 Jan 2009 17:57:21 +1100 Herbert Xu wrote: > Roland Dreier wrote: > > > > OK, thanks... what confused me is that several other drivers also do > > skb_linearize() in their hard_start_xmit method... eg bnx2x, > > via-velocity, mv643xx_eth. So there are several other lurking bugs to > > deal with here I guess. > > I don't know about the rest but bnx2x is certainly OK since it > only does so with IRQ enabled. It is legal to call skb_linearize > as long as you're sure that IRQs are enabled, which is always the > case for hard_start_xmit upon entry. > > So the only time you can't call it in hard_start_xmit is if you've > just disabled IRQs yourself. Or netconsole. netconsole calls start_xmit from IRQ but it is safe since netconsole doesn't send fragmented skb's From hartlch14 at gmail.com Fri Jan 30 08:37:29 2009 From: hartlch14 at gmail.com (Chuck Hartley) Date: Fri, 30 Jan 2009 11:37:29 -0500 Subject: [ofa-general] Unable to get IPoIB working Message-ID: I've still been unable to get IPoIB working. Could someone running FC9 verify that IPoIB is working under OFED 1.4 please? The problem seems to be that although ifconfig shows the ib0 interface as being up, using 'ip addr show ib0' displays "NO-CARRIER" and "state DOWN" (which is what is in /sys/class/net/ib0). I assume the IPoIB module is supposed to set that stuff appropriately, but it does not seem to be in this case. It would be helpful to know that it is working for someone using Fedora 9... 
Thanks, Chuck From sean.hefty at intel.com Fri Jan 30 09:28:43 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 30 Jan 2009 09:28:43 -0800 Subject: [ofa-general] RE: [ib-mgmt] duplicate definitions In-Reply-To: <20090130142817.GV8534@sashak.voltaire.com> References: <20090129202954.GQ8534@sashak.voltaire.com> <20090130142817.GV8534@sashak.voltaire.com> Message-ID: <4ACDBED370D843D68021495AD5159794@amr.corp.intel.com> >BTW is it true that WinOF uses its own version of ib_types.h? > >A lot of time ago I posted patch (below) which removes '#ifdef *WIN*' >stuff there. Would be interested to know is it relevant yet? WinOF has 2 copies of ib_types.h in the SVN tree, neither of which is sufficient as is for porting saquery. Personally, I'd like to replace both with a single copy that matches the ib_types.h file maintained in the git tree, and get all of the ib-management code to support both Windows and Linux. The code has diverged so much at this point that it would just take some time to reach that point. - Sean From rdreier at cisco.com Fri Jan 30 09:35:52 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 30 Jan 2009 09:35:52 -0800 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090130065721.GA4886@gondor.apana.org.au> (Herbert Xu's message of "Fri, 30 Jan 2009 17:57:21 +1100") References: <20090130065721.GA4886@gondor.apana.org.au> Message-ID: > > OK, thanks... what confused me is that several other drivers also do > > skb_linearize() in their hard_start_xmit method... eg bnx2x, > > via-velocity, mv643xx_eth. So there are several other lurking bugs to > > deal with here I guess. > > I don't know about the rest but bnx2x is certainly OK since it > only does so with IRQ enabled. It is legal to call skb_linearize > as long as you're sure that IRQs are enabled, which is always the > case for hard_start_xmit upon entry. I don't believe this is accurate. 
Calling skb_linearize() (on a kernel with CONFIG_HIGHMEM set) can end up calling local_bh_enable() in kunmap_skb_frag(), which can obviously cause problems if the initial context relies on having BHs disabled (as hard_start_xmit does). - R. From Jie.Cai at cs.anu.edu.au Thu Jan 29 22:53:12 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 30 Jan 2009 17:53:12 +1100 Subject: [ofa-general] Multiports single HCA uDAPL program problem In-Reply-To: <20090129200005.20863E61234@openfabrics.org> References: <20090129200005.20863E61234@openfabrics.org> Message-ID: <4982A3D8.5030503@cs.anu.edu.au> Hi All, I am fairly new to IB and uDAPL programming. Currently, I am trying to write a multirail program that uses both ports of a single Mellanox ConnectX HCA on each node. OFED 1.3 is installed on a SUSE 10.3 Linux system. The current problem is that IB connections via uDAPL are very unstable, and sometimes the connection cannot be established. The error message usually looks like this: 20350 Server waiting for connect request on port 45248 accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2) 20350 Error dat_cr_accept: DAT_INTERNAL_ERROR 20350 Error connect_ep: DAT_INTERNAL_ERROR Both ports are in the active state: hca_id: mlx4_0 fw_ver: 2.3.000 node_guid: 0003:ba00:0100:702c sys_image_guid: 0003:ba00:0100:702f vendor_id: 0x02c9 vendor_part_id: 25418 hw_ver: 0xA0 board_id: SUN0070000001 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 10 port_lid: 8 port_lmc: 0x00 port: 2 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 10 port_lid: 9 port_lmc: 0x00 I haven't done any specific configuration for multi-port. I assume that OFED 1.3 can do it automatically. Could anyone please help me with this? 
Regards, Jie -- Jie Cai From herbert at gondor.apana.org.au Thu Jan 29 22:57:21 2009 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Fri, 30 Jan 2009 17:57:21 +1100 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: Message-ID: <20090130065721.GA4886@gondor.apana.org.au> Roland Dreier wrote: > > OK, thanks... what confused me is that several other drivers also do > skb_linearize() in their hard_start_xmit method... eg bnx2x, > via-velocity, mv643xx_eth. So there are several other lurking bugs to > deal with here I guess. I don't know about the rest but bnx2x is certainly OK since it only does so with IRQ enabled. It is legal to call skb_linearize as long as you're sure that IRQs are enabled, which is always the case for hard_start_xmit upon entry. So the only time you can't call it in hard_start_xmit is if you've just disabled IRQs yourself. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu 许志壬 Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From eilong at broadcom.com Fri Jan 30 00:22:42 2009 From: eilong at broadcom.com (Eilon Greenstein) Date: Fri, 30 Jan 2009 10:22:42 +0200 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090130065721.GA4886@gondor.apana.org.au> References: <20090130065721.GA4886@gondor.apana.org.au> Message-ID: <1233303762.26509.15.camel@lb-tlvb-eliezer> On Thu, 2009-01-29 at 22:57 -0800, Herbert Xu wrote: > Roland Dreier wrote: > > > > OK, thanks... what confused me is that several other drivers also do > > skb_linearize() in their hard_start_xmit method... eg bnx2x, > > via-velocity, mv643xx_eth. So there are several other lurking bugs to > > deal with here I guess. > > I don't know about the rest but bnx2x is certainly OK since it > only does so with IRQ enabled. It is legal to call skb_linearize > as long as you're sure that IRQs are enabled, which is always the > case for hard_start_xmit upon entry. 
> > So the only time you can't call it in hard_start_xmit is if you've > just disabled IRQs yourself. > Thanks Herbert, That was my conclusion too, and even though I have been looking at it since this report yesterday, it still looks OK to me. The bnx2x driver gets into this flow when using Samba, and it was tested on a few systems for days under traffic. This does not mean that the code is right, and it is not proof that there is no bug, but it makes me feel better... I would appreciate some comments if someone still thinks this is a bug. Thanks, Eilon From sean.hefty at intel.com Fri Jan 30 10:51:15 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 30 Jan 2009 10:51:15 -0800 Subject: [ofa-general] [PATCH 1/5] [DAPL] dapl: sync with WinOF tree Message-ID: The common DAPL codebase shared between Windows and Linux has diverged. Resync to get back to common code that builds on both operating systems. Signed-off-by: Sean Hefty --- dapl/common/dapl_adapter_util.h | 2 +- dapl/common/dapl_debug.c | 2 +- dapl/common/dapl_evd_util.c | 4 ---- dapl/common/dapl_sp_util.c | 4 ++-- dapl/include/dapl.h | 2 +- dapl/udapl/linux/dapl_osd.h | 15 +++++++++++++++ dapl/udapl/windows/dapl_osd.h | 12 ++++++++++-- dat/udat/SOURCES | 6 +++++- dat/udat/udat_exports.src | 2 ++ dat/udat/windows/dat_osd.c | 10 ++++------ dat/udat/windows/dat_osd.h | 4 ++-- 11 files changed, 43 insertions(+), 20 deletions(-) diff --git a/dapl/common/dapl_adapter_util.h b/dapl/common/dapl_adapter_util.h index c5bf5da..e3069f8 100755 --- a/dapl/common/dapl_adapter_util.h +++ b/dapl/common/dapl_adapter_util.h @@ -247,7 +247,7 @@ int dapls_ib_private_data_size ( void dapls_query_provider_specific_attr( IN DAPL_IA *ia_ptr, - IN DAT_PROVIDER_ATTR *provider_attr ); + IN DAT_PROVIDER_ATTR *attr_ptr ); #ifdef CQ_WAIT_OBJECT DAT_RETURN diff --git a/dapl/common/dapl_debug.c b/dapl/common/dapl_debug.c index cbc356c..e717591 100644 --- a/dapl/common/dapl_debug.c +++ b/dapl/common/dapl_debug.c @@ -53,7 +53,7 @@ void 
dapl_internal_dbg_log ( DAPL_DBG_TYPE type, const char *fmt, ...) if ( DAPL_DBG_DEST_STDOUT & g_dapl_dbg_dest ) { va_start (args, fmt); - fprintf(stdout, "%s:%d: ", _ptr_host_, getpid()); + fprintf(stdout, "%s:%d: ", _ptr_host_, dapl_os_getpid()); dapl_os_vprintf (fmt, args); va_end (args); } diff --git a/dapl/common/dapl_evd_util.c b/dapl/common/dapl_evd_util.c index 1d13ce0..c6c7463 100644 --- a/dapl/common/dapl_evd_util.c +++ b/dapl/common/dapl_evd_util.c @@ -48,10 +48,6 @@ #include "dapl_sp_util.h" #include "dapl_ep_util.h" -#include -#include -#include - STATIC _INLINE_ void dapli_evd_eh_print_cqe ( IN ib_work_completion_t *cqe); diff --git a/dapl/common/dapl_sp_util.c b/dapl/common/dapl_sp_util.c index 5ac0660..cdebc67 100644 --- a/dapl/common/dapl_sp_util.c +++ b/dapl/common/dapl_sp_util.c @@ -98,7 +98,7 @@ dapls_sp_alloc ( dapl_llist_init_entry (&sp_ptr->header.ia_list_entry); dapl_os_lock_init (&sp_ptr->header.lock); -#if defined(_WIN32) || defined(_WIN64) +#if defined(_VENDOR_IBAL_) dapl_os_wait_object_init( &sp_ptr->wait_object ); #endif /* @@ -133,7 +133,7 @@ dapls_sp_free_sp ( sp_ptr->header.magic == DAPL_MAGIC_RSP); dapl_os_assert (dapl_llist_is_empty (&sp_ptr->cr_list_head)); -#if defined(_WIN32) || defined(_WIN64) +#if defined(_VENDOR_IBAL_) dapl_os_wait_object_destroy( &sp_ptr->wait_object ); #endif dapl_os_lock (&sp_ptr->header.lock); diff --git a/dapl/include/dapl.h b/dapl/include/dapl.h index 58af95d..e2025ce 100755 --- a/dapl/include/dapl.h +++ b/dapl/include/dapl.h @@ -543,7 +543,7 @@ struct dapl_sp ib_cm_srvc_handle_t cm_srvc_handle; /* Used by Mellanox CM */ DAPL_LLIST_HEAD cr_list_head; /* CR pending queue */ DAT_COUNT cr_list_count; /* count of CRs on queue */ -#if _VENDOR_IBAL_ +#if defined(_VENDOR_IBAL_) DAPL_OS_WAIT_OBJECT wait_object; /* cancel & destroy. 
*/ #endif }; diff --git a/dapl/udapl/linux/dapl_osd.h b/dapl/udapl/linux/dapl_osd.h index 42ced41..6fef9af 100644 --- a/dapl/udapl/linux/dapl_osd.h +++ b/dapl/udapl/linux/dapl_osd.h @@ -67,17 +67,31 @@ #include #include #include /* for getaddrinfo */ +#include + +#include /* for IOCTL's */ #include "dapl_debug.h" /* * Include files for setting up a network name */ +#include /* for socket(2) */ +#include /* for struct ifreq */ +#include /* for ARPHRD_ETHER */ #include #include #include +#include +#include #include +#include +#include +#include +#include +#include + #if !defined(REDHAT_EL5) && (defined(__ia64__)) #include #endif @@ -543,6 +557,7 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) #define dapl_os_syslog(fmt,args) vsyslog(LOG_USER|LOG_WARNING,fmt,args) +#define dapl_os_getpid getpid #endif /* _DAPL_OSD_H_ */ diff --git a/dapl/udapl/windows/dapl_osd.h b/dapl/udapl/windows/dapl_osd.h index 9ed2559..43f70ee 100644 --- a/dapl/udapl/windows/dapl_osd.h +++ b/dapl/udapl/windows/dapl_osd.h @@ -56,6 +56,7 @@ #include #include #include +#include #include #pragma warning ( pop ) @@ -84,6 +85,9 @@ exit(1); \ } +#define openlog(...) +#define closelog(...) 
+ /* * Atomic operations */ @@ -459,7 +463,7 @@ typedef unsigned long DAPL_OS_TICKS; */ STATIC __inline void dapl_os_sleep_usec (int sleep_time) { - Sleep(sleep_time/1000); + Sleep(sleep_time/1000); // convert to milliseconds } STATIC __inline DAPL_OS_TICKS dapl_os_get_ticks (void); @@ -513,6 +517,11 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) return strtol(nptr, endptr, base); } +STATIC __inline int +dapl_os_getpid(void) +{ + return (int)GetCurrentProcessId(); +} /* * Debug Helper Functions @@ -524,7 +533,6 @@ dapl_os_strtol(const char *nptr, char **endptr, int base) #define dapl_os_vprintf(fmt,args) vprintf(fmt,args) #define dapl_os_syslog(fmt,args) /* XXX Need log routine call */ - #endif /* _DAPL_OSD_H_ */ /* diff --git a/dat/udat/SOURCES b/dat/udat/SOURCES index a205ebc..7d37f65 100644 --- a/dat/udat/SOURCES +++ b/dat/udat/SOURCES @@ -6,8 +6,12 @@ TARGETNAME=dat2d TARGETPATH=..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) TARGETTYPE=DYNLINK DLLENTRY=_DllMainCRTStartup +!if $(_NT_TOOLS_VERSION) == 0x700 DLLDEF=$O\udat_exports.def -USE_CRTDLL=1 +!else +DLLDEF=$(OBJ_PATH)\$O\udat_exports.def +!endif +USE_MSVCRT=1 SOURCES=udat.rc \ udat.c \ diff --git a/dat/udat/udat_exports.src b/dat/udat/udat_exports.src index 5493be4..42e3773 100644 --- a/dat/udat/udat_exports.src +++ b/dat/udat/udat_exports.src @@ -42,6 +42,7 @@ dat_pz_free dat_pz_query dat_registry_add_provider dat_registry_remove_provider +dat_registry_list_providers dat_rmr_bind dat_rmr_create dat_rmr_free @@ -50,6 +51,7 @@ dat_rsp_create dat_rsp_free dat_rsp_query dat_strerror +dats_get_ia_handle #ifdef DAT_EXTENSIONS dat_extension_op #endif diff --git a/dat/udat/windows/dat_osd.c b/dat/udat/windows/dat_osd.c index 9f21e8f..5b57f43 100644 --- a/dat/udat/windows/dat_osd.c +++ b/dat/udat/windows/dat_osd.c @@ -113,16 +113,14 @@ dat_os_dbg_print ( { va_list args; - va_start(args, fmt); - if ( DAT_OS_DBG_DEST_STDOUT & g_dbg_dest ) { + va_start(args, fmt); vfprintf(stdout, fmt, args); - } - - 
va_end(args); - fflush(stdout); + va_end(args); + } + /* no syslog() support in Windows */ } } diff --git a/dat/udat/windows/dat_osd.h b/dat/udat/windows/dat_osd.h index 2e18988..d78fe44 100644 --- a/dat/udat/windows/dat_osd.h +++ b/dat/udat/windows/dat_osd.h @@ -48,8 +48,8 @@ #error "UNDEFINED OS TYPE" #endif /* WIN32/64 */ -#include -#include +#include +#include #include #include From sean.hefty at intel.com Fri Jan 30 10:54:30 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 30 Jan 2009 10:54:30 -0800 Subject: [ofa-general] [PATCH 2/5] [DAPL] dapl/dapltest: sync OFED and WinOF code In-Reply-To: References: Message-ID: From: Stan Smith Several changes have been made to dapltest on the WinOF tree that are not available to OFED. Resync dapltest to return to a common codebase. Main features are: Add return codes to various functions. Add batch file for Windows testing. Signed-off-by: Sean Hefty --- Actual changes are from previous work done by Stan. I'm simply resync'ing the code. 
test/dapltest/cmd/dapl_main.c | 11 ++- test/dapltest/common/dapl_endian.c | 1 test/dapltest/dapltest.rc | 50 ------------ test/dapltest/dirs | 1 test/dapltest/include/dapl_execute.h | 2 test/dapltest/include/dapl_proto.h | 14 ++- test/dapltest/include/dapl_tdep.h | 2 test/dapltest/makefile.wnd | 7 -- test/dapltest/mdep/windows/dapl_mdep_user.c | 7 +- test/dapltest/mdep/windows/dapl_mdep_user.h | 2 test/dapltest/scripts/dt-cli.bat | 101 ++++++++++++++++++++++--- test/dapltest/test/dapl_client.c | 94 +++++++++++++---------- test/dapltest/test/dapl_execute.c | 22 +++-- test/dapltest/test/dapl_fft_test.c | 5 + test/dapltest/test/dapl_limit.c | 107 ++++++++++++++------------ test/dapltest/test/dapl_performance_client.c | 5 + test/dapltest/test/dapl_transaction_test.c | 21 +---- test/dapltest/udapl/udapl_tdep.c | 4 - test/dapltest/windows/SOURCES | 33 ++++++++ test/dapltest/windows/dapltest.rc | 50 ++++++++++++ test/dapltest/windows/makefile | 7 ++ 21 files changed, 338 insertions(+), 208 deletions(-) diff --git a/test/dapltest/cmd/dapl_main.c b/test/dapltest/cmd/dapl_main.c index b6be403..c562375 100644 --- a/test/dapltest/cmd/dapl_main.c +++ b/test/dapltest/cmd/dapl_main.c @@ -45,6 +45,7 @@ int dapltest (int argc, char *argv[]) { Params_t *params_ptr; + DAT_RETURN rc=DAT_SUCCESS; /* check memory leaking */ /* @@ -97,7 +98,7 @@ dapltest (int argc, char *argv[]) } params_ptr->cpu_mhz = DT_Mdep_GetCpuMhz (); /* call the test-dependent code for invoking the actual test */ - DT_Tdep_Execute_Test (params_ptr); + rc = DT_Tdep_Execute_Test (params_ptr); /* cleanup */ @@ -116,7 +117,7 @@ dapltest (int argc, char *argv[]) * alloc_count); DT_Mdep_LockDestroy(&Alloc_Count_Lock); */ - return ( 0 ); + return ( rc ); } @@ -124,7 +125,7 @@ void Dapltest_Main_Usage (void) { DT_Mdep_printf ("USAGE:\n"); - DT_Mdep_printf ("USAGE: dapltest -T [test-specific args]\n"); + DT_Mdep_printf ("USAGE: dapltest -T [-D IA_name] [test-specific args]\n"); DT_Mdep_printf ("USAGE: where \n"); 
DT_Mdep_printf ("USAGE: S = Run as a server\n"); DT_Mdep_printf ("USAGE: T = Transaction Test\n"); @@ -133,7 +134,9 @@ Dapltest_Main_Usage (void) DT_Mdep_printf ("USAGE: L = Limit Test\n"); DT_Mdep_printf ("USAGE: F = FFT Test\n"); DT_Mdep_printf ("USAGE:\n"); - DT_Mdep_printf ("NOTE:\tRun as server taking defaults (dapltest -T S)\n"); + DT_Mdep_printf ("USAGE: -D Interface_Adapter {default ibnic0v2}\n"); + DT_Mdep_printf ("USAGE:\n"); + DT_Mdep_printf ("NOTE:\tRun as server taking defaults (dapltest -T S [-D ibnic0v2])\n"); DT_Mdep_printf ("NOTE: dapltest\n"); DT_Mdep_printf ("NOTE:\n"); DT_Mdep_printf ("NOTE:\tdapltest arguments may be supplied in a script file\n"); diff --git a/test/dapltest/common/dapl_endian.c b/test/dapltest/common/dapl_endian.c index 84ff573..5c1846f 100644 --- a/test/dapltest/common/dapl_endian.c +++ b/test/dapltest/common/dapl_endian.c @@ -94,6 +94,7 @@ DAT_UINT64 DT_EndianMemAddress (DAT_UINT64 val) { DAT_UINT64 val64; + if (DT_local_is_little_endian) return val; val64 = val; diff --git a/test/dapltest/dapltest.rc b/test/dapltest/dapltest.rc deleted file mode 100644 index 09fef62..0000000 --- a/test/dapltest/dapltest.rc +++ /dev/null @@ -1,50 +0,0 @@ -/* - * Copyright (c) 2007 Intel Corporation. All rights reserved. - * - * This software is available to you under the OpenIB.org BSD license - * below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - - -#include - -#define VER_FILETYPE VFT_APP -#define VER_FILESUBTYPE VFT2_UNKNOWN - -#if DBG -#define VER_FILEDESCRIPTION_STR "DAPL/DAT[2.0] test Application (Debug)" -#define VER_INTERNALNAME_STR "dapl2testd.exe" -#define VER_ORIGINALFILENAME_STR "dapl2testd.exe" - -#else -#define VER_FILEDESCRIPTION_STR "DAPL/DAT[2.0] test Application" -#define VER_INTERNALNAME_STR "dapl2test.exe" -#define VER_ORIGINALFILENAME_STR "dapl2test.exe" - -#endif - -#include diff --git a/test/dapltest/dirs b/test/dapltest/dirs new file mode 100644 index 0000000..f3b2bb0 --- /dev/null +++ b/test/dapltest/dirs @@ -0,0 +1 @@ +dirs = windows diff --git a/test/dapltest/include/dapl_execute.h b/test/dapltest/include/dapl_execute.h index a5992f2..eb2d54a 100644 --- a/test/dapltest/include/dapl_execute.h +++ b/test/dapltest/include/dapl_execute.h @@ -34,7 +34,7 @@ #include "dapl_proto.h" #include "dapl_params.h" -void +DAT_RETURN DT_Execute_Test ( Params_t *params_ptr ) ; #endif diff --git a/test/dapltest/include/dapl_proto.h b/test/dapltest/include/dapl_proto.h index 9de42e2..d8be354 100644 --- a/test/dapltest/include/dapl_proto.h +++ b/test/dapltest/include/dapl_proto.h @@ -67,6 +67,8 @@ #include "dapl_transaction_stats.h" #include "dapl_version.h" +#define DAT_ERROR(Type,SubType) ((DAT_RETURN)(DAT_CLASS_ERROR | Type | SubType)) + /* * Prototypes */ @@ -101,7 +103,7 @@ int get_ep_connection_state (DT_Tdep_Print_Head* phead, DAT_EP_HANDLE ep_handle); /* dapl_client.c */ -void 
DT_cs_Client (Params_t * params_ptr, +DAT_RETURN DT_cs_Client (Params_t * params_ptr, char *dapl_name, char *server_name, DAT_UINT32 total_threads); @@ -237,7 +239,7 @@ void DT_Performance_Cmd_PT_Print (DT_Tdep_Print_Head* phead, void DT_Performance_Cmd_Endian (Performance_Cmd_t * cmd); /* dapl_performance_client.c */ -void DT_Performance_Test_Client ( Params_t *params_ptr, +DAT_RETURN DT_Performance_Test_Client ( Params_t *params_ptr, Per_Test_Data_t * pt_ptr, DAT_IA_HANDLE * ia_handle, DAT_IA_ADDRESS_PTR remote); @@ -249,7 +251,7 @@ bool DT_Performance_Test_Client_Connect ( bool DT_Performance_Test_Client_Exchange ( Params_t *params_ptr, DT_Tdep_Print_Head *phead, - Performance_Test_t *test_ptr); + Performance_Test_t *test_ptr ); /* dapl_performance_server.c */ void DT_Performance_Test_Server (void * pt_ptr); @@ -467,7 +469,7 @@ void DT_Transaction_Cmd_PT_Print (DT_Tdep_Print_Head* phead, void DT_Transaction_Cmd_Endian (Transaction_Cmd_t * cmd, bool to_wire); /* dapl_transaction_test.c */ -void DT_Transaction_Test_Client (Per_Test_Data_t * pt_ptr, +DAT_RETURN DT_Transaction_Test_Client (Per_Test_Data_t * pt_ptr, DAT_IA_HANDLE ia_handle, DAT_IA_ADDRESS_PTR remote); @@ -557,7 +559,7 @@ bool DT_Limit_Cmd_Parse ( Limit_Cmd_t * cmd, void DT_Limit_Cmd_Usage (void); /* dapl_limit.c */ -void DT_cs_Limit (Params_t *params, Limit_Cmd_t * cmd); +DAT_RETURN DT_cs_Limit (Params_t *params, Limit_Cmd_t * cmd); /* dapl_fft_cmd.c */ void DT_FFT_Cmd_Init ( FFT_Cmd_t * cmd); @@ -570,7 +572,7 @@ bool DT_FFT_Cmd_Parse ( FFT_Cmd_t * cmd, void DT_FFT_Cmd_Usage (void); /* dapl_fft_test.c */ -void DT_cs_FFT (Params_t *params, FFT_Cmd_t * cmd); +DAT_RETURN DT_cs_FFT (Params_t *params, FFT_Cmd_t * cmd); /* dapl_fft_hwconn.c */ void DT_hwconn_test (Params_t *params_ptr, FFT_Cmd_t *cmd); diff --git a/test/dapltest/include/dapl_tdep.h b/test/dapltest/include/dapl_tdep.h index bf314c3..ccddce5 100644 --- a/test/dapltest/include/dapl_tdep.h +++ b/test/dapltest/include/dapl_tdep.h @@ -44,7 +44,7 
@@ DT_Tdep_Init ( void ) ; void DT_Tdep_End ( void ) ; -void +DAT_RETURN DT_Tdep_Execute_Test ( Params_t *params_ptr ) ; DAT_RETURN diff --git a/test/dapltest/makefile.wnd b/test/dapltest/makefile.wnd deleted file mode 100644 index e26e1c0..0000000 --- a/test/dapltest/makefile.wnd +++ /dev/null @@ -1,7 +0,0 @@ -# -# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source -# file to this component. This file merely indirects to the real make file -# that is shared by all the driver components of the OpenIB Windows project. -# - -!INCLUDE ..\..\..\..\inc\openib.def diff --git a/test/dapltest/mdep/windows/dapl_mdep_user.c b/test/dapltest/mdep/windows/dapl_mdep_user.c index 0b9321c..afc12d3 100644 --- a/test/dapltest/mdep/windows/dapl_mdep_user.c +++ b/test/dapltest/mdep/windows/dapl_mdep_user.c @@ -140,7 +140,7 @@ DT_Mdep_GetCpuMhz ( unsigned long -DT_Mdep_GetContextSwitchNum (void ) +DT_Mdep_GetContextSwitchNum (void) { return 0; } @@ -273,7 +273,7 @@ DT_Mdep_Thread_Start_Routine (void *thread_handle) * interface to clean up resources properly at * thread's end. */ -void DT_Mdep_Thread_Detach ( int thread_id ) /* AMM */ +void DT_Mdep_Thread_Detach (DT_Mdep_ThreadHandleType thread_id ) /* AMM */ { } @@ -283,9 +283,8 @@ void DT_Mdep_Thread_Detach ( int thread_id ) /* AMM */ * upon themselves. 
*/ -int DT_Mdep_Thread_SELF (void) /* AMM */ +DT_Mdep_ThreadHandleType DT_Mdep_Thread_SELF (void) /* AMM */ { - return 0; } diff --git a/test/dapltest/mdep/windows/dapl_mdep_user.h b/test/dapltest/mdep/windows/dapl_mdep_user.h index 6dd3f7f..7b7a845 100644 --- a/test/dapltest/mdep/windows/dapl_mdep_user.h +++ b/test/dapltest/mdep/windows/dapl_mdep_user.h @@ -48,7 +48,7 @@ # include /* Default Device Name */ -#define DT_MdepDeviceName "ibnic0" +#define DT_MdepDeviceName "ibnic0v2" /* Boolean */ typedef int bool; diff --git a/test/dapltest/scripts/dt-cli.bat b/test/dapltest/scripts/dt-cli.bat index acd06df..ddf5eaa 100644 --- a/test/dapltest/scripts/dt-cli.bat +++ b/test/dapltest/scripts/dt-cli.bat @@ -5,6 +5,15 @@ rem SETLOCAL +rem cmd.exe /V:on (delayed environment variable expansion) is required! +rem restart with /V:on if necessary +set F=on +set F=off +if not "!F!" == "off" ( + %comspec% /E:on /V:on /C %0 %1 %2 %3 %4 + exit /B %ERRORLEVEL% +) + rem set DAT_OVERRIDE=D:\dapl2\dat.conf rem favor DAT 2.0 (dapl2test.exe) over DAT 1.1 (dapltest.exe) @@ -19,6 +28,7 @@ if "%0" == "dt-cli" ( if EXIST %PF%\dapl2test.exe ( set DT=dapl2test.exe set D=ibnic0v2 +rem To debug dapl2test - use dapl2testd.exe with ibnic0v2d goto OK ) ) @@ -78,55 +88,88 @@ rem client SR 1024 3 -f server SR 256 3 -f if "%T%" == "conn" ( rem Connectivity test - client sends one buffer with one 4KB segments, one time. rem add '-d' for debug output. 
+ echo Simple Connectivity test %DT% -T T -s %S% -D %D% -i 1 -t 1 -w 1 client SR 4096 server SR 4096 exit /B ) if "%T%" == "trans" ( - echo Transaction test - 8192 iterations, 1 thread, SR 4KB buffers + echo %T%: Transaction test - 8192 iterations, 1 thread, SR 4KB buffers %DT% -T T -s %S% -D %D% -i 8192 -t 1 -w 1 client SR 4096 server SR 4096 + echo Finished %T%: Transaction test - 8192 iterations, 1 thread, SR 4KB buffers exit /B ) if "%T%" == "transm" ( - echo Multiple RW, RR, SR transactions, 4096 iterations + echo %T%: Multiple RW, RR, SR transactions, 4096 iterations %DT% -T T -P -t 1 -w 1 -i 4096 -s %S% -D %D% client RW 4096 1 server RW 2048 4 server RR 1024 2 client RR 2048 2 client SR 1024 3 -f server SR 256 3 -f + echo Finished %T%: Multiple RW, RR, SR transactions, 4096 iterations exit /B ) if "%T%" == "transt" ( - echo Multi-threaded[4] Transaction test - 4096 iterations, 1 thread, SR 4KB buffers + echo %T%: Threads[4] Transaction test - 4096 iterations, 1 thread, SR 4KB buffers %DT% -T T -s %S% -D %D% -i 4096 -t 4 -w 1 client SR 8192 3 server SR 8192 3 + echo Finished %T%: Threads[4] Transaction test - 4096 iterations, 1 thread, SR 4KB buffers exit /B ) if "%T%" == "transme" ( - echo Multiple endpoints[4] transactions [RW, RR, SR], 4096 iterations + echo %T%: 1 Thread Endpoints[4] transactions [RW, RR, SR], 4096 iterations %DT% -T T -P -t 1 -w 4 -i 4096 -s %S% -D %D% client RW 4096 1 server RW 2048 4 server RR 1024 2 client RR 2048 2 client SR 1024 3 -f server SR 256 3 -f + echo Finished %T%: 1 Thread Endpoints[4] transactions [RW, RR, SR], 4096 iterations exit /B ) if "%T%" == "transmet" ( - echo Multiple: threads[2] endpoints[4] transactions[RW, RR, SR], 4096 iterations + echo %T%: Threads[2] Endpoints[4] transactions[RW, RR, SR], 4096 iterations %DT% -T T -P -t 2 -w 4 -i 4096 -s %S% -D %D% client RW 4096 1 server RW 2048 4 server RR 1024 2 client RR 2048 2 client SR 1024 3 -f server SR 256 3 -f + echo Finished %T%: Threads[2] Endpoints[4] 
transactions[RW, RR, SR], 4096 iterations exit /B ) if "%T%" == "transmete" ( - echo Multiple: threads[4] endpoints[4] transactions[RW, RR, SR], 8192 iterations + echo %T%: Threads[4] Endpoints[4] transactions[RW, RR, SR], 8192 iterations %DT% -T T -P -t 2 -w 4 -i 8192 -s %S% -D %D% client RW 4096 1 server RW 2048 4 server RR 1024 2 client RR 2048 2 client SR 1024 3 -f server SR 256 3 -f + echo Finished %T%: Threads[4] Endpoints[4] transactions[RW, RR, SR], 8192 iterations + exit /B +) + +if "%T%" == "EPA" ( + FOR /L %%j IN (2,1,5) DO ( + FOR /L %%i IN (1,1,5) DO ( + echo %T%: Multi: Threads[%%j] Endpoints[%%i] Send/Recv test - 4096 iterations, 3 8K segs + %DT% -T T -s %S% -D %D% -i 4096 -t %%j -w %%i client SR 8192 3 server SR 8192 3 + if ERRORLEVEL 1 exit /B + echo %T%: Multi: Threads[%%j] Endpoints[%%i] Send/Recv test - 4096 iterations, 3 8K segs + timeout /T 3 + ) + ) + exit /B +) + +if "%T%" == "EP" ( + set TH=4 + set EP=5 + echo %T%: Multi: Threads[!TH!] endpoints[!EP!] Send/Recv test - 4096 iterations, 3 8K segs + %DT% -T T -s %S% -D %D% -i 4096 -t !TH! -w !EP! client SR 8192 3 server SR 8192 3 + echo %T%: Multi: Threads[!TH!] endpoints[!EP!] Send/Recv test - 4096 iterations, 3 8K segs exit /B ) if "%T%" == "threads" ( - echo Multi Threaded[6] Send/Recv test - 4096 iterations, 3 8K segs + echo %T%: Multi Threaded[6] Send/Recv test - 4096 iterations, 3 8K segs %DT% -T T -s %S% -D %D% -i 4096 -t 6 -w 1 client SR 8192 3 server SR 8192 3 + echo Finished %T%: Multi Threaded[6] Send/Recv test - 4096 iterations, 3 8K segs exit /B ) if "%T%" == "threadsm" ( - echo Multi: Threads[6] endpoints[6] Send/Recv test - 4096 iterations, 3 8K segs + set TH=4 + set EP=5 + echo %T%: Multi: Threads[!TH!] endpoints[!EP!] Send/Recv test - 4096 iterations, 3 8K segs %DT% -T T -s %S% -D %D% -i 4096 -t 6 -w 6 client SR 8192 3 server SR 8192 3 + echo Finished %T%: Multi: Threads[!TH!] endpoints[!EP!] 
Send/Recv test - 4096 iterations, 3 8K segs exit /B ) @@ -137,18 +180,23 @@ if "%T%" == "perf" ( ) if "%T%" == "rdma-read" ( + echo %T% 4 32K segs %DT% -T P -s %S% -D %D% -i 4096 RR 32768 4 + echo Finished %T% 4 32K segs exit /B ) if "%T%" == "rdma-write" ( + echo %T% 4 32K segs %DT% -T P -s %S% -D %D% -i 4096 RW 32768 4 + echo Finished %T% 4 32K segs exit /B ) if "%T%" == "bw" ( - echo bandwidth 65K msgs + echo bandwidth 4096 iterations of 2 65K mesgs %DT% -T P -s %S% -D %D% -i 4096 -p 16 -m p RW 65536 2 + echo Finished bandwidth 4096 iterations of 2 65K mesgs exit /B ) @@ -190,43 +238,61 @@ if "%T%" == "regression" ( echo %T% testing in %L% Loops REM rdma-write, read, perf FOR /L %%i IN (1,1,%L%) DO ( + call %0 %1 trans if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 perf if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 threads if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 threadsm if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 transm if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 transt if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 transme if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 transmet if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 transmete if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 rdma-write if ERRORLEVEL 1 exit /B timeout /T 3 + echo in Loop %%i call %0 %1 rdma-read if ERRORLEVEL 1 exit /B + echo in Loop %%i call %0 %1 bw if ERRORLEVEL 1 exit /B - echo %%i %T% loops completed. + + echo in Loop %%i + call %0 %1 EP + if ERRORLEVEL 1 exit /B + + echo Finished loop %%i, %T% loops completed. 
+ timeout /T 4 ) exit /B ) @@ -245,36 +311,47 @@ if "%T%" == "interop" ( echo %T% testing in %L% Loops REM test units from Nov-'07 OFA interop event FOR /L %%i IN (0,1,1) DO ( + echo %DT% -T T -s %S% -D %D% -i 4096 -t 1 -w 1 -R BE client SR 256 1 server SR 256 1 %DT% -T T -s %S% -D %D% -i 4096 -t 1 -w 1 -R BE client SR 256 1 server SR 256 1 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 1024 3 -f server SR 1536 2 -f %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 1024 3 -f server SR 1536 2 -f if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 1024 1 server SR 1024 1 %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 1024 1 server SR 1024 1 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 10 -V -P -R BE client SR 1024 3 server SR 1536 2 %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 10 -V -P -R BE client SR 1024 3 server SR 1536 2 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 256 1 server RW 4096 1 server SR 256 1 %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 256 1 server RW 4096 1 server SR 256 1 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 256 1 server RR 4096 1 server SR 256 1 %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 1 -V -P -R BE client SR 256 1 server RR 4096 1 server SR 256 1 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 4 -w 8 -V -P -R BE client SR 256 1 server RR 4096 1 server SR 256 1 client SR 256 1 server RR 4096 1 server SR 256 1 %DT% -T T -s %S% -D %D% -i 100 -t 4 -w 8 -V -P -R BE client SR 256 1 server RR 4096 1 server SR 256 1 client SR 256 1 server RR 4096 1 server SR 256 1 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T P -s %S% -D %D% -i 1024 -p 64 -m p RW 8192 2 %DT% -T P -s %S% -D %D% -i 1024 -p 64 -m p RW 
8192 2 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T P -s %S% -D %D% -i 1024 -p 64 -m p RW 4096 2 %DT% -T P -s %S% -D %D% -i 1024 -p 64 -m p RW 4096 2 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T P -s %S% -D %D% -i 1024 -p 64 -m p RW 4096 1 %DT% -T P -s %S% -D %D% -i 1024 -p 64 -m p RW 4096 1 if ERRORLEVEL 1 exit /B timeout /T 3 + echo %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 10 -V -P -R BE client SR 1024 3 server SR 1536 2 %DT% -T T -s %S% -D %D% -i 100 -t 1 -w 10 -V -P -R BE client SR 1024 3 server SR 1536 2 if ERRORLEVEL 1 exit /B echo %%i %T% loops completed. @@ -290,7 +367,9 @@ if "%T%" == "stop" ( echo usage: dt-cli hostname [testname [-D]] echo where testname echo stop - request DAPLtest server to exit. -echo conn - simple connection with limited dater transfer +echo conn - simple connection test with limited data transfer +echo EP - Multiple EndPoints(7) and Threads(5) Transactions +echo EPA - Increment EndPoints[1..5] while increasing threads[1-5] echo trans - single transaction test echo transm - transaction test: multiple transactions [RW SND, RDMA] echo transt - transaction test: multi-threaded diff --git a/test/dapltest/test/dapl_client.c b/test/dapltest/test/dapl_client.c index 732c4c6..eaefc3d 100644 --- a/test/dapltest/test/dapl_client.c +++ b/test/dapltest/test/dapl_client.c @@ -41,7 +41,7 @@ * Client control routine Connect to the server, send the command across. 
* Then start the client-side of the test - creating threads as needed */ -void +DAT_RETURN DT_cs_Client (Params_t * params_ptr, char *dapl_name, char *server_name, @@ -69,7 +69,7 @@ DT_cs_Client (Params_t * params_ptr, DAT_DTO_COMPLETION_EVENT_DATA dto_stat; DAT_EVENT_NUMBER event_num; unsigned char * buffp; - DAT_RETURN ret; + DAT_RETURN ret, rc; DT_Tdep_Print_Head *phead; phead = params_ptr->phead; @@ -83,7 +83,7 @@ DT_cs_Client (Params_t * params_ptr, if (!pt_ptr) { DT_Tdep_PT_Printf (phead, "%s: no memory for Per_Test_Data\n", module); - return; + return DAT_INSUFFICIENT_RESOURCES; } DT_MemListInit (pt_ptr); /* init MemlistLock and memListHead */ DT_Thread_Init (pt_ptr); /* init ThreadLock and threadcount */ @@ -199,6 +199,7 @@ DT_cs_Client (Params_t * params_ptr, if (!DT_query (pt_ptr, ia_handle, ep_handle) || !DT_check_params (pt_ptr, module)) { + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } @@ -218,6 +219,7 @@ DT_cs_Client (Params_t * params_ptr, DT_Tdep_PT_Printf (phead, "%s: no memory for command buffer pool.\n", module); + ret = DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,DAT_RESOURCE_MEMORY); goto client_exit; } @@ -238,11 +240,12 @@ retry_repost: ep_handle, bpool, 0, - DT_Bpool_GetBuffSize (bpool, 0))) + DT_Bpool_GetBuffSize (bpool, 0)) ) { DT_Tdep_PT_Printf (phead, "%s: cannot post Server_Info recv buffer.\n", module); + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } @@ -284,11 +287,9 @@ retry: dat_ep_reset (ep_handle); do { - - ret = DT_Tdep_evd_dequeue ( recv_evd_hdl, - &event); + rc = DT_Tdep_evd_dequeue ( recv_evd_hdl, &event); drained++; - } while (DAT_GET_TYPE(ret) != DAT_QUEUE_EMPTY); + } while (DAT_GET_TYPE(rc) != DAT_QUEUE_EMPTY); if (drained > 1 && retry_cnt < MAX_CONN_RETRY) { @@ -300,6 +301,7 @@ retry: goto retry; } } + ret = DAT_INSUFFICIENT_RESOURCES; DT_Tdep_PT_Printf (phead, "%s: bad connection event\n", module); goto client_exit; } @@ -328,9 +330,10 @@ retry: ep_handle, bpool, 1, - DT_Bpool_GetBuffSize (bpool, 1))) + 
DT_Bpool_GetBuffSize (bpool, 1)) ) { DT_Tdep_PT_Printf (phead, "%s: cannot send Client_Info\n", module); + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } /* reap the send and verify it */ @@ -344,8 +347,9 @@ retry: ep_handle, DT_Bpool_GetBuffSize (bpool, 1), dto_cookie, - "Client_Info_Send")) + "Client_Info_Send") ) { + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } @@ -386,6 +390,7 @@ retry: default: { DT_Tdep_PT_Printf (phead, "Unknown Test Type\n"); + ret = DAT_INVALID_PARAMETER; goto client_exit; } } @@ -395,9 +400,10 @@ retry: ep_handle, bpool, 2, - DT_Bpool_GetBuffSize (bpool, 2))) + DT_Bpool_GetBuffSize (bpool, 2)) ) { DT_Tdep_PT_Printf (phead, "%s: cannot send Command\n", module); + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } /* reap the send and verify it */ @@ -412,8 +418,9 @@ retry: ep_handle, DT_Bpool_GetBuffSize (bpool, 2), dto_cookie, - "Client_Cmd_Send")) + "Client_Cmd_Send") ) { + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } @@ -427,8 +434,9 @@ retry: ep_handle, DT_Bpool_GetBuffSize (bpool, 0), dto_cookie, - "Server_Info_Recv")) + "Server_Info_Recv") ) { + ret = DAT_INSUFFICIENT_RESOURCES; goto client_exit; } @@ -447,6 +455,7 @@ retry: module, pt_ptr->Server_Info.dapltest_version, DAPLTEST_VERSION); + ret = DAT_MODEL_NOT_SUPPORTED; goto client_exit; } DT_Tdep_PT_Debug (1,(phead, "%s: Version OK!\n", module)); @@ -467,7 +476,7 @@ retry: { DT_Transaction_Cmd_PT_Print (phead, Transaction_Cmd); } - DT_Transaction_Test_Client (pt_ptr, + ret = DT_Transaction_Test_Client (pt_ptr, ia_handle, server_netaddr); break; @@ -476,6 +485,7 @@ retry: case QUIT_TEST: { DT_Quit_Cmd_PT_Print (phead, Quit_Cmd); + ret = DAT_SUCCESS; break; } @@ -486,7 +496,7 @@ retry: DT_Performance_Cmd_PT_Print (phead, Performance_Cmd); } - DT_Performance_Test_Client (params_ptr, + ret = DT_Performance_Test_Client (params_ptr, pt_ptr, ia_handle, server_netaddr); @@ -496,6 +506,7 @@ retry: 
/********************************************************************* * Done - clean up and go home + * ret == function DAT_RETURN return code */ client_exit: DT_Tdep_PT_Debug (1,(phead, "%s: Cleaning Up ...\n", module)); @@ -507,13 +518,13 @@ client_exit: * graceful attempt might fail because we got here due to * some error above, so we may as well try harder. */ - ret = dat_ep_disconnect (ep_handle, DAT_CLOSE_ABRUPT_FLAG); - if (ret != DAT_SUCCESS) + rc = dat_ep_disconnect (ep_handle, DAT_CLOSE_ABRUPT_FLAG); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_disconnect (abrupt) error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); } else if (did_connect && !DT_disco_event_wait (phead, conn_evd_hdl, NULL)) @@ -535,17 +546,17 @@ client_exit: */ do { - ret = DT_Tdep_evd_dequeue ( recv_evd_hdl, + rc = DT_Tdep_evd_dequeue ( recv_evd_hdl, &event); - } while (ret == DAT_SUCCESS); + } while (rc == DAT_SUCCESS); - ret = dat_ep_free (ep_handle); - if (ret != DAT_SUCCESS) + rc = dat_ep_free (ep_handle); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_ep_free error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); /* keep going */ } } @@ -553,37 +564,37 @@ client_exit: /* Free the 3 EVDs */ if (conn_evd_hdl) { - ret = DT_Tdep_evd_free (conn_evd_hdl); - if (ret != DAT_SUCCESS) + rc = DT_Tdep_evd_free (conn_evd_hdl); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_evd_free (conn) error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); /* keep going */ } } if (reqt_evd_hdl) { - ret = DT_Tdep_evd_free (reqt_evd_hdl); - if (ret != DAT_SUCCESS) + rc = DT_Tdep_evd_free (reqt_evd_hdl); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_evd_free (reqt) error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); /* keep going */ } } if (recv_evd_hdl) { - ret = DT_Tdep_evd_free (recv_evd_hdl); - if (ret != DAT_SUCCESS) + rc = DT_Tdep_evd_free (recv_evd_hdl); + if (rc != 
DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_evd_free (recv) error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); /* keep going */ } } @@ -591,13 +602,13 @@ client_exit: /* Free the PZ */ if (pz_handle) { - ret = dat_pz_free (pz_handle); - if (ret != DAT_SUCCESS) + rc = dat_pz_free (pz_handle); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_pz_free error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); /* keep going */ } } @@ -606,20 +617,20 @@ client_exit: if (ia_handle) { /* dat_ia_close cleans up async evd handle, too */ - ret = dat_ia_close (ia_handle, DAT_CLOSE_GRACEFUL_FLAG); - if (ret != DAT_SUCCESS) + rc = dat_ia_close (ia_handle, DAT_CLOSE_GRACEFUL_FLAG); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_ia_close (graceful) error: %s\n", module, - DT_RetToString (ret)); - ret = dat_ia_close (ia_handle, DAT_CLOSE_ABRUPT_FLAG); - if (ret != DAT_SUCCESS) + DT_RetToString (rc)); + rc = dat_ia_close (ia_handle, DAT_CLOSE_ABRUPT_FLAG); + if (rc != DAT_SUCCESS) { DT_Tdep_PT_Printf (phead, "%s: dat_ia_close (abrupt) error: %s\n", module, - DT_RetToString (ret)); + DT_RetToString (rc)); } /* keep going */ } @@ -638,4 +649,5 @@ client_exit: DT_Tdep_PT_Printf (phead, "%s: ========== End of Work -- Client Exiting\n", module); + return ret; } diff --git a/test/dapltest/test/dapl_execute.c b/test/dapltest/test/dapl_execute.c index 77b61f2..7987f3f 100644 --- a/test/dapltest/test/dapl_execute.c +++ b/test/dapltest/test/dapl_execute.c @@ -35,9 +35,10 @@ #include "dapl_quit_cmd.h" #include "dapl_limit_cmd.h" -void +DAT_RETURN DT_Execute_Test (Params_t *params_ptr) { + DAT_RETURN rc = DAT_SUCCESS; Transaction_Cmd_t *Transaction_Cmd; Quit_Cmd_t *Quit_Cmd; Limit_Cmd_t *Limit_Cmd; @@ -58,45 +59,44 @@ DT_Execute_Test (Params_t *params_ptr) case TRANSACTION_TEST: { Transaction_Cmd = &params_ptr->u.Transaction_Cmd; - DT_cs_Client ( params_ptr, + rc = DT_cs_Client ( params_ptr, Transaction_Cmd->dapl_name,
Transaction_Cmd->server_name, Transaction_Cmd->num_threads * - Transaction_Cmd->eps_per_thread); + Transaction_Cmd->eps_per_thread ); break; } case QUIT_TEST: { Quit_Cmd = &params_ptr->u.Quit_Cmd; - DT_cs_Client ( params_ptr, + (void) DT_cs_Client ( params_ptr, Quit_Cmd->device_name, Quit_Cmd->server_name, - 0); + 0 ); break; } case LIMIT_TEST: { Limit_Cmd = &params_ptr->u.Limit_Cmd; - DT_cs_Limit (params_ptr, - Limit_Cmd); + rc = DT_cs_Limit (params_ptr, Limit_Cmd); break; } case PERFORMANCE_TEST: { Performance_Cmd = &params_ptr->u.Performance_Cmd; - DT_cs_Client ( params_ptr, + rc = DT_cs_Client ( params_ptr, Performance_Cmd->dapl_name, Performance_Cmd->server_name, - 1); + 1 ); break; } case FFT_TEST: { FFT_Cmd = &params_ptr->u.FFT_Cmd; - DT_cs_FFT (params_ptr, - FFT_Cmd); + rc = DT_cs_FFT (params_ptr, FFT_Cmd); break; } } + return rc; } diff --git a/test/dapltest/test/dapl_fft_test.c b/test/dapltest/test/dapl_fft_test.c index 05c782a..0a58e13 100644 --- a/test/dapltest/test/dapl_fft_test.c +++ b/test/dapltest/test/dapl_fft_test.c @@ -30,10 +30,11 @@ #include "dapl_proto.h" -void +DAT_RETURN DT_cs_FFT (Params_t *params_ptr, FFT_Cmd_t * cmd) { DT_Tdep_Print_Head *phead; + DAT_RETURN rc = DAT_SUCCESS; phead = params_ptr->phead; @@ -85,8 +86,10 @@ DT_cs_FFT (Params_t *params_ptr, FFT_Cmd_t * cmd) default: { DT_Tdep_PT_Printf (phead, "don't know this test\n"); + rc = DAT_INVALID_PARAMETER; break; } } + return rc; } diff --git a/test/dapltest/test/dapl_limit.c b/test/dapltest/test/dapl_limit.c index 133b3e0..78e5f14 100644 --- a/test/dapltest/test/dapl_limit.c +++ b/test/dapltest/test/dapl_limit.c @@ -36,13 +36,13 @@ static bool more_handles (DT_Tdep_Print_Head *phead, - void **old_ptrptr, /* pointer to current pointer */ + DAT_HANDLE **old_ptrptr, /* pointer to current pointer */ unsigned int *old_count, /* number pointed to */ unsigned int size) /* size of one datum */ { unsigned int count = *old_count; - void *old_handles = *old_ptrptr; - void *handle_tmp = DT_Mdep_Malloc (count
* 2 * size); + DAT_HANDLE *old_handles = *old_ptrptr; + DAT_HANDLE *handle_tmp = DT_Mdep_Malloc (count * 2 * size); if (!handle_tmp) { @@ -172,8 +172,8 @@ limit_test ( DT_Tdep_Print_Head *phead, } OneOpen; unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc (count * sizeof(OneOpen)); - OneOpen *hdlptr = (OneOpen *)hptr; + OneOpen *hdlptr = (OneOpen *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); /* IA Exhaustion test loop */ if (hdlptr) @@ -186,13 +186,14 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: IAs opened: %d\n", module, w); retval = true; break; } - hdlptr = (OneOpen *)hptr; /* Specify that we want to get back an async EVD. */ hdlptr[w].ia_async_handle = DAT_HANDLE_NULL; ret = dat_ia_open (cmd->device_name, @@ -270,8 +271,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many PZs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc (count * sizeof(DAT_PZ_HANDLE)); - DAT_PZ_HANDLE *hdlptr = (DAT_PZ_HANDLE *)hptr; + DAT_PZ_HANDLE *hdlptr = (DAT_PZ_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); /* PZ Exhaustion test loop */ if (hdlptr) @@ -286,13 +287,14 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: PZs created: %d\n", module, w); retval = true; break; } - hdlptr = (DAT_PZ_HANDLE *)hptr; ret = dat_pz_create (hdl_sets[w % cmd->width].ia_handle, &hdlptr[w]); if (ret != DAT_SUCCESS) @@ -367,8 +369,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many CNOs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc (count * sizeof(DAT_CNO_HANDLE)); - DAT_CNO_HANDLE *hdlptr = 
(DAT_CNO_HANDLE *)hptr; + DAT_CNO_HANDLE *hdlptr = (DAT_CNO_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); /* CNO Exhaustion test loop */ if (hdlptr) @@ -383,13 +385,14 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles(phead, &hptr, &count, sizeof (*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: CNOs created: %d\n", module, w); retval = true; break; } - hdlptr = (DAT_CNO_HANDLE *)hptr; ret = dat_cno_create (hdl_sets[w % cmd->width].ia_handle, DAT_OS_WAIT_PROXY_AGENT_NULL, &hdlptr[w]); @@ -487,9 +490,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many EVDs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc(count * sizeof(DAT_EVD_HANDLE)); - DAT_EVD_HANDLE *hdlptr = (DAT_EVD_HANDLE *)hptr; - + DAT_EVD_HANDLE *hdlptr = (DAT_EVD_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); DAT_EVD_FLAGS flags = ( DAT_EVD_DTO_FLAG | DAT_EVD_RMR_BIND_FLAG | DAT_EVD_CR_FLAG); @@ -522,13 +524,14 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: EVDs created: %d\n", module, w); retval = true; break; } - hdlptr = (DAT_EVD_HANDLE *)hptr; ret = DT_Tdep_evd_create (hdl_sets[w % cmd->width].ia_handle, DFLT_QLEN, hdl_sets[w % cmd->width].cno_handle, @@ -606,8 +609,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many EPs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc(count * sizeof(DAT_EP_HANDLE)); - DAT_EP_HANDLE *hdlptr = (DAT_EP_HANDLE *)hptr; + DAT_EP_HANDLE *hdlptr = (DAT_EP_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); /* EP Exhaustion test loop */ if (hdlptr) @@ -620,13 +623,14 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - 
&& !more_handles(phead, &hptr, &count, sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: EPs created: %d\n", module, w); retval = true; break; } - hdlptr = (DAT_EP_HANDLE *)hptr; ret = dat_ep_create (hdl_sets[w % cmd->width].ia_handle, hdl_sets[w % cmd->width].pz_handle, hdl_sets[w % cmd->width].evd_handle, @@ -676,10 +680,10 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many RSPs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc(count * sizeof (DAT_RSP_HANDLE)); - DAT_RSP_HANDLE *hdlptr = (DAT_RSP_HANDLE *)hptr; - void *eptr = DT_Mdep_Malloc(count * sizeof (DAT_EP_HANDLE)); - DAT_EP_HANDLE *epptr = (DAT_EP_HANDLE *)eptr; + DAT_RSP_HANDLE *hdlptr = (DAT_RSP_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); + DAT_EP_HANDLE *epptr = (DAT_EP_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*epptr)); /* RSP Exhaustion test loop */ if (hdlptr) @@ -696,21 +700,23 @@ limit_test ( DT_Tdep_Print_Head *phead, unsigned int count1 = count; unsigned int count2 = count; - if (!more_handles(phead, &hptr, &count1, sizeof(*hdlptr))) + if (!more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count1, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: RSPs created: %d\n", module, w); retval = true; break; } - hdlptr = (DAT_RSP_HANDLE *)hptr; - - if (!more_handles (phead, &eptr, &count2, sizeof(*epptr))) + if (!more_handles (phead, (DAT_HANDLE **) &epptr, + &count2, + sizeof (*epptr))) { DT_Tdep_PT_Printf (phead, "%s: RSPs created: %d\n", module, w); retval = true; break; } - epptr = (DAT_EP_HANDLE *)eptr; + if (count1 != count2) { DT_Tdep_PT_Printf (phead, "%s: Mismatch in allocation of handle arrays at point %d\n", @@ -810,8 +816,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many PSPs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc (count * sizeof (DAT_PSP_HANDLE)); - DAT_PSP_HANDLE *hdlptr = (DAT_PSP_HANDLE *)hptr; + 
DAT_PSP_HANDLE *hdlptr = (DAT_PSP_HANDLE *) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); /* PSP Exhaustion test loop */ if (hdlptr) @@ -824,13 +830,14 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: PSPs created: %d\n", module, w); retval = true; break; } - hdlptr = (DAT_PSP_HANDLE *)hptr; ret = dat_psp_create (hdl_sets[w % cmd->width].ia_handle, CONN_QUAL0 + w, hdl_sets[w % cmd->width].evd_handle, @@ -935,8 +942,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * See how many LMRs we can create */ unsigned int count = START_COUNT; - void *hptr = DT_Mdep_Malloc (count * sizeof(Bpool*)); - Bpool **hdlptr = (Bpool **)hptr; + Bpool **hdlptr = (Bpool **) + DT_Mdep_Malloc (count * sizeof (*hdlptr)); /* LMR Exhaustion test loop */ if (hdlptr) @@ -949,7 +956,9 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, &hptr, &count, sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: no memory for LMR handles\n", module); @@ -957,7 +966,6 @@ limit_test ( DT_Tdep_Print_Head *phead, retval = true; break; } - hdlptr = (Bpool **)hptr; /* * Let BpoolAlloc do the hard work; this means that * we're testing unique memory registrations rather @@ -1010,9 +1018,8 @@ limit_test ( DT_Tdep_Print_Head *phead, * but that should be OK. 
*/ unsigned int count = START_COUNT; - void *hptr = - DT_Mdep_Malloc(count * cmd->width * sizeof(DAT_LMR_TRIPLET)); - DAT_LMR_TRIPLET *hdlptr = (DAT_LMR_TRIPLET *)hptr; + DAT_LMR_TRIPLET *hdlptr = (DAT_LMR_TRIPLET *) + DT_Mdep_Malloc (count * cmd->width * sizeof (*hdlptr)); /* Recv-Post Exhaustion test loop */ if (hdlptr) @@ -1026,8 +1033,9 @@ limit_test ( DT_Tdep_Print_Head *phead, { DT_Mdep_Schedule(); if (w == count - && !more_handles (phead, &hptr, &count, - cmd->width * sizeof(*hdlptr))) + && !more_handles (phead, (DAT_HANDLE **) &hdlptr, + &count, + cmd->width * sizeof (*hdlptr))) { DT_Tdep_PT_Printf (phead, "%s: no memory for IOVs \n", module); @@ -1039,7 +1047,6 @@ limit_test ( DT_Tdep_Print_Head *phead, done = retval = true; break; } - hdlptr = (DAT_LMR_TRIPLET *)hptr; for (i = 0; i < cmd->width; i++) { DAT_LMR_TRIPLET *iovp = &hdlptr[w * cmd->width + i]; @@ -1344,7 +1351,7 @@ clean_up_now: /********************************************************************* * Framework to run through all of the limit tests */ -void +DAT_RETURN DT_cs_Limit (Params_t *params, Limit_Cmd_t * cmd) { DT_Tdep_Print_Head *phead; @@ -1537,11 +1544,11 @@ DT_cs_Limit (Params_t *params, Limit_Cmd_t * cmd) /* More tests TBS ... 
*/ - return; + return DAT_SUCCESS; error: DT_Tdep_PT_Printf (phead, "error occurs, can not continue with limit test\n"); DT_Tdep_PT_Printf (phead, "%s\n", star); - return; + return DAT_INSUFFICIENT_RESOURCES; } diff --git a/test/dapltest/test/dapl_performance_client.c b/test/dapltest/test/dapl_performance_client.c index d1dec89..42e8257 100644 --- a/test/dapltest/test/dapl_performance_client.c +++ b/test/dapltest/test/dapl_performance_client.c @@ -33,7 +33,7 @@ #define MAX_CONN_RETRY 8 /****************************************************************************/ -void +DAT_RETURN DT_Performance_Test_Client ( Params_t *params_ptr, Per_Test_Data_t *pt_ptr, @@ -43,6 +43,7 @@ DT_Performance_Test_Client ( Performance_Test_t *test_ptr = NULL; int connected = 1; DT_Tdep_Print_Head *phead; + DAT_RETURN rc; phead = pt_ptr->Params.phead; @@ -85,6 +86,8 @@ DT_Performance_Test_Client ( #endif DT_Tdep_PT_Debug (1,(phead,"Client: Finished performance test\n")); + + return (connected ? DAT_SUCCESS : DAT_INSUFFICIENT_RESOURCES); } diff --git a/test/dapltest/test/dapl_transaction_test.c b/test/dapltest/test/dapl_transaction_test.c index 82ee6f9..4abda1e 100644 --- a/test/dapltest/test/dapl_transaction_test.c +++ b/test/dapltest/test/dapl_transaction_test.c @@ -48,7 +48,7 @@ #define MAX_CONN_RETRY 8 /****************************************************************************/ -void +DAT_RETURN DT_Transaction_Test_Client (Per_Test_Data_t * pt_ptr, DAT_IA_HANDLE ia_handle, DAT_IA_ADDRESS_PTR remote_ia_addr) @@ -56,6 +56,7 @@ DT_Transaction_Test_Client (Per_Test_Data_t * pt_ptr, Transaction_Cmd_t *cmd = &pt_ptr->Params.u.Transaction_Cmd; unsigned int i; DT_Tdep_Print_Head *phead; + DAT_RETURN rc = DAT_SUCCESS; phead = pt_ptr->Params.phead; @@ -77,6 +78,7 @@ DT_Transaction_Test_Client (Per_Test_Data_t * pt_ptr, remote_ia_addr)) { DT_Tdep_PT_Printf (phead, "Client: Cannot Create Test!\n"); + rc = DAT_INSUFFICIENT_RESOURCES; break; } @@ -97,6 +99,7 @@ DT_Transaction_Test_Client 
(Per_Test_Data_t * pt_ptr, &pt_ptr->Client_Stats, cmd->num_threads, cmd->eps_per_thread); + return rc; } @@ -1108,9 +1111,6 @@ retry: */ success = DT_Transaction_Run (phead, test_ptr); - /* no sync at end of transaction run, wait before cleanup */ - sleep(1); - /* * Now clean up and go home */ @@ -1202,21 +1202,8 @@ test_failure: if ( test_ptr->ep_context[j].ep_handle == ep_handle ) { test_ptr->ep_context[j].ep_handle = NULL; - break; - } } - if (j == test_ptr->cmd->eps_per_thread) - { - /* invalid ep_handle returned */ - DT_Tdep_PT_Printf(phead, - "Test[" F64x "]: disconnect" - " event with unknown EP=%p " - " possible duplicate\n", - test_ptr->base_port, - ep_handle); - ep_handle = NULL; } - } } else /* !success - QP may be in error state */ diff --git a/test/dapltest/udapl/udapl_tdep.c b/test/dapltest/udapl/udapl_tdep.c index 9b3c93f..da1269b 100644 --- a/test/dapltest/udapl/udapl_tdep.c +++ b/test/dapltest/udapl/udapl_tdep.c @@ -42,10 +42,10 @@ DT_Tdep_End (void) DT_Mdep_LockDestroy (&g_PerfTestLock); /* For kDAPL, this is done in kdapl_module.c */ } -void +DAT_RETURN DT_Tdep_Execute_Test (Params_t *params_ptr) { - DT_Execute_Test (params_ptr); + return DT_Execute_Test (params_ptr); } DAT_RETURN diff --git a/test/dapltest/windows/SOURCES b/test/dapltest/windows/SOURCES new file mode 100644 index 0000000..31570c3 --- /dev/null +++ b/test/dapltest/windows/SOURCES @@ -0,0 +1,33 @@ +!if $(FREEBUILD) +TARGETNAME = dapl2test +!else +TARGETNAME = dapl2testd +!endif + +TARGETPATH = ..\..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) +TARGETTYPE = PROGRAM +UMTYPE = console +USE_MSVCRT = 1 + +SOURCES = \ + dapltest.rc \ + ..\dt_cmd.c \ + ..\dt_test.c \ + ..\dt_common.c \ + ..\dt_udapl.c \ + ..\dt_mdep.c + +INCLUDES=..\include;..\mdep\windows;..\..\..\dat\include;%DDK_INC_PATH% + +RCOPTIONS=/I..\..\..\..\..\inc; + +!if $(FREEBUILD) +DATLIB = dat2.lib +!else +DATLIB = dat2d.lib +!endif + +TARGETLIBS = $(TARGETPATH)\*\$(DATLIB) $(SDK_LIB_PATH)\ws2_32.lib + +# XXX do this ASAP - 
MSC_WARNING_LEVEL= /W3 +MSC_WARNING_LEVEL= /W1 diff --git a/test/dapltest/windows/dapltest.rc b/test/dapltest/windows/dapltest.rc new file mode 100644 index 0000000..f41ac8b --- /dev/null +++ b/test/dapltest/windows/dapltest.rc @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2007 Intel Corporation. All rights reserved. + * + * This software is available to you under the OpenIB.org BSD license + * below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + + +#include + +#define VER_FILETYPE VFT_APP +#define VER_FILESUBTYPE VFT2_UNKNOWN + +#if DBG +#define VER_FILEDESCRIPTION_STR "DAPL/DAT[2.0] test Application (Debug)" +#define VER_INTERNALNAME_STR "dapl2testd.exe" +#define VER_ORIGINALFILENAME_STR "dapl2testd.exe" + +#else +#define VER_FILEDESCRIPTION_STR "DAPL/DAT[2.0] test Application" +#define VER_INTERNALNAME_STR "dapl2test.exe" +#define VER_ORIGINALFILENAME_STR "dapl2test.exe" + +#endif + +#include diff --git a/test/dapltest/windows/makefile b/test/dapltest/windows/makefile new file mode 100644 index 0000000..d493855 --- /dev/null +++ b/test/dapltest/windows/makefile @@ -0,0 +1,7 @@ +# +# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source +# file to this component. This file merely indirects to the real make file +# that is shared by all the driver components of the OpenIB Windows project. +# + +!INCLUDE ..\..\..\..\..\inc\openib.def From sean.hefty at intel.com Fri Jan 30 10:56:26 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 30 Jan 2009 10:56:26 -0800 Subject: [ofa-general] [PATCH 3/5] [DAPL] dapl/test: merge WinOF changes to dtest In-Reply-To: References: Message-ID: <61CBB4D952A942D78C260EF8C5E94DA1@amr.corp.intel.com> Changes to dtest to support building on Windows. 
Signed-off-by: Sean Hefty --- test/dirs | 2 + test/dtest/SOURCES | 34 ------------------------- test/dtest/dirs | 1 + test/dtest/dtest.c | 4 +-- test/dtest/dtest.rc | 48 ----------------------------------- test/dtest/dtestx.c | 2 + test/dtest/makefile.am | 7 ----- test/dtest/windows/dirs | 1 + test/dtest/windows/dtest/SOURCES | 33 ++++++++++++++++++++++++ test/dtest/windows/dtest/dtest.c | 2 + test/dtest/windows/dtest/dtest.rc | 48 +++++++++++++++++++++++++++++++++++ test/dtest/windows/dtest/makefile | 7 +++++ test/dtest/windows/dtestx/SOURCES | 28 ++++++++++++++++++++ test/dtest/windows/dtestx/dtestx.c | 1 + test/dtest/windows/dtestx/dtestx.rc | 48 +++++++++++++++++++++++++++++++++++ test/dtest/windows/dtestx/makefile | 7 +++++ 16 files changed, 180 insertions(+), 93 deletions(-) diff --git a/test/dirs b/test/dirs index cf0ef2b..b6232e9 100644 --- a/test/dirs +++ b/test/dirs @@ -1 +1 @@ -DIRS=dapltest dtest dtestx +DIRS = dapltest dtest diff --git a/test/dtest/SOURCES b/test/dtest/SOURCES deleted file mode 100644 index 676e34c..0000000 --- a/test/dtest/SOURCES +++ /dev/null @@ -1,34 +0,0 @@ -!if $(FREEBUILD) -TARGETNAME=dtest2 -!else -TARGETNAME=dtest2d -!endif -TARGETPATH=..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) -TARGETTYPE=PROGRAM -UMTYPE=console -USE_MSVCRT=1 - -SOURCES=dtest.rc \ - dtest.c \ - getopt.c - -INCLUDES=.;..\..\dat\include;\ - ../../../../inc;..\..\..\..\inc\user;\ - $(SDK_INC_PATH); - -RCOPTIONS=/I..\..\..\..\inc; - -# Set defines particular to the driver. 
-#USER_C_FLAGS=$(USER_C_FLAGS) /DDAT_EXTENSIONS - -!if $(FREEBUILD) -DATLIB=dat2.lib -!else -DATLIB=dat2d.lib -!endif - -TARGETLIBS=$(TARGETPATH)\*\$(DATLIB) $(SDK_LIB_PATH)\ws2_32.lib - -# XXX do this ASAP - MSC_WARNING_LEVEL= /W3 -MSC_WARNING_LEVEL= /W1 - diff --git a/test/dtest/dirs b/test/dtest/dirs new file mode 100644 index 0000000..f3b2bb0 --- /dev/null +++ b/test/dtest/dirs @@ -0,0 +1 @@ +dirs = windows diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index 37edd6c..6fe6a4f 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -46,10 +46,10 @@ #include #include #include -#include #include +#include "..\..\..\..\etc\user\getopt.c" -#define getpid _getpid +#define getpid() ((int)GetCurrentProcessId()) #define F64x "%I64x" #ifdef DBG diff --git a/test/dtest/dtest.rc b/test/dtest/dtest.rc deleted file mode 100644 index 9253096..0000000 --- a/test/dtest/dtest.rc +++ /dev/null @@ -1,48 +0,0 @@ -/* - * Copyright (c) 2007 Intel Corporation. All rights reserved. - * - * This software is available to you under the OpenIB.org BSD license - * below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - * $Id$ - */ - - -#include - -#define VER_FILETYPE VFT_APP -#define VER_FILESUBTYPE VFT2_UNKNOWN - -#if DBG -#define VER_FILEDESCRIPTION_STR "Simple DAPL/DAT svr/cli test Application (Debug)" -#define VER_INTERNALNAME_STR "dtest2d.exe" -#define VER_ORIGINALFILENAME_STR "dtest2d.exe" -#else -#define VER_FILEDESCRIPTION_STR "Simple DAPL/DAT svr/cli test Application" -#define VER_INTERNALNAME_STR "dtest2.exe" -#define VER_ORIGINALFILENAME_STR "dtest2.exe" -#endif - -#include diff --git a/test/dtest/dtestx.c b/test/dtest/dtestx.c index 24536a1..032cadd 100755 --- a/test/dtest/dtestx.c +++ b/test/dtest/dtestx.c @@ -41,9 +41,9 @@ #include #include #include +#include "..\..\..\..\etc\user\getopt.c" #define __BYTE_ORDER __LITTLE_ENDIAN -#define getpid _getpid #define F64x "%I64x" #define DAPL_PROVIDER "ibnic0v2" #else diff --git a/test/dtest/makefile.am b/test/dtest/makefile.am deleted file mode 100644 index e26e1c0..0000000 --- a/test/dtest/makefile.am +++ /dev/null @@ -1,7 +0,0 @@ -# -# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source -# file to this component. This file merely indirects to the real make file -# that is shared by all the driver components of the OpenIB Windows project. 
-# - -!INCLUDE ..\..\..\..\inc\openib.def diff --git a/test/dtest/windows/dirs b/test/dtest/windows/dirs new file mode 100644 index 0000000..cac5a54 --- /dev/null +++ b/test/dtest/windows/dirs @@ -0,0 +1 @@ +dirs = dtest dtestx diff --git a/test/dtest/windows/dtest/SOURCES b/test/dtest/windows/dtest/SOURCES new file mode 100644 index 0000000..58f5c66 --- /dev/null +++ b/test/dtest/windows/dtest/SOURCES @@ -0,0 +1,33 @@ +!if $(FREEBUILD) +TARGETNAME = dtest2 +!else +TARGETNAME = dtest2d +!endif + +TARGETPATH = ..\..\..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) +TARGETTYPE = PROGRAM +UMTYPE = console +USE_MSVCRT = 1 + +SOURCES = \ + dtest.rc \ + dtest.c + +INCLUDES = ..\..\..\..\dat\include;..\..\..\..\..\..\inc;\ + ..\..\..\..\..\..\inc\user; + +RCOPTIONS=/I..\..\..\..\..\..\inc; + +# Set defines particular to the driver. +#USER_C_FLAGS = $(USER_C_FLAGS) /DDAT_EXTENSIONS + +!if $(FREEBUILD) +DATLIB = dat2.lib +!else +DATLIB = dat2d.lib +!endif + +TARGETLIBS = $(TARGETPATH)\*\$(DATLIB) $(SDK_LIB_PATH)\ws2_32.lib + +# XXX do this ASAP - MSC_WARNING_LEVEL= /W3 +MSC_WARNING_LEVEL = /W1 diff --git a/test/dtest/windows/dtest/dtest.c b/test/dtest/windows/dtest/dtest.c new file mode 100644 index 0000000..ae69c20 --- /dev/null +++ b/test/dtest/windows/dtest/dtest.c @@ -0,0 +1,2 @@ +#include "..\..\dtest.c" + diff --git a/test/dtest/windows/dtest/dtest.rc b/test/dtest/windows/dtest/dtest.rc new file mode 100644 index 0000000..9253096 --- /dev/null +++ b/test/dtest/windows/dtest/dtest.rc @@ -0,0 +1,48 @@ +/* + * Copyright (c) 2007 Intel Corporation. All rights reserved. + * + * This software is available to you under the OpenIB.org BSD license + * below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + + +#include + +#define VER_FILETYPE VFT_APP +#define VER_FILESUBTYPE VFT2_UNKNOWN + +#if DBG +#define VER_FILEDESCRIPTION_STR "Simple DAPL/DAT svr/cli test Application (Debug)" +#define VER_INTERNALNAME_STR "dtest2d.exe" +#define VER_ORIGINALFILENAME_STR "dtest2d.exe" +#else +#define VER_FILEDESCRIPTION_STR "Simple DAPL/DAT svr/cli test Application" +#define VER_INTERNALNAME_STR "dtest2.exe" +#define VER_ORIGINALFILENAME_STR "dtest2.exe" +#endif + +#include diff --git a/test/dtest/windows/dtest/makefile b/test/dtest/windows/dtest/makefile new file mode 100644 index 0000000..5fb2ee8 --- /dev/null +++ b/test/dtest/windows/dtest/makefile @@ -0,0 +1,7 @@ +# +# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source +# file to this component. This file merely indirects to the real make file +# that is shared by all the driver components of the OpenIB Windows project. 
+# + +!INCLUDE ..\..\..\..\..\..\inc\openib.def diff --git a/test/dtest/windows/dtestx/SOURCES b/test/dtest/windows/dtestx/SOURCES new file mode 100644 index 0000000..ee63817 --- /dev/null +++ b/test/dtest/windows/dtestx/SOURCES @@ -0,0 +1,28 @@ +!if $(FREEBUILD) +TARGETNAME = dtestx +!else +TARGETNAME = dtestxd +!endif + +TARGETPATH = ..\..\..\..\..\..\bin\user\obj$(BUILD_ALT_DIR) +TARGETTYPE = PROGRAM +UMTYPE = console +USE_MSVCRT = 1 + +SOURCES = \ + dtestx.rc \ + dtestx.c + +INCLUDES = ..\..\..\..\dat\include;..\..\..\..\..\..\inc;\ + ..\..\..\..\..\..\inc\user; + +!if $(FREEBUILD) +DATLIB = dat2.lib +!else +DATLIB = dat2d.lib +!endif + +TARGETLIBS = $(TARGETPATH)\*\$(DATLIB) $(SDK_LIB_PATH)\ws2_32.lib + +# XXX do this ASAP - MSC_WARNING_LEVEL= /W3 +MSC_WARNING_LEVEL = /W1 diff --git a/test/dtest/windows/dtestx/dtestx.c b/test/dtest/windows/dtestx/dtestx.c new file mode 100644 index 0000000..4cc23c1 --- /dev/null +++ b/test/dtest/windows/dtestx/dtestx.c @@ -0,0 +1 @@ +#include "..\..\dtestx.c" diff --git a/test/dtest/windows/dtestx/dtestx.rc b/test/dtest/windows/dtestx/dtestx.rc new file mode 100644 index 0000000..98659e0 --- /dev/null +++ b/test/dtest/windows/dtestx/dtestx.rc @@ -0,0 +1,48 @@ +/* + * Copyright (c) 2007 Intel Corporation. All rights reserved. + * + * This software is available to you under the OpenIB.org BSD license + * below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + + +#include + +#define VER_FILETYPE VFT_APP +#define VER_FILESUBTYPE VFT2_UNKNOWN + +#if DBG +#define VER_FILEDESCRIPTION_STR "DAT/DAPL v2.0 extensions cli/svr test (Debug)" +#define VER_INTERNALNAME_STR "dtestxd.exe" +#define VER_ORIGINALFILENAME_STR "dtestxd.exe" +#else +#define VER_FILEDESCRIPTION_STR "DAT/DAPL v2.0 Extensions cli/svr test" +#define VER_INTERNALNAME_STR "dtestx.exe" +#define VER_ORIGINALFILENAME_STR "dtestx.exe" +#endif + +#include diff --git a/test/dtest/windows/dtestx/makefile b/test/dtest/windows/dtestx/makefile new file mode 100644 index 0000000..5fb2ee8 --- /dev/null +++ b/test/dtest/windows/dtestx/makefile @@ -0,0 +1,7 @@ +# +# DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source +# file to this component. This file merely indirects to the real make file +# that is shared by all the driver components of the OpenIB Windows project. +# + +!INCLUDE ..\..\..\..\..\..\inc\openib.def From sean.hefty at intel.com Fri Jan 30 10:58:06 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 30 Jan 2009 10:58:06 -0800 Subject: [ofa-general] [PATCH 4/5] [DAPL] dapl/ibal: update IBAL provider with latest WinOF codebase In-Reply-To: References: Message-ID: From: Stan Smith Merge SVN IBAL provider code back into the main git tree. Signed-off-by: Sean Hefty --- Actual coding changes were made by Stan. I'm just submitting the patch to update the DAPL git repository.
dapl/dirs | 2 dapl/ibal/dapl_ibal_cm.c | 274 ++++++++++++++++++++++-------------- dapl/ibal/dapl_ibal_cq.c | 4 - dapl/ibal/dapl_ibal_dto.h | 10 + dapl/ibal/dapl_ibal_extensions.c | 17 -- dapl/ibal/dapl_ibal_name_service.h | 2 dapl/ibal/dapl_ibal_qp.c | 12 +- dapl/ibal/dapl_ibal_util.c | 63 ++++++++ dapl/ibal/dapl_ibal_util.h | 85 ++++++++++- 9 files changed, 322 insertions(+), 147 deletions(-) diff --git a/dapl/dirs b/dapl/dirs index a091968..6fb48fa 100644 --- a/dapl/dirs +++ b/dapl/dirs @@ -1 +1 @@ -DIRS=udapl udapl_scm +DIRS = ibal ibal-scm diff --git a/dapl/ibal/dapl_ibal_cm.c b/dapl/ibal/dapl_ibal_cm.c index a986430..fe5501a 100644 --- a/dapl/ibal/dapl_ibal_cm.c +++ b/dapl/ibal/dapl_ibal_cm.c @@ -30,6 +30,7 @@ #include "dapl_ibal_util.h" #include "dapl_name_service.h" #include "dapl_ibal_name_service.h" +#include "dapl_cookie.h" #define IB_INFINITE_SERVICE_LEASE 0xFFFFFFFF #define DAPL_ATS_SERVICE_ID ATS_SERVICE_ID //0x10000CE100415453 @@ -78,15 +79,35 @@ dapli_ib_cm_event_str(ib_cm_events_t e) } -static void -dapli_ibal_listen_err_cb ( - IN ib_listen_err_rec_t *p_listen_err_rec ) +#if defined(DAPL_DBG) + +void dapli_print_private_data( char *prefix, const uint8_t *pd, int len ) { - UNUSED_PARAM( p_listen_err_rec ); + int i; + + if ( !pd || len <= 0 ) + return; + + dapl_log ( DAPL_DBG_TYPE_CM, "--> %s: private_data:\n ",prefix); - dapl_dbg_log (DAPL_DBG_TYPE_CM, "--> %s: CM callback listen error\n", - "DiLEcb"); + if (len > IB_MAX_REP_PDATA_SIZE) + { + dapl_log ( DAPL_DBG_TYPE_ERR, + " Private data size(%d) > Max(%d), ignored.\n ", + len,DAPL_MAX_PRIVATE_DATA_SIZE); + len = IB_MAX_REP_PDATA_SIZE; + } + + for ( i = 0 ; i < len; i++ ) + { + dapl_log ( DAPL_DBG_TYPE_CM, "%2x ", pd[i]); + if ( ((i+1) % 20) == 0 ) + dapl_log ( DAPL_DBG_TYPE_CM, "\n "); + } + dapl_log ( DAPL_DBG_TYPE_CM, "\n"); } +#endif + static void dapli_ib_cm_apr_cb ( @@ -110,6 +131,7 @@ dapli_ib_cm_lap_cb ( /* * Connection Disconnect Request callback + * We received a DREQ, return a DREP 
(disconnect reply). */ static void @@ -124,23 +146,34 @@ dapli_ib_cm_dreq_cb ( ep_ptr = (DAPL_EP * __ptr64) p_cm_dreq_rec->qp_context; - if ( ep_ptr == NULL || - ep_ptr->header.magic == DAPL_MAGIC_INVALID ) + if ( DAPL_BAD_PTR(ep_ptr) ) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, - "--> DiCDcb: EP %lx invalid or FREED\n", ep_ptr); + "--> %s: BAD_PTR EP %lx\n", __FUNCTION__, ep_ptr); + return; + } + if ( ep_ptr->header.magic != DAPL_MAGIC_EP ) + { + if ( ep_ptr->header.magic == DAPL_MAGIC_INVALID ) + return; + + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "--> %s: EP %p BAD_EP_MAGIC %x != wanted %x\n", + __FUNCTION__, ep_ptr, ep_ptr->header.magic, + DAPL_MAGIC_EP ); return; } + dapl_dbg_log (DAPL_DBG_TYPE_CM, - "--> %s() QP %lx EP %lx state %s sent_discreq %d\n", - __FUNCTION__, ep_ptr->qp_handle, ep_ptr, + "--> %s() EP %p, %s sent_discreq %s\n", + __FUNCTION__,ep_ptr, dapl_get_ep_state_str(ep_ptr->param.ep_state), - ep_ptr->sent_discreq ); + (ep_ptr->sent_discreq == DAT_TRUE ? "TRUE":"FALSE")); dapl_os_lock (&ep_ptr->header.lock); if ( ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED /*|| ( ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECT_PENDING - && ep_ptr->sent_discreq == DAT_FALSE)*/ ) + && ep_ptr->sent_discreq == DAT_TRUE)*/ ) { dapl_os_unlock (&ep_ptr->header.lock); dapl_dbg_log (DAPL_DBG_TYPE_CM, @@ -150,19 +183,18 @@ dapli_ib_cm_dreq_cb ( } ep_ptr->param.ep_state = DAT_EP_STATE_DISCONNECT_PENDING; - ep_ptr->recv_discreq = DAT_TRUE; dapl_os_unlock (&ep_ptr->header.lock); - dapl_os_memzero (&cm_drep, sizeof ( ib_cm_drep_t)); + dapl_os_memzero (&cm_drep, sizeof(ib_cm_drep_t)); /* Could fail if we received reply from other side, no need to retry */ - /* Wait for any transaction in process holding reference */ - while ( dapl_os_atomic_read(&ep_ptr->req_count) && bail-- > 0 ) + /* Wait for any send ops in process holding reference */ + while (dapls_cb_pending(&ep_ptr->req_buffer) && bail-- > 0 ) { dapl_dbg_log (DAPL_DBG_TYPE_CM, "--> DiCDcb: WAIT for EP=%lx req_count(%d) != 
0\n", - ep_ptr, dapl_os_atomic_read(&ep_ptr->req_count)); + ep_ptr, dapls_cb_pending(&ep_ptr->req_buffer)); dapl_os_sleep_usec (100); } @@ -195,6 +227,7 @@ dapli_ib_cm_dreq_cb ( /* * Connection Disconnect Reply callback + * We sent a DREQ and received a DREP. */ static void @@ -207,16 +240,22 @@ dapli_ib_cm_drep_cb ( ep_ptr = (DAPL_EP * __ptr64) p_cm_drep_rec->qp_context; - if ( !ep_ptr || DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) ) + if (p_cm_drep_rec->cm_status) { dapl_dbg_log (DAPL_DBG_TYPE_CM, + "--> %s: DREP cm_status(%s) EP=%p\n", __FUNCTION__, + ib_get_err_str(p_cm_drep_rec->cm_status), ep_ptr); + } + + if ( DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) ) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, "--> %s: BAD EP Handle EP=%lx\n", __FUNCTION__,ep_ptr); return; } dapl_dbg_log (DAPL_DBG_TYPE_CM, - "--> DiCDpcb: QP %lx EP %lx state %s cm_hdl %lx\n", - ep_ptr->qp_handle, ep_ptr, + "--> DiCDpcb: EP %p state %s cm_hdl %p\n",ep_ptr, dapl_get_ep_state_str(ep_ptr->param.ep_state), ep_ptr->cm_handle); @@ -254,6 +293,9 @@ dapli_ib_cm_drep_cb ( } } +/* + * CM reply callback + */ static void dapli_ib_cm_rep_cb ( @@ -268,12 +310,14 @@ dapli_ib_cm_rep_cb ( dapl_os_assert (p_cm_rep_rec != NULL); - dapl_os_memzero (&cm_rtu, sizeof ( ib_cm_rtu_t )); - - dapl_os_assert ( ((DAPL_HEADER * __ptr64) - p_cm_rep_rec->qp_context)->magic == DAPL_MAGIC_EP ); - ep_ptr = (DAPL_EP * __ptr64) p_cm_rep_rec->qp_context; + + if ( DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) ) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, "--> %s: EP %lx invalid or FREED\n", + __FUNCTION__, ep_ptr); + return; + } dapl_dbg_log (DAPL_DBG_TYPE_CM, "--> DiCRpcb: EP = %lx local_max_rdma_read_in %d\n", ep_ptr, p_cm_rep_rec->resp_res); @@ -281,6 +325,7 @@ dapli_ib_cm_rep_cb ( p_ca = (dapl_ibal_ca_t *) ep_ptr->header.owner_ia->hca_ptr->ib_hca_handle; + dapl_os_memzero (&cm_rtu, sizeof ( ib_cm_rtu_t )); cm_rtu.pfn_cm_apr_cb = dapli_ib_cm_apr_cb; cm_rtu.pfn_cm_dreq_cb = dapli_ib_cm_dreq_cb; cm_rtu.p_rtu_pdata = NULL; @@ -302,7 +347,7 @@ 
dapli_ib_cm_rep_cb ( cm_cb_op = IB_CME_CONNECTED; dapl_dbg_log (DAPL_DBG_TYPE_CM, "--> DiCRpcb: EP %lx Connected req_count %d\n", - ep_ptr, dapl_os_atomic_read(&ep_ptr->req_count)); + ep_ptr, dapls_cb_pending(&ep_ptr->req_buffer)); } else { @@ -311,23 +356,10 @@ dapli_ib_cm_rep_cb ( prd_ptr = (DAPL_PRIVATE * __ptr64) p_cm_rep_rec->p_rep_pdata; -#ifdef DAPL_DBG -#if 0 - { - int i; - - dapl_dbg_log ( DAPL_DBG_TYPE_EP, "--> DiCRpcb: private_data: "); - - for ( i = 0 ; i < IB_MAX_REP_PDATA_SIZE ; i++ ) - { - dapl_dbg_log ( DAPL_DBG_TYPE_EP, - "0x%x ", prd_ptr->private_data[i]); - - } - dapl_dbg_log ( DAPL_DBG_TYPE_EP, "\n"); - - } -#endif +#if defined(DAPL_DBG) && 0 + dapli_print_private_data( "DiCRpcb", + prd_ptr->private_data, + IB_MAX_REP_PDATA_SIZE); #endif dapl_evd_connection_callback ( @@ -349,6 +381,13 @@ dapli_ib_cm_rej_cb ( ep_ptr = (DAPL_EP * __ptr64) p_cm_rej_rec->qp_context; + if ( DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) ) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, "--> %s: EP %lx invalid or FREED\n", + __FUNCTION__, ep_ptr); + return; + } + dapl_dbg_log (DAPL_DBG_TYPE_CM, "--> DiCRjcb: EP = %lx QP = %lx rej reason = 0x%x\n", ep_ptr,ep_ptr->qp_handle,CL_NTOH16(p_cm_rej_rec->rej_status)); @@ -589,6 +628,13 @@ dapli_ib_cm_rtu_cb ( ep_ptr = (DAPL_EP * __ptr64) p_cm_rtu_rec->qp_context; + if ( DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) ) + { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, "--> %s: EP %lx invalid or FREED\n", + __FUNCTION__, ep_ptr); + return; + } + dapl_dbg_log (DAPL_DBG_TYPE_CM | DAPL_DBG_TYPE_CALLBACK, "--> DiCRucb: EP %lx QP %lx\n", ep_ptr, ep_ptr->qp_handle); @@ -708,8 +754,8 @@ dapls_ib_connect ( IN DAT_EP_HANDLE ep_handle, IN DAT_IA_ADDRESS_PTR remote_ia_address, IN DAT_CONN_QUAL remote_conn_qual, - IN DAT_COUNT prd_size, - IN DAPL_PRIVATE *prd_ptr ) + IN DAT_COUNT private_data_size, + IN DAT_PVOID private_data ) { DAPL_EP *ep_ptr; DAPL_IA *ia_ptr; @@ -874,8 +920,9 @@ dapls_ib_connect ( cm_req.p_alt_path = NULL; cm_req.h_qp = ep_ptr->qp_handle; cm_req.qp_type 
= IB_QPT_RELIABLE_CONN; - cm_req.p_req_pdata = (uint8_t *) prd_ptr; - cm_req.req_length = (uint8_t)prd_size; + cm_req.p_req_pdata = (uint8_t *) private_data; + cm_req.req_length = (uint8_t) + min(private_data_size,IB_MAX_REQ_PDATA_SIZE); /* cm retry to send this request messages, IB max of 4 bits */ cm_req.max_cm_retries = 15; /* timer outside of call, s/be infinite */ /* qp retry to send any wr */ @@ -886,13 +933,13 @@ dapls_ib_connect ( cm_req.init_depth = (uint8_t)ep_ptr->param.ep_attr.max_rdma_read_out; /* time wait before retrying a pkt after receiving a RNR NAK */ - cm_req.rnr_nak_timeout = 12; /* 163.84ms */ + cm_req.rnr_nak_timeout = IB_RNR_NAK_TIMEOUT; /* * number of time local QP should retry after receiving RNR NACK before * reporting an error */ - cm_req.rnr_retry_cnt = 6; /* 7 is infinite */ + cm_req.rnr_retry_cnt = IB_RNR_RETRY_CNT; cm_req.remote_resp_timeout = 16; /* 250ms */ cm_req.local_resp_timeout = 16; /* 250ms */ @@ -962,13 +1009,11 @@ DAT_RETURN dapls_ib_disconnect ( IN DAPL_EP *ep_ptr, IN DAT_CLOSE_FLAGS disconnect_flags ) { - DAPL_IA *ia_ptr; - ib_api_status_t ib_status; + ib_api_status_t ib_status = IB_SUCCESS; ib_cm_dreq_t cm_dreq; - //UNUSED_PARAM( disconnect_flags ); - dapl_os_assert(ep_ptr); + if ( DAPL_BAD_HANDLE(ep_ptr, DAPL_MAGIC_EP) ) { dapl_dbg_log (DAPL_DBG_TYPE_CM, @@ -982,27 +1027,25 @@ dapls_ib_disconnect ( IN DAPL_EP *ep_ptr, return DAT_SUCCESS; } - dapl_dbg_log (DAPL_DBG_TYPE_CM, "--> %s() DsD: EP %lx QP %lx ep_state %s " - "rx_drq %d tx_drq %d Close %s\n", __FUNCTION__, - ep_ptr, ep_ptr->qp_handle, - dapl_get_ep_state_str (ep_ptr->param.ep_state), + dapl_dbg_log (DAPL_DBG_TYPE_CM, + "--> %s() EP %p %s rx_drq %d tx_drq %d Close %s\n", __FUNCTION__, + ep_ptr, dapl_get_ep_state_str(ep_ptr->param.ep_state), ep_ptr->recv_discreq, ep_ptr->sent_discreq, - (disconnect_flags == DAT_CLOSE_ABRUPT_FLAG - ? "Abrupt":"Graceful")); + (disconnect_flags == DAT_CLOSE_ABRUPT_FLAG ? 
"Abrupt":"Graceful")); - if ( disconnect_flags == DAT_CLOSE_ABRUPT_FLAG ) { - dapl_ep_legacy_post_disconnect(ep_ptr, disconnect_flags); + if ( disconnect_flags == DAT_CLOSE_ABRUPT_FLAG ) + { + if ( ep_ptr->param.ep_state == DAT_EP_STATE_DISCONNECTED ) return DAT_SUCCESS; - } - if ( ep_ptr->param.ep_state != DAT_EP_STATE_CONNECTED ) + if ( ep_ptr->param.ep_state != DAT_EP_STATE_DISCONNECT_PENDING ) { - dapl_dbg_log (DAPL_DBG_TYPE_CM, - "--> DsD: EP %lx NOT connected state %s\n", - ep_ptr, dapl_get_ep_state_str (ep_ptr->param.ep_state)); + dapl_dbg_log(DAPL_DBG_TYPE_CM, + "%s() calling legacy_post_disconnect()\n",__FUNCTION__); + dapl_ep_legacy_post_disconnect(ep_ptr, disconnect_flags); + return DAT_SUCCESS; + } } - ia_ptr = ep_ptr->header.owner_ia; - ib_status = IB_SUCCESS; dapl_os_memzero(&cm_dreq, sizeof(ib_cm_dreq_t)); @@ -1017,19 +1060,32 @@ dapls_ib_disconnect ( IN DAPL_EP *ep_ptr, cm_dreq.p_dreq_pdata = NULL; cm_dreq.flags = IB_FLAGS_SYNC; + /* + * still need to send DREQ (disconnect request)? + */ if ( (ep_ptr->recv_discreq == DAT_FALSE) - && (ep_ptr->sent_discreq == DAT_FALSE) ) + && (ep_ptr->sent_discreq == DAT_FALSE) + && (ep_ptr->qp_state != IB_QPS_RESET) ) { ep_ptr->sent_discreq = DAT_TRUE; - ib_status = ib_cm_dreq ( &cm_dreq ); + /* tolerate INVALID_STATE error as the other side can race ahead and + * generate a DREQ before we do. 
+ */ + if ( ib_status == IB_INVALID_STATE ) + ib_status = IB_SUCCESS; + if (ib_status) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "%s() EP %p ib_cm_dreq() status %s\n", + __FUNCTION__,ep_ptr,ib_get_err_str(ib_status)); + } + if ( ib_status == IB_SUCCESS ) dapl_dbg_log (DAPL_DBG_TYPE_CM, - "--> DsD: EP %lx QP %lx DREQ SENT status %s\n", - ep_ptr, ep_ptr->qp_handle,ib_get_err_str(ib_status)); + "--> DsD: EP %p DREQ SENT\n", ep_ptr); } - - return dapl_ib_status_convert (ib_status); + return ib_status; } @@ -1127,7 +1183,6 @@ dapls_ib_setup_conn_listener ( ib_status = ib_cm_listen ( dapl_ibal_root.h_al, &cm_listen, - dapli_ibal_listen_err_cb, (void *) sp_ptr, &sp_ptr->cm_srvc_handle ); @@ -1256,8 +1311,22 @@ dapls_ib_reject_connection ( IN dp_ib_cm_handle_t ib_cm_handle, cm_rej.rej_status = IB_REJ_USER_DEFINED; cm_rej.p_ari = (ib_ari_t *)&rej_table[reject_reason]; cm_rej.ari_length = (uint8_t)strlen (rej_table[reject_reason]); - cm_rej.p_rej_pdata = NULL; - cm_rej.rej_length = 0; + + if (private_data_size > + dapls_ib_private_data_size(NULL,DAPL_PDATA_CONN_REJ,NULL)) + { + dapl_dbg_log ( DAPL_DBG_TYPE_ERR, + "--> DsRjC: private_data size(%d) > Max(%d)\n", + private_data_size, IB_MAX_REJ_PDATA_SIZE ); + return DAT_ERROR(DAT_INVALID_PARAMETER, DAT_INVALID_ARG3); + } + + cm_rej.p_rej_pdata = private_data; + cm_rej.rej_length = private_data_size; + +#if defined(DAPL_DBG) && 0 + dapli_print_private_data("DsRjC",private_data,private_data_size); +#endif ib_status = ib_cm_rej ( *ib_cm_handle, &cm_rej); @@ -1269,7 +1338,6 @@ dapls_ib_reject_connection ( IN dp_ib_cm_handle_t ib_cm_handle, } return ( dapl_ib_status_convert ( ib_status ) ); - } @@ -1286,7 +1354,6 @@ dapli_query_qp( ib_qp_handle_t qp_handle, ib_qp_attr_t *qpa ) dapl_dbg_log ( DAPL_DBG_TYPE_ERR,"ib_query_qp(%lx) '%s'\n", qp_handle, ib_get_err_str(ib_status) ); } -#if 1 else { dapl_dbg_log ( DAPL_DBG_TYPE_CM, "--> QP(%lx) state %s " @@ -1297,7 +1364,6 @@ dapli_query_qp( ib_qp_handle_t qp_handle, ib_qp_attr_t *qpa ) 
qpa->init_depth, qpa->access_ctrl ); } -#endif } #endif @@ -1310,8 +1376,8 @@ dapli_query_qp( ib_qp_handle_t qp_handle, ib_qp_attr_t *qpa ) * Input: * cr_handle * ep_handle - * private_data_size - ignored as DAT layer sets 0 - * private_data - ignored as DAT layer sets NULL + * private_data_size + * private_data * * Output: * none @@ -1326,8 +1392,8 @@ DAT_RETURN dapls_ib_accept_connection ( IN DAT_CR_HANDLE cr_handle, IN DAT_EP_HANDLE ep_handle, - IN DAT_COUNT p_size, - IN DAPL_PRIVATE *prd_ptr ) + IN DAT_COUNT private_data_size, + IN const DAT_PVOID private_data ) { DAPL_CR *cr_ptr; DAPL_EP *ep_ptr; @@ -1414,24 +1480,21 @@ dapls_ib_accept_connection ( cm_rep.h_qp = ep_ptr->qp_handle; cm_rep.qp_type = IB_QPT_RELIABLE_CONN; - cm_rep.p_rep_pdata = (uint8_t *) cr_ptr->private_data; - cm_rep.rep_length = IB_REQ_PDATA_SIZE; - -#if defined(DAPL_DBG) && 0 - { - int i; - - dapl_dbg_log ( DAPL_DBG_TYPE_EP, "--> DsAC: private_data: "); - for ( i = 0 ; i < IB_MAX_REP_PDATA_SIZE ; i++ ) - { - dapl_dbg_log ( DAPL_DBG_TYPE_EP, - "0x%x ", prd_ptr->private_data[i]); + if (private_data_size > IB_MAX_REP_PDATA_SIZE) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "--> DsIBAC: private_data_size(%d) > Max(%d)\n", + private_data_size, IB_MAX_REP_PDATA_SIZE); + return DAT_ERROR(DAT_LENGTH_ERROR, DAT_NO_SUBTYPE); } - dapl_dbg_log ( DAPL_DBG_TYPE_EP, "\n"); + cm_rep.p_rep_pdata = (const uint8_t *)private_data; + cm_rep.rep_length = private_data_size; - } +#if defined(DAPL_DBG) && 0 + dapli_print_private_data( "DsIBAC", + (const uint8_t*)private_data, + private_data_size ); #endif cm_rep.pfn_cm_rej_cb = dapli_ib_cm_rej_cb; @@ -1467,15 +1530,15 @@ dapls_ib_accept_connection ( cm_rep.flags = 0; cm_rep.failover_accepted = IB_FAILOVER_ACCEPT_UNSUPPORTED; cm_rep.target_ack_delay = 14; - cm_rep.rnr_nak_timeout = 12; - cm_rep.rnr_retry_cnt = 6; + cm_rep.rnr_nak_timeout = IB_RNR_NAK_TIMEOUT; + cm_rep.rnr_retry_cnt = IB_RNR_RETRY_CNT; cm_rep.pp_recv_failure = NULL; cm_rep.p_recv_wr = NULL; dapl_dbg_log 
(DAPL_DBG_TYPE_CM, "--> DsIBAC: cm_rep: acc %x init %d qp_type %x req_count %d\n", cm_rep.access_ctrl, cm_rep.init_depth,cm_rep.qp_type, - dapl_os_atomic_read(&ep_ptr->req_count)); + dapls_cb_pending(&ep_ptr->req_buffer)); ib_status = ib_cm_rep ( *ep_ptr->cm_handle, &cm_rep ); @@ -1681,9 +1744,9 @@ dapls_ib_cr_handoff ( * Return the size of private data given a connection op type * * Input: - * hca_ptr hca pointer, needed for transport type * prd_ptr private data pointer * conn_op connection operation type + * hca_ptr hca pointer, needed for transport type * * If prd_ptr is NULL, this is a query for the max size supported by * the provider, otherwise it is the actual size of the private data @@ -1700,13 +1763,14 @@ dapls_ib_cr_handoff ( */ int dapls_ib_private_data_size ( - IN DAPL_HCA *hca_ptr, IN DAPL_PRIVATE *prd_ptr, - IN DAPL_PDATA_OP conn_op) + IN DAPL_PDATA_OP conn_op, + IN DAPL_HCA *hca_ptr) { int size; UNUSED_PARAM( prd_ptr ); + UNUSED_PARAM( hca_ptr ); switch (conn_op) { diff --git a/dapl/ibal/dapl_ibal_cq.c b/dapl/ibal/dapl_ibal_cq.c index 2ac5814..28de045 100644 --- a/dapl/ibal/dapl_ibal_cq.c +++ b/dapl/ibal/dapl_ibal_cq.c @@ -82,7 +82,9 @@ dapli_ibal_cq_async_error_callback ( IN ib_async_event_rec_t *p_err_rec ) /* maps to dapl_evd_cq_async_error_callback(), context is EVD */ evd_cb->pfn_async_cq_err_cb( (ib_hca_handle_t)p_ca, - (ib_error_record_t*)&p_err_rec->code, evd_ptr); + evd_ptr->ib_cq_handle, + (ib_error_record_t*)&p_err_rec->code, + evd_ptr ); } diff --git a/dapl/ibal/dapl_ibal_dto.h b/dapl/ibal/dapl_ibal_dto.h index 283fd91..4694072 100644 --- a/dapl/ibal/dapl_ibal_dto.h +++ b/dapl/ibal/dapl_ibal_dto.h @@ -146,19 +146,16 @@ dapls_ib_post_recv ( { return DAT_SUCCESS; } - else - { + dapl_dbg_log (DAPL_DBG_TYPE_EP, "--> DsPR: post_recv status %s\n", ib_get_err_str(ib_status)); - /* * Moving QP to error state; */ - ib_status = dapls_modify_qp_state_to_error ( ep_ptr->qp_handle); + (void) dapls_modify_qp_state_to_error ( ep_ptr->qp_handle); 
ep_ptr->qp_state = IB_QPS_ERROR; return (dapl_ib_status_convert (ib_status)); - } } @@ -185,6 +182,7 @@ dapls_ib_post_send ( if (ep_ptr->param.ep_state != DAT_EP_STATE_CONNECTED) { ib_qp_attr_t qp_attr; + ib_query_qp ( ep_ptr->qp_handle, &qp_attr ); dapl_dbg_log (DAPL_DBG_TYPE_ERR, "--> DsPS: !CONN EP (%p) ep_state=%d " @@ -274,7 +272,7 @@ dapls_ib_post_send ( /* * Moving QP to error state; */ - ib_status = dapls_modify_qp_state_to_error ( ep_ptr->qp_handle); + (void) dapls_modify_qp_state_to_error ( ep_ptr->qp_handle); ep_ptr->qp_state = IB_QPS_ERROR; return (dapl_ib_status_convert (ib_status)); diff --git a/dapl/ibal/dapl_ibal_extensions.c b/dapl/ibal/dapl_ibal_extensions.c index 48c0dfe..b05c0bb 100644 --- a/dapl/ibal/dapl_ibal_extensions.c +++ b/dapl/ibal/dapl_ibal_extensions.c @@ -194,10 +194,7 @@ dapli_post_ext( IN DAT_EP_HANDLE ep_handle, /* * Synchronization ok since this buffer is only used for send * requests, which aren't allowed to race with each other. - * only if completion is expected */ - if (!(DAT_COMPLETION_SUPPRESS_FLAG & flags)) { - dat_status = dapls_dto_cookie_alloc( &ep_ptr->req_buffer, DAPL_DTO_TYPE_EXTENSION, @@ -207,21 +204,13 @@ dapli_post_ext( IN DAT_EP_HANDLE ep_handle, if ( dat_status != DAT_SUCCESS ) { #ifdef DAPL_DBG - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - "%s() cookie alloc faulure %x\n", + dapl_dbg_log(DAPL_DBG_TYPE_ERR,"%s() cookie alloc faulure %x\n", __FUNCTION__,dat_status); #endif goto bail; } /* - * Take reference before posting to avoid race conditions with - * completions - */ - dapl_os_atomic_inc(&ep_ptr->req_count); - } - - /* * Invoke provider specific routine to post DTO */ dat_status = dapls_ib_post_ext_send(ep_ptr, @@ -237,9 +226,6 @@ dapli_post_ext( IN DAT_EP_HANDLE ep_handle, if (dat_status != DAT_SUCCESS) { - if ( cookie != NULL ) - { - dapl_os_atomic_dec(&ep_ptr->req_count); dapls_cookie_dealloc(&ep_ptr->req_buffer, cookie); #ifdef DAPL_DBG dapl_dbg_log(DAPL_DBG_TYPE_ERR, @@ -247,7 +233,6 @@ dapli_post_ext( IN 
DAT_EP_HANDLE ep_handle, __FUNCTION__,dat_status,__LINE__); #endif } - } bail: return dat_status; diff --git a/dapl/ibal/dapl_ibal_name_service.h b/dapl/ibal/dapl_ibal_name_service.h index e55eead..d322d71 100644 --- a/dapl/ibal/dapl_ibal_name_service.h +++ b/dapl/ibal/dapl_ibal_name_service.h @@ -58,8 +58,6 @@ dapli_ib_sa_query_cb ( IN ib_query_rec_t *p_query_rec ); -//DAT_RETURN dapls_ns_init (void); - #ifdef NO_NAME_SERVICE DAT_RETURN dapls_ns_lookup_address ( diff --git a/dapl/ibal/dapl_ibal_qp.c b/dapl/ibal/dapl_ibal_qp.c index 9f26c82..cc8c394 100644 --- a/dapl/ibal/dapl_ibal_qp.c +++ b/dapl/ibal/dapl_ibal_qp.c @@ -81,7 +81,7 @@ dapli_ib_qp_async_error_cb( IN ib_async_event_rec_t* p_err_rec ) if ((evd_cb == NULL) || (evd_cb->pfn_async_qp_err_cb == NULL)) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, - "--> DiQpAEC: no ERROR cb on %p found \n", p_ca); + "--> DiQpAEC: no ERROR cb on p_ca %p found\n", p_ca); return; } @@ -94,8 +94,9 @@ dapli_ib_qp_async_error_cb( IN ib_async_event_rec_t* p_err_rec ) /* maps to dapl_evd_qp_async_error_callback(), context is EP */ evd_cb->pfn_async_qp_err_cb( (ib_hca_handle_t)p_ca, + ep_ptr->qp_handle, (ib_error_record_t*)&p_err_rec->code, - ep_ptr); + ep_ptr ); } /* @@ -556,9 +557,9 @@ dapls_modify_qp_state_to_rtr ( qp_mod.state.rtr.primary_av.conn.path_mtu = p_port->p_attr->mtu; qp_mod.state.rtr.primary_av.conn.local_ack_timeout = 7; qp_mod.state.rtr.primary_av.conn.seq_err_retry_cnt = 7; - qp_mod.state.rtr.primary_av.conn.rnr_retry_cnt = 7; + qp_mod.state.rtr.primary_av.conn.rnr_retry_cnt = IB_RNR_RETRY_CNT; qp_mod.state.rtr.resp_res = 4; // in-flight RDMAs - qp_mod.state.rtr.rnr_nak_timeout = 7; + qp_mod.state.rtr.rnr_nak_timeout = IB_RNR_NAK_TIMEOUT; ib_status = ib_modify_qp (qp_handle, &qp_mod); @@ -579,7 +580,8 @@ dapls_modify_qp_state_to_rts ( ib_qp_handle_t qp_handle ) qp_mod.req_state = IB_QPS_RTS; qp_mod.state.rts.sq_psn = DAPL_IBAL_START_PSN; qp_mod.state.rts.retry_cnt = 7; - qp_mod.state.rts.rnr_retry_cnt = 6; + 
qp_mod.state.rts.rnr_retry_cnt = IB_RNR_RETRY_CNT; + qp_mod.state.rtr.rnr_nak_timeout = IB_RNR_NAK_TIMEOUT; qp_mod.state.rts.local_ack_timeout = 7; qp_mod.state.rts.init_depth = 4; diff --git a/dapl/ibal/dapl_ibal_util.c b/dapl/ibal/dapl_ibal_util.c index fe9e3b8..ad4acf0 100644 --- a/dapl/ibal/dapl_ibal_util.c +++ b/dapl/ibal/dapl_ibal_util.c @@ -29,7 +29,7 @@ #include "dapl_ring_buffer_util.h" #ifdef DAT_EXTENSIONS -#include +#include #endif #ifndef NO_NAME_SERVICE @@ -1542,6 +1542,63 @@ dapls_ib_setup_async_callback ( /* + * dapls_ib_query_gid + * + * Query the hca for the gid of the 1st active port. + * + * Input: + * hca_handl hca handle + * ep_attr attribute of the ep + * + * Output: + * none + * + * Returns: + * DAT_SUCCESS + * DAT_INVALID_HANDLE + * DAT_INVALID_PARAMETER + */ + +DAT_RETURN +dapls_ib_query_gid( IN DAPL_HCA *hca_ptr, + IN GID *gid ) +{ + dapl_ibal_ca_t *p_ca; + ib_ca_attr_t *p_hca_attr; + ib_api_status_t ib_status; + ib_hca_port_t port_num; + + p_ca = (dapl_ibal_ca_t *) hca_ptr->ib_hca_handle; + + if (p_ca == NULL) + { + dapl_dbg_log ( DAPL_DBG_TYPE_ERR, + "%s() invalid hca_ptr %p", __FUNCTION__, hca_ptr); + return DAT_INVALID_HANDLE; + } + + ib_status = ib_query_ca ( + p_ca->h_ca, + p_ca->p_ca_attr, + &p_ca->ca_attr_size); + if (ib_status != IB_SUCCESS) + { + dapl_dbg_log ( DAPL_DBG_TYPE_ERR, + "%s() ib_query_ca returned failed status = %s\n", + ib_get_err_str(ib_status)); + return dapl_ib_status_convert (ib_status); + } + + p_hca_attr = p_ca->p_ca_attr; + port_num = hca_ptr->port_num - 1; + + gid->gid_prefix = p_hca_attr->p_port_attr[port_num].p_gid_table->unicast.prefix; + gid->guid = p_hca_attr->p_port_attr[port_num].p_gid_table->unicast.interface_id; + return DAT_SUCCESS; +} + + +/* * dapls_ib_query_hca * * Query the hca attribute @@ -2201,7 +2258,9 @@ static DAT_NAMED_ATTR *ib_attrs = NULL; #define SPEC_ATTR_SIZE( x ) 0 #endif -void dapls_query_provider_specific_attr( IN DAT_PROVIDER_ATTR *attr_ptr ) +void 
dapls_query_provider_specific_attr( + IN DAPL_IA *ia_ptr, + IN DAT_PROVIDER_ATTR *attr_ptr ) { attr_ptr->num_provider_specific_attr = SPEC_ATTR_SIZE(ib_attrs); attr_ptr->provider_specific_attr = ib_attrs; diff --git a/dapl/ibal/dapl_ibal_util.h b/dapl/ibal/dapl_ibal_util.h index 033bc8d..52dd879 100644 --- a/dapl/ibal/dapl_ibal_util.h +++ b/dapl/ibal/dapl_ibal_util.h @@ -29,7 +29,7 @@ #include #ifdef DAT_EXTENSIONS -#include +#include #endif /* @@ -62,6 +62,18 @@ typedef void (*ib_async_handler_t)( IN ib_error_record_t *err_code, IN void *context); +typedef void (*ib_async_qp_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_qp_handle_t ib_qp_handle, + IN ib_error_record_t *err_code, + IN void *context); + +typedef void (*ib_async_cq_handler_t)( + IN ib_hca_handle_t ib_hca_handle, + IN ib_cq_handle_t ib_cq_handle, + IN ib_error_record_t *err_code, + IN void *context); + typedef ib_net64_t ib_guid_t; typedef ib_net16_t ib_lid_t; typedef boolean_t ib_bool_t; @@ -112,6 +124,34 @@ typedef struct _ib_hca_name #define IB_MAX_DREQ_PDATA_SIZE 220 #define IB_MAX_DREP_PDATA_SIZE 224 + +/* Resource Not Ready + 1-6 is an actual retry count which is decremented to zero before + an error condition is set. + 7 is 'magic' in that it implies Infinite retry, just keeps trying. 
+*/ +#define IB_RNR_RETRY_CNT 7 + +/* +IB 1.2 spec, page 331, table 45, RNR NAK timeout encoding (5-bits) + +00000=655.36ms(milliseconds) +00001=0.01ms +00010=0.02ms +00011=0.03ms +00100=0.04ms +00101=0.06ms +00110=0.08ms +00111=0.12ms + +11100=163.84ms 28d +11101=245.76ms 29d +11110=327.68ms 30d +11111=491.52ms 31d +*/ +#define IB_RNR_NAK_TIMEOUT 0 + + typedef void (*dapl_ibal_pfn_destructor_t)( IN void* context ); @@ -164,8 +204,8 @@ typedef struct _dapl_ibal_evd_cb { cl_list_item_t next; // peer CA list ib_async_handler_t pfn_async_err_cb; - ib_async_handler_t pfn_async_qp_err_cb; - ib_async_handler_t pfn_async_cq_err_cb; + ib_async_qp_handler_t pfn_async_qp_err_cb; + ib_async_cq_handler_t pfn_async_cq_err_cb; void *context; } dapl_ibal_evd_cb_t; @@ -370,6 +410,7 @@ struct ib_llist_entry }; #ifdef SOCK_CM + typedef enum { IB_THREAD_INIT, @@ -378,7 +419,22 @@ typedef enum IB_THREAD_EXIT } ib_thread_state_t; -#endif + +typedef enum scm_state +{ + SCM_INIT, + SCM_LISTEN, + SCM_CONN_PENDING, + SCM_ACCEPTING, + SCM_ACCEPTED, + SCM_REJECTED, + SCM_CONNECTED, + SCM_DISCONNECTED, + SCM_DESTROY + +} SCM_STATE; + +#endif /* SOCK_CM */ typedef struct _ib_hca_transport { @@ -409,29 +465,40 @@ typedef uint32_t ib_shm_transport_t; /* CM mappings use SOCKETS */ -/* destination info to exchange until real IB CM shows up */ +/* destination info exchanged between dapl, define wire protocol version */ +#define DSCM_VER 2 + typedef struct _ib_qp_cm { - ib_net32_t qpn; + ib_net16_t ver; + ib_net16_t rej; ib_net16_t lid; ib_net16_t port; + ib_net32_t qpn; ib_net32_t p_size; DAT_SOCK_ADDR6 ia_address; + GID gid; } ib_qp_cm_t; struct ib_cm_handle { struct ib_llist_entry entry; + DAPL_OS_LOCK lock; + SCM_STATE state; int socket; int l_socket; - struct dapl_hca *hca_ptr; - DAT_HANDLE cr; + struct dapl_hca *hca; DAT_HANDLE sp; + DAT_HANDLE cr; + struct dapl_ep *ep; ib_qp_cm_t dst; unsigned char p_data[256]; }; -#endif + +DAT_RETURN dapli_init_sock_cm ( IN DAPL_HCA *hca_ptr ); + +#endif /* 
SOCK_CM */ /* * Prototype From sean.hefty at intel.com Fri Jan 30 10:59:17 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 30 Jan 2009 10:59:17 -0800 Subject: [ofa-general] [PATCH 5/5] [DAPL] dapl/ibal-scm: update ibal-scm provider In-Reply-To: References: Message-ID: From: Stan Smith Update the dapl.git tree with the latest SVN version of the ibal-scm provider. Signed-off-by: Sean Hefty --- Actual coding changes were made by Stan. I'm just submitting the patch to update the DAPL git repository. dapl/ibal-scm/dapl_ibal-scm_cm.c | 775 +++++++++++++++++++++++++----------- dapl/ibal-scm/dapl_ibal-scm_util.c | 44 ++ 2 files changed, 584 insertions(+), 235 deletions(-) diff --git a/dapl/ibal-scm/dapl_ibal-scm_cm.c b/dapl/ibal-scm/dapl_ibal-scm_cm.c index df83008..6a050b8 100644 --- a/dapl/ibal-scm/dapl_ibal-scm_cm.c +++ b/dapl/ibal-scm/dapl_ibal-scm_cm.c @@ -63,6 +63,88 @@ #include #include +extern int g_scm_pipe[2]; + +extern DAT_RETURN +dapls_ib_query_gid( IN DAPL_HCA *hca_ptr, + IN GID *gid ); + + +static struct ib_cm_handle * dapli_cm_create(void) +{ + struct ib_cm_handle *cm_ptr; + + /* Allocate CM, init lock, and initialize */ + if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL) + return NULL; + + if (dapl_os_lock_init(&cm_ptr->lock)) + goto bail; + + (void)dapl_os_memzero(cm_ptr, sizeof(*cm_ptr)); + cm_ptr->dst.ver = htons(DSCM_VER); + cm_ptr->socket = -1; + cm_ptr->l_socket = -1; + return cm_ptr; +bail: + dapl_os_free(cm_ptr, sizeof(*cm_ptr)); + return NULL; +} + + +/* mark for destroy, remove all references, schedule cleanup */ + +static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) +{ + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " cm_destroy: cm %p ep %p\n", cm_ptr,cm_ptr->ep); + + /* cleanup, never made it to work queue */ + if (cm_ptr->state == SCM_INIT) { + if (cm_ptr->socket >= 0) + closesocket(cm_ptr->socket); + if (cm_ptr->l_socket >= 0) + closesocket(cm_ptr->l_socket); + dapl_os_free(cm_ptr, sizeof(*cm_ptr)); + return; + } + +
dapl_os_lock(&cm_ptr->lock); + cm_ptr->state = SCM_DESTROY; + if (cm_ptr->ep) + cm_ptr->ep->cm_handle = IB_INVALID_HANDLE; + + /* close socket if still active */ + if (cm_ptr->socket >= 0) { + closesocket(cm_ptr->socket); + cm_ptr->socket = -1; + } + if (cm_ptr->l_socket >= 0) { + closesocket(cm_ptr->l_socket); + cm_ptr->l_socket = -1; + } + dapl_os_unlock(&cm_ptr->lock); + + /* wakeup work thread */ + _write(g_scm_pipe[1], "w", sizeof "w"); +} + + +/* queue socket for processing CM work */ +static void dapli_cm_queue(struct ib_cm_handle *cm_ptr) +{ + /* add to work queue for cr thread processing */ + dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); + dapl_os_lock( &cm_ptr->hca->ib_trans.lock ); + dapl_llist_add_tail((DAPL_LLIST_HEAD*)&cm_ptr->hca->ib_trans.list, + (DAPL_LLIST_ENTRY*)&cm_ptr->entry, (void*)cm_ptr); + dapl_os_unlock(&cm_ptr->hca->ib_trans.lock); + + /* wakeup CM work thread */ + _write(g_scm_pipe[1], "w", sizeof "w"); +} + + static uint16_t dapli_get_lid(IN DAPL_HCA *hca, IN int port) @@ -123,6 +205,263 @@ dapli_get_lid(IN DAPL_HCA *hca, IN int port) /* + * ACTIVE/PASSIVE: called from CR thread or consumer via ep_disconnect + */ +static DAT_RETURN +dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr) +{ + DAPL_EP *ep_ptr = cm_ptr->ep; + DAT_UINT32 disc_data = htonl(0xdead); + + if (ep_ptr == NULL) + return DAT_SUCCESS; + + dapl_os_lock(&cm_ptr->lock); + if ((cm_ptr->state == SCM_INIT) || + (cm_ptr->state == SCM_DISCONNECTED)) { + dapl_os_unlock(&cm_ptr->lock); + return DAT_SUCCESS; + } else { + /* send disc date, close socket, schedule destroy */ + if (cm_ptr->socket >= 0) { + send(cm_ptr->socket, (const char *)&disc_data, + sizeof(disc_data), 0); + closesocket(cm_ptr->socket); + cm_ptr->socket = -1; + } + cm_ptr->state = SCM_DISCONNECTED; + _write(g_scm_pipe[1], "w", sizeof "w"); + } + dapl_os_unlock(&cm_ptr->lock); + + + if (ep_ptr->cr_ptr) { + dapls_cr_callback(cm_ptr, + IB_CME_DISCONNECTED, + NULL, + ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr); + 
} else { + dapl_evd_connection_callback(ep_ptr->cm_handle, + IB_CME_DISCONNECTED, + NULL, + ep_ptr); + } + + /* remove reference from endpoint */ + ep_ptr->cm_handle = NULL; + + /* schedule destroy */ + + + return DAT_SUCCESS; +} + + + +/* + * PASSIVE: consumer accept, send local QP information, private data, + * queue on work thread to receive RTU information to avoid blocking + * user thread. + */ +static DAT_RETURN +dapli_socket_accept_usr( DAPL_EP *ep_ptr, + DAPL_CR *cr_ptr, + DAT_COUNT p_size, + DAT_PVOID p_data ) +{ + DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; + WSABUF iovec[2]; + int len, rc; + short rtu_data = 0; + ib_api_status_t ibs; + ib_qp_attr_t qpa; + dapl_ibal_port_t *p_port; + dapl_ibal_ca_t *p_ca; + + dapl_dbg_log (DAPL_DBG_TYPE_EP, "%s() p_sz %d sock %d port 0x%x\n", + __FUNCTION__,p_size,cm_ptr->socket, + ia_ptr->hca_ptr->port_num); + + if (p_size > IB_MAX_REP_PDATA_SIZE) + return DAT_LENGTH_ERROR; + + /* must have a accepted socket */ + if ( cm_ptr->socket < 0 ) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + "%s() Not accepted socket? 
remote port=0x%x lid=0x%x" + " qpn=0x%x psize=%d\n", + cm_ptr->dst.port, cm_ptr->dst.lid, + ntohs(cm_ptr->dst.qpn), cm_ptr->dst.p_size); + return DAT_INTERNAL_ERROR; + } + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " accept_usr: remote port=0x%x lid=0x%x" + " qpn=0x%x psize=%d\n", + cm_ptr->dst.port, cm_ptr->dst.lid, + ntohs(cm_ptr->dst.qpn), cm_ptr->dst.p_size); + + /* modify QP to RTR and then to RTS with remote info already read */ + + p_ca = (dapl_ibal_ca_t *) ia_ptr->hca_ptr->ib_hca_handle; + p_port = dapli_ibal_get_port (p_ca, (uint8_t)ia_ptr->hca_ptr->port_num); + if (!p_port) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "%s() dapli_ibal_get_port() failed @ line #%d\n", + __FUNCTION__,__LINE__); + goto bail; + } + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + "%s() DST: qpn 0x%x port 0x%x lid %x psize %d\n", + __FUNCTION__, + cl_ntoh32(cm_ptr->dst.qpn), + cm_ptr->dst.port, + cl_ntoh16(cm_ptr->dst.lid), cm_ptr->dst.p_size); + + /* modify QP to RTR and then to RTS with remote info */ + + ibs = dapls_modify_qp_state_to_rtr( ep_ptr->qp_handle, + cm_ptr->dst.qpn, + cm_ptr->dst.lid, + p_port ); + if (ibs != IB_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "%s() QP --> RTR failed @ line #%d\n", + __FUNCTION__,__LINE__); + goto bail; + } + + if ( dapls_modify_qp_state_to_rts( ep_ptr->qp_handle ) ) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "%s() QP --> RTS failed @ line #%d\n", + __FUNCTION__,__LINE__); + goto bail; + } + + ep_ptr->qp_state = IB_QP_STATE_RTS; + + /* save remote address information */ + dapl_os_memcpy( &ep_ptr->remote_ia_address, + &cm_ptr->dst.ia_address, + sizeof(ep_ptr->remote_ia_address)); + + /* determine QP & port numbers */ + ibs = ib_query_qp(ep_ptr->qp_handle, &qpa); + if (ibs != IB_SUCCESS) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " ib_query_qp() ERR %s\n", ib_get_err_str(ibs)); + goto bail; + } + + /* Send our QP info, IA address, and private data */ + cm_ptr->dst.qpn = qpa.num; /* ib_net32_t */ + cm_ptr->dst.port = ia_ptr->hca_ptr->port_num; + cm_ptr->dst.lid = 
dapli_get_lid(ia_ptr->hca_ptr, ia_ptr->hca_ptr->port_num); + /* set gid in network order */ + ibs = dapls_ib_query_gid( ia_ptr->hca_ptr, &cm_ptr->dst.gid ); + if ( ibs != IB_SUCCESS ) + { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "%s() dapls_ib_query_gid() returns '%s'\n", + __FUNCTION__,ib_get_err_str(ibs)); + goto bail; + } + + cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; + cm_ptr->dst.p_size = p_size; + + dapl_dbg_log(DAPL_DBG_TYPE_CM, + "%s()\n Tx QP info: qpn %x port 0x%x lid 0x%x p_sz %d IP %s\n", + __FUNCTION__, cl_ntoh32(cm_ptr->dst.qpn), cm_ptr->dst.port, + cl_ntoh16(cm_ptr->dst.lid), cm_ptr->dst.p_size, + dapli_get_ip_addr_str(&cm_ptr->dst.ia_address,NULL)); + + /* network byte-ordering - QPN & LID already are */ + cm_ptr->dst.p_size = cl_hton32(cm_ptr->dst.p_size); + cm_ptr->dst.port = cl_hton16(cm_ptr->dst.port); + + iovec[0].buf = (char*)&cm_ptr->dst; + iovec[0].len = sizeof(ib_qp_cm_t); + if (p_size) { + iovec[1].buf = p_data; + iovec[1].len = p_size; + } + rc = WSASend( cm_ptr->socket, iovec, (p_size ? 
2:1), &len, 0, 0, 0 ); + if (rc || len != (p_size + sizeof(ib_qp_cm_t))) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept_usr: ERR %d, wcnt=%d\n", + WSAGetLastError(), len); + goto bail; + } + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " accept_usr: local port=0x%x lid=0x%x" + " qpn=0x%x psize=%d\n", + ntohs(cm_ptr->dst.port), ntohs(cm_ptr->dst.lid), + ntohl(cm_ptr->dst.qpn), ntohl(cm_ptr->dst.p_size)); + + /* save state and reference to EP, queue for RTU data */ + cm_ptr->ep = ep_ptr; + cm_ptr->hca = ia_ptr->hca_ptr; + cm_ptr->state = SCM_ACCEPTED; + + /* restore remote address information for query */ + dapl_os_memcpy( &cm_ptr->dst.ia_address, + &ep_ptr->remote_ia_address, + sizeof(cm_ptr->dst.ia_address)); + + dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: accepted!\n" ); + dapli_cm_queue(cm_ptr); + + return DAT_SUCCESS; + +bail: + dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_usr: ERR !QP_RTR_RTS \n"); + dapli_cm_destroy(cm_ptr); + dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ + + return DAT_INTERNAL_ERROR; +} + + +/* + * PASSIVE: read RTU from active peer, post CONN event + */ +void +dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) +{ + int len; + short rtu_data = 0; + + /* complete handshake after final QP state change */ + len = recv(cm_ptr->socket, (char*)&rtu_data, sizeof(rtu_data), 0); + if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + " accept_rtu: ERR %d, rcnt=%d rdata=%x\n", + WSAGetLastError(), len, ntohs(rtu_data) ); + goto bail; + } + + /* save state and reference to EP, queue for disc event */ + cm_ptr->state = SCM_CONNECTED; + + /* final data exchange if remote QP state is good to go */ + dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); + dapls_cr_callback(cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp); + return; +bail: + dapls_ib_reinit_ep(cm_ptr->ep); /* reset QP state */ + dapli_cm_destroy(cm_ptr); + dapls_cr_callback(cm_ptr, IB_CME_DESTINATION_REJECT, NULL, cm_ptr->sp); +} + + +/* * ACTIVE: Create 
socket, connect, and exchange QP information */ static DAT_RETURN @@ -143,21 +482,16 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, dapl_ibal_port_t *p_port; dapl_ibal_ca_t *p_ca; - dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d\n", r_qual); + dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d psize %d\n", + r_qual, p_size); - /* - * Allocate CM and initialize - */ - if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL ) { + cm_ptr = dapli_cm_create(); + if (cm_ptr == NULL) return DAT_INSUFFICIENT_RESOURCES; - } - - (void) dapl_os_memzero( cm_ptr, sizeof(*cm_ptr) ); - cm_ptr->socket = -1; /* create, connect, sockopt, and exchange QP information */ if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) { - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapli_cm_destroy(cm_ptr); return DAT_INSUFFICIENT_RESOURCES; } @@ -166,7 +500,7 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, if (connect(cm_ptr->socket, r_addr, sizeof(*r_addr)) == SOCKET_ERROR) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, " connect: %d on r_qual %d\n", WSAGetLastError(), (unsigned int)r_qual); - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapli_cm_destroy(cm_ptr); return DAT_INVALID_ADDRESS; } @@ -175,6 +509,8 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, (const char*)&opt, sizeof(opt) ); + dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket connected!\n"); + /* determine QP & port numbers */ ibs = ib_query_qp(ep_ptr->qp_handle, &qpa); if (ibs != IB_SUCCESS) @@ -187,7 +523,6 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, /* Send QP info, IA address and private data */ cm_ptr->dst.qpn = qpa.num; /* ib_net32_t */ cm_ptr->dst.port = cl_hton16(ia_ptr->hca_ptr->port_num); - cm_ptr->dst.lid = dapli_get_lid( ia_ptr->hca_ptr, ia_ptr->hca_ptr->port_num ); if (cm_ptr->dst.lid == 0) @@ -197,6 +532,17 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, __FUNCTION__, __LINE__); goto bail; } + + /* set gid in network order */ + ibs = dapls_ib_query_gid( ia_ptr->hca_ptr, &cm_ptr->dst.gid ); + if ( ibs != IB_SUCCESS ) + { + 
dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "%s() dapls_ib_query_gid() returns '%s'\n", + __FUNCTION__,ib_get_err_str(ibs)); + goto bail; + } + cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; cm_ptr->dst.p_size = cl_hton32(p_size); @@ -213,6 +559,8 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, iovec[1].buf = p_data; iovec[1].len = p_size; } + + dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, write QP and private data\n"); rc = WSASend (cm_ptr->socket, iovec, (p_size ? 2:1), &len, 0, 0, NULL); if ( rc || len != (p_size + sizeof(ib_qp_cm_t)) ) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, @@ -225,17 +573,65 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, cm_ptr->dst.port, cm_ptr->dst.lid, cm_ptr->dst.qpn, cm_ptr->dst.p_size ); + /* queue up to work thread to avoid blocking consumer */ + cm_ptr->state = SCM_CONN_PENDING; + cm_ptr->hca = ia_ptr->hca_ptr; + cm_ptr->ep = ep_ptr; + dapli_cm_queue(cm_ptr); + return DAT_SUCCESS; +bail: + /* close socket, free cm structure */ + dapli_cm_destroy(cm_ptr); + return DAT_INTERNAL_ERROR; +} + + +/* + * ACTIVE: exchange QP information, called from CR thread + */ +void +dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) +{ + DAPL_EP *ep_ptr = cm_ptr->ep; + DAPL_IA *ia_ptr = cm_ptr->ep->header.owner_ia; + int len, rc; + DWORD ioflags; + WSABUF iovec[1]; + short rtu_data = htons(0x0E0F); + ib_cm_events_t event = IB_CME_DESTINATION_REJECT; + ib_api_status_t ibs; + dapl_ibal_port_t *p_port; + dapl_ibal_ca_t *p_ca; + /* read DST information into cm_ptr, overwrite SRC info */ + dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: recv peer QP data\n"); + + iovec[0].buf = (char*)&cm_ptr->dst; + iovec[0].len = sizeof(ib_qp_cm_t); ioflags = len = 0; rc = WSARecv (cm_ptr->socket, iovec, 1, &len, &ioflags, 0, 0); - if ( rc == SOCKET_ERROR || len != sizeof(ib_qp_cm_t) ) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR,"connect read: ERR %d rcnt=%d\n", - WSAGetLastError(), len); + if ( rc == SOCKET_ERROR || len != sizeof(ib_qp_cm_t) || + ntohs(cm_ptr->dst.ver) != DSCM_VER ) + { + 
dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "connect_rtu read: ERR %d rcnt=%d ver=%d\n", + WSAGetLastError(), len, cm_ptr->dst.ver); + goto bail; + } + + /* check for consumer reject */ + if (cm_ptr->dst.rej) { + dapl_dbg_log(DAPL_DBG_TYPE_CM, + " connect_rtu read: PEER REJ reason=0x%x\n", + ntohs(cm_ptr->dst.rej)); + event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA; goto bail; } - /* revert back to host byte ordering */ + /* convert peer response values to host order */ cm_ptr->dst.port = cl_ntoh16(cm_ptr->dst.port); + cm_ptr->dst.lid = ntohs(cm_ptr->dst.lid); + cm_ptr->dst.qpn = cm_ptr->dst.qpn; cm_ptr->dst.p_size = cl_ntoh32(cm_ptr->dst.p_size); dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: Rx DST: qpn %x port %d " @@ -245,15 +641,27 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, cm_ptr->dst.p_size, dapli_get_ip_addr_str(&cm_ptr->dst.ia_address,NULL)); + /* save remote address information */ + dapl_os_memcpy( &ep_ptr->remote_ia_address, + &cm_ptr->dst.ia_address, + sizeof(ep_ptr->remote_ia_address)); + + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " connect_rtu: DST %s port=0x%x lid=0x%x, qpn=0x%x, psize=%d\n", + inet_ntoa(((struct sockaddr_in *)&cm_ptr->dst.ia_address)->sin_addr), + cm_ptr->dst.port, cm_ptr->dst.lid, + cm_ptr->dst.qpn, cm_ptr->dst.p_size); + /* validate private data size before reading */ - if ( cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE ) { + if (cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE) { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " connect read: psize (%d) wrong\n", + " connect_rtu read: psize (%d) wrong\n", cm_ptr->dst.p_size ); goto bail; } /* read private data into cm_handle if any present */ + dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, read private data\n"); if ( cm_ptr->dst.p_size ) { iovec[0].buf = cm_ptr->p_data; iovec[0].len = cm_ptr->dst.p_size; @@ -300,32 +708,29 @@ dapli_socket_connect ( DAPL_EP *ep_ptr, ep_ptr->qp_state = IB_QP_STATE_RTS; + dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: send RTU\n"); + /* complete handshake after final QP state change */ 
send(cm_ptr->socket, (const char *)&rtu_data, sizeof(rtu_data), 0); /* init cm_handle and post the event with private data */ ep_ptr->cm_handle = cm_ptr; + cm_ptr->state = SCM_CONNECTED; dapl_dbg_log( DAPL_DBG_TYPE_EP," ACTIVE: connected!\n" ); dapl_evd_connection_callback( ep_ptr->cm_handle, IB_CME_CONNECTED, cm_ptr->p_data, ep_ptr ); - return DAT_SUCCESS; - + return; bail: /* close socket, free cm structure and post error event */ - if ( cm_ptr->socket >= 0 ) - closesocket(cm_ptr->socket); - - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); - dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ - - dapl_evd_connection_callback( ep_ptr->cm_handle, - IB_CME_LOCAL_FAILURE, + dapli_cm_destroy(cm_ptr); + dapls_ib_reinit_ep(ep_ptr); /* reset QP state */ + dapl_evd_connection_callback( NULL /*ep_ptr->cm_handle*/, + event, NULL, ep_ptr ); - return DAT_INTERNAL_ERROR; } @@ -347,14 +752,12 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, ia_ptr, serviceID, sp_ptr); /* Allocate CM and initialize */ - if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL) + cm_ptr = dapli_cm_create(); + if (cm_ptr == NULL) return DAT_INSUFFICIENT_RESOURCES; - (void) dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); - - cm_ptr->socket = cm_ptr->l_socket = -1; cm_ptr->sp = sp_ptr; - cm_ptr->hca_ptr = ia_ptr->hca_ptr; + cm_ptr->hca = ia_ptr->hca_ptr; /* bind, listen, set sockopt, accept, exchange data */ if ((cm_ptr->l_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { @@ -406,12 +809,9 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, /* set cm_handle for this service point, save listen socket */ sp_ptr->cm_srvc_handle = cm_ptr; - /* add to SP->CR thread list */ - dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); - dapl_os_lock( &cm_ptr->hca_ptr->ib_trans.lock ); - dapl_llist_add_tail((DAPL_LLIST_HEAD*)&cm_ptr->hca_ptr->ib_trans.list, - (DAPL_LLIST_ENTRY*)&cm_ptr->entry, (void*)cm_ptr); - dapl_os_unlock(&cm_ptr->hca_ptr->ib_trans.lock); + /* queue up listen socket to process inbound CR's */ + cm_ptr->state = 
SCM_LISTEN; + dapli_cm_queue(cm_ptr); dapl_dbg_log( DAPL_DBG_TYPE_CM, " listen: qual 0x%x cr %p s_fd %d\n", @@ -421,10 +821,7 @@ dapli_socket_listen ( DAPL_IA *ia_ptr, bail: dapl_dbg_log( DAPL_DBG_TYPE_CM, " listen: ERROR on conn_qual 0x%x\n",serviceID); - if ( cm_ptr->l_socket >= 0 ) - closesocket( cm_ptr->l_socket ); - - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); + dapli_cm_destroy(cm_ptr); return dat_status; } @@ -441,6 +838,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) int len; DAT_RETURN dat_status = DAT_SUCCESS; + dapl_dbg_log(DAPL_DBG_TYPE_EP," socket_accept\n"); + /* Allocate accept CM and initialize */ if ((acm_ptr = dapl_os_alloc(sizeof(*acm_ptr))) == NULL) return DAT_INSUFFICIENT_RESOURCES; @@ -448,8 +847,9 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) (void) dapl_os_memzero( acm_ptr, sizeof( *acm_ptr ) ); acm_ptr->socket = -1; + acm_ptr->l_socket = -1; acm_ptr->sp = cm_ptr->sp; - acm_ptr->hca_ptr = cm_ptr->hca_ptr; + acm_ptr->hca = cm_ptr->hca; len = sizeof(acm_ptr->dst.ia_address); acm_ptr->socket = accept(cm_ptr->l_socket, @@ -464,27 +864,32 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) goto bail; } + dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read QP data\n"); + /* read in DST QP info, IA address. 
check for private data */ len = recv(acm_ptr->socket,(char*)&acm_ptr->dst,sizeof(ib_qp_cm_t),0); - if ( len != sizeof(ib_qp_cm_t) ) { + if ( len != sizeof(ib_qp_cm_t) || ntohs(acm_ptr->dst.ver) != DSCM_VER ) + { dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " accept read: ERR %d, rcnt=%d\n", - WSAGetLastError(), len); + " accept read: ERR %d, rcnt=%d ver=%d\n", + WSAGetLastError(), len, acm_ptr->dst.ver); dat_status = DAT_INTERNAL_ERROR; goto bail; } - /* revert back to host byte ordering */ + /* convert accepted values to host byte ordering */ acm_ptr->dst.port = cl_ntoh16(acm_ptr->dst.port); + acm_ptr->dst.lid = ntohs(acm_ptr->dst.lid); + acm_ptr->dst.qpn = acm_ptr->dst.qpn; acm_ptr->dst.p_size = cl_ntoh32(acm_ptr->dst.p_size); - dapl_dbg_log(DAPL_DBG_TYPE_EP, " accept: DST sizeof(ib_cm_t) %d qpn %x " - "port %d lid 0x%x psize %d IP %s\n", - sizeof(ib_qp_cm_t), - cl_ntoh32(acm_ptr->dst.qpn), acm_ptr->dst.port, + dapl_dbg_log(DAPL_DBG_TYPE_EP, " accept: DST %s port 0x%x " + "lid 0x%x qpn 0x%x psize %d\n", + dapli_get_ip_addr_str(&acm_ptr->dst.ia_address,NULL), + acm_ptr->dst.port, cl_ntoh16(acm_ptr->dst.lid), - acm_ptr->dst.p_size, - dapli_get_ip_addr_str(&acm_ptr->dst.ia_address,NULL)); + cl_ntoh32(acm_ptr->dst.qpn), + acm_ptr->dst.p_size); /* validate private data size before reading */ if ( acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) { @@ -495,6 +900,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) goto bail; } + dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read private data\n"); + /* read private data into cm_handle if any present */ if ( acm_ptr->dst.p_size ) { len = recv( acm_ptr->socket, @@ -514,6 +921,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) p_data = acm_ptr->p_data; } + acm_ptr->state = SCM_ACCEPTING; + /* trigger CR event and return SUCCESS */ dapls_cr_callback( acm_ptr, IB_CME_CONNECTION_REQUEST_PENDING, @@ -521,153 +930,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) acm_ptr->sp ); return DAT_SUCCESS; - -bail: - if ( acm_ptr->socket >= 0 
) - closesocket( acm_ptr->socket ); - - dapl_os_free( acm_ptr, sizeof( *acm_ptr ) ); - return DAT_INTERNAL_ERROR; -} - - -static DAT_RETURN -dapli_socket_accept_final( DAPL_EP *ep_ptr, - DAPL_CR *cr_ptr, - DAT_COUNT p_size, - DAT_PVOID p_data ) -{ - DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; - dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; - ib_qp_cm_t qp_cm; - WSABUF iovec[2]; - int len, rc; - short rtu_data = 0; - ib_api_status_t ibs; - ib_qp_attr_t qpa; - dapl_ibal_port_t *p_port; - dapl_ibal_ca_t *p_ca; - - dapl_dbg_log (DAPL_DBG_TYPE_EP, "%s() p_sz %d sock %d port %d\n", - __FUNCTION__,p_size,cm_ptr->socket, - ia_ptr->hca_ptr->port_num); - - if (p_size > IB_MAX_REP_PDATA_SIZE) - return DAT_LENGTH_ERROR; - - /* must have a accepted socket */ - if ( cm_ptr->socket < 0 ) - return DAT_INTERNAL_ERROR; - - /* modify QP to RTR and then to RTS with remote info already read */ - - p_ca = (dapl_ibal_ca_t *) ia_ptr->hca_ptr->ib_hca_handle; - p_port = dapli_ibal_get_port (p_ca, (uint8_t)ia_ptr->hca_ptr->port_num); - if (!p_port) - { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - "%s() dapli_ibal_get_port() failed @ line #%d\n", - __FUNCTION__,__LINE__); - goto bail; - } - - dapl_dbg_log(DAPL_DBG_TYPE_EP, "%s() DST: qpn %x port %d lid %x\n", - __FUNCTION__, - cl_ntoh32(cm_ptr->dst.qpn), - cm_ptr->dst.port, - cl_ntoh16(cm_ptr->dst.lid)); - - /* modify QP to RTR and then to RTS with remote info */ - - ibs = dapls_modify_qp_state_to_rtr( ep_ptr->qp_handle, - cm_ptr->dst.qpn, - cm_ptr->dst.lid, - p_port ); - if (ibs != IB_SUCCESS) - { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - "%s() QP --> RTR failed @ line #%d\n", - __FUNCTION__,__LINE__); - goto bail; - } - - if ( dapls_modify_qp_state_to_rts( ep_ptr->qp_handle ) ) - { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - "%s() QP --> RTS failed @ line #%d\n", - __FUNCTION__,__LINE__); - goto bail; - } - - ep_ptr->qp_state = IB_QP_STATE_RTS; - - /* determine QP & port numbers */ - ibs = ib_query_qp(ep_ptr->qp_handle, &qpa); - if (ibs != IB_SUCCESS) - { - 
dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " ib_query_qp() ERR %s\n", ib_get_err_str(ibs)); - goto bail; - } - - /* Send QP info, IA address, and private data */ - qp_cm.qpn = qpa.num; /* ib_net32_t */ - qp_cm.port = ia_ptr->hca_ptr->port_num; - qp_cm.lid = dapli_get_lid( ia_ptr->hca_ptr, ia_ptr->hca_ptr->port_num ); - qp_cm.ia_address = ia_ptr->hca_ptr->hca_address; - qp_cm.p_size = p_size; - - dapl_dbg_log(DAPL_DBG_TYPE_CM, - "%s()\n Tx QP info: qpn %x port %d lid 0x%x p_sz %d IP %s\n", - __FUNCTION__, cl_ntoh32(qp_cm.qpn), qp_cm.port, - cl_ntoh16(qp_cm.lid), qp_cm.p_size, - dapli_get_ip_addr_str(&qp_cm.ia_address,NULL)); - - /* network byte-ordering - QPN & LID already are */ - qp_cm.p_size = cl_hton32(qp_cm.p_size); - qp_cm.port = cl_hton16(qp_cm.port); - - iovec[0].buf = (char*)&qp_cm; - iovec[0].len = sizeof(ib_qp_cm_t); - if (p_size) { - iovec[1].buf = p_data; - iovec[1].len = p_size; - } - rc = WSASend( cm_ptr->socket, iovec, (p_size ? 2:1), &len, 0, 0, 0 ); - if (rc || len != (p_size + sizeof(ib_qp_cm_t))) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " accept_final: ERR %d, wcnt=%d\n", - WSAGetLastError(), len); - goto bail; - } - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " accept_final: SRC qpn %x port %d lid 0x%x psize %d\n", - qp_cm.qpn, qp_cm.port, qp_cm.lid, qp_cm.p_size ); - - /* complete handshake after final QP state change */ - len = recv(cm_ptr->socket, (char*)&rtu_data, sizeof(rtu_data), 0); - if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { - dapl_dbg_log(DAPL_DBG_TYPE_ERR, - " accept_final: ERR %d, rcnt=%d rdata=%x\n", - WSAGetLastError(), len, ntohs(rtu_data) ); - goto bail; - } - - /* final data exchange if remote QP state is good to go */ - dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); - - dapls_cr_callback ( cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp ); - - return DAT_SUCCESS; - bail: - dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_final: ERR !QP_RTR_RTS \n"); - if ( cm_ptr->socket >= 0 ) - closesocket( cm_ptr->socket ); - - dapl_os_free( 
cm_ptr, sizeof( *cm_ptr ) ); - dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ - + dapli_cm_destroy(acm_ptr); return DAT_INTERNAL_ERROR; } @@ -747,11 +1011,7 @@ dapls_ib_disconnect ( dapl_dbg_log (DAPL_DBG_TYPE_EP, "dapls_ib_disconnect(ep_handle %p ....)\n", ep_ptr); - if ( cm_ptr->socket >= 0 ) { - closesocket( cm_ptr->socket ); - cm_ptr->socket = -1; - } - +#if 0 // XXX /* disconnect QP ala transition to RESET state */ ib_status = dapls_modify_qp_state_to_reset (ep_ptr->qp_handle); @@ -776,15 +1036,18 @@ dapls_ib_disconnect ( NULL, ep_ptr ); ep_ptr->cm_handle = NULL; - dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); } - +#endif /* modify QP state --> INIT */ dapls_ib_reinit_ep(ep_ptr); + if (cm_ptr == NULL) return DAT_SUCCESS; + else + return dapli_socket_disconnect(cm_ptr); } + /* * dapls_ib_disconnect_clean * @@ -874,14 +1137,20 @@ dapls_ib_remove_conn_listener ( if ( cm_ptr != NULL ) { if ( cm_ptr->l_socket >= 0 ) { closesocket( cm_ptr->l_socket ); + cm_ptr->l_socket = -1; + } + if ( cm_ptr->socket >= 0 ) { + closesocket( cm_ptr->socket ); cm_ptr->socket = -1; } /* cr_thread will free */ sp_ptr->cm_srvc_handle = NULL; + _write(g_scm_pipe[1], "w", sizeof "w"); } return DAT_SUCCESS; } + /* * dapls_ib_accept_connection * @@ -928,7 +1197,7 @@ dapls_ib_accept_connection ( return status; } - return ( dapli_socket_accept_final(ep_ptr, cr_ptr, p_size, p_data) ); + return ( dapli_socket_accept_usr(ep_ptr, cr_ptr, p_size, p_data) ); } @@ -948,27 +1217,39 @@ dapls_ib_accept_connection ( * DAT_INTERNAL_ERROR * */ + DAT_RETURN dapls_ib_reject_connection ( - IN dp_ib_cm_handle_t ib_cm_handle, + IN dp_ib_cm_handle_t cm_ptr, IN int reject_reason, - IN DAT_COUNT private_data_size, - IN const DAT_PVOID private_data) + IN DAT_COUNT psize, + IN const DAT_PVOID pdata) { - ib_cm_srvc_handle_t cm_ptr = ib_cm_handle; + WSABUF iovec[1]; + int len; dapl_dbg_log (DAPL_DBG_TYPE_EP, - "dapls_ib_reject_connection(cm_handle %p reason %x)\n", - ib_cm_handle, reject_reason ); - - /* just close 
the socket and return */ - if ( cm_ptr->socket > 0 ) { - closesocket( cm_ptr->socket ); + " reject(cm %p reason %x pdata %p psize %d)\n", + cm_ptr, reject_reason, pdata, psize ); + + /* write reject data to indicate reject */ + if (cm_ptr->socket >= 0) { + cm_ptr->dst.rej = (uint16_t)reject_reason; + cm_ptr->dst.rej = cl_hton16(cm_ptr->dst.rej); + iovec[0].buf = (char*)&cm_ptr->dst; + iovec[0].len = sizeof(ib_qp_cm_t); + (void) WSASend (cm_ptr->socket, iovec, 1, &len, 0, 0, NULL); + closesocket(cm_ptr->socket); cm_ptr->socket = -1; } + + /* cr_thread will destroy CR */ + cm_ptr->state = SCM_REJECTED; + _write(g_scm_pipe[1], "w", sizeof "w"); return DAT_SUCCESS; } + /* * dapls_ib_cm_remote_addr * @@ -1157,7 +1438,7 @@ dapls_ib_get_dat_event ( /* - * dapls_ib_get_dat_event + * dapls_ib_get_cm_event * * Return a DAT connection event given a provider CM event. * @@ -1189,12 +1470,16 @@ dapls_ib_get_cm_event ( } #endif /* NOT_USED */ -/* async CR processing thread to avoid blocking applications */ +/* outbound/inbound CR processing thread to avoid blocking applications */ + +#define SCM_MAX_CONN (8 * sizeof(fd_set)) + void cr_thread(void *arg) { struct dapl_hca *hca_ptr = arg; ib_cm_srvc_handle_t cr, next_cr; int max_fd, rc; + char rbuf[2]; fd_set rfd, rfds; struct timeval to; @@ -1202,10 +1487,12 @@ void cr_thread(void *arg) dapl_os_lock( &hca_ptr->ib_trans.lock ); hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; + while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { FD_ZERO( &rfds ); - max_fd = -1; + FD_SET(g_scm_pipe[0], &rfds); + max_fd = g_scm_pipe[0]; if (!dapl_llist_is_empty((DAPL_LLIST_HEAD*)&hca_ptr->ib_trans.list)) next_cr = dapl_llist_peek_head((DAPL_LLIST_HEAD*) @@ -1230,32 +1517,46 @@ void cr_thread(void *arg) continue; } - FD_SET( cr->l_socket, &rfds ); /* add to select set */ - if ( cr->l_socket > max_fd ) + if (cr->socket > SCM_MAX_CONN-1) { + dapl_dbg_log(DAPL_DBG_TYPE_ERR, + "SCM ERR: cr->socket(%d) exceeded FD_SETSIZE %d\n", + cr->socket,SCM_MAX_CONN-1); 
+ continue; + } + FD_SET( cr->socket, &rfds ); /* add to select SET */ + if ( cr->socket > max_fd ) max_fd = cr->l_socket; /* individual select poll to check for work */ FD_ZERO(&rfd); - FD_SET(cr->l_socket, &rfd); + FD_SET(cr->socket, &rfd); dapl_os_unlock(&hca_ptr->ib_trans.lock); to.tv_sec = 0; to.tv_usec = 0; /* wakeup and check destroy */ /* block waiting for Rx data */ - if (select(cr->l_socket+1,&rfd,NULL,NULL,&to) == SOCKET_ERROR) { + if (select(cr->socket+1,&rfd,NULL,NULL,&to) == SOCKET_ERROR) { rc = WSAGetLastError(); if ( rc != SOCKET_ERROR /*WSAENOTSOCK*/ ) { dapl_dbg_log (DAPL_DBG_TYPE_ERR/*CM*/, " thread: select(sock %d) ERR %d on cr %p\n", - cr->l_socket, rc, cr); + cr->socket, rc, cr); + } + closesocket(cr->socket); + cr->socket = -1; + } else if (FD_ISSET(cr->socket,&rfd)) { + if (cr->socket > 0) { + if (cr->state == SCM_LISTEN) + dapli_socket_accept(cr); + else if (cr->state == SCM_ACCEPTED) + dapli_socket_accept_rtu(cr); + else if (cr->state == SCM_CONN_PENDING) + dapli_socket_connect_rtu(cr); + else if (cr->state == SCM_CONNECTED) + dapli_socket_disconnect(cr); } - closesocket(cr->l_socket); - cr->l_socket = -1; - } else if (FD_ISSET(cr->l_socket,&rfd) && dapli_socket_accept(cr)) { - closesocket(cr->l_socket); - cr->l_socket = -1; } dapl_os_lock( &hca_ptr->ib_trans.lock ); next_cr = dapl_llist_next_entry((DAPL_LLIST_HEAD*) @@ -1263,9 +1564,19 @@ void cr_thread(void *arg) (DAPL_LLIST_ENTRY*)&cr->entry ); } dapl_os_unlock( &hca_ptr->ib_trans.lock ); + to.tv_sec = 0; to.tv_usec = 100000; /* wakeup and check destroy */ + (void) select(max_fd+1, &rfds, NULL, NULL, &to); + + /* if pipe data consume - used to wake this thread up */ + if (FD_ISSET(g_scm_pipe[0],&rfds)) { + dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread() read pipe data\n"); +printf(" cr_thread() read pipe data\n"); + _read(g_scm_pipe[0], rbuf, 2); +printf(" cr_thread() Finished read pipe data\n"); + } dapl_os_lock( &hca_ptr->ib_trans.lock ); } dapl_os_unlock( &hca_ptr->ib_trans.lock ); diff 
--git a/dapl/ibal-scm/dapl_ibal-scm_util.c b/dapl/ibal-scm/dapl_ibal-scm_util.c index 8e5f8ac..06bc704 100644 --- a/dapl/ibal-scm/dapl_ibal-scm_util.c +++ b/dapl/ibal-scm/dapl_ibal-scm_util.c @@ -52,6 +52,7 @@ static const char rcsid[] = "$Id: $"; #include "dapl.h" #include "dapl_adapter_util.h" #include "dapl_ibal_util.h" +#include "dapl_ibal_name_service.h" #include #include @@ -61,9 +62,12 @@ static const char rcsid[] = "$Id: $"; #include #include #include +#include +extern void cr_thread(void *arg); int g_dapl_loopback_connection = 0; +int g_scm_pipe[2]; #ifdef NOT_USED @@ -132,22 +136,55 @@ DAT_RETURN dapli_init_sock_cm ( IN DAPL_HCA *hca_ptr ) dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " %s(): %p\n",__FUNCTION__,hca_ptr ); - /* set inline max with enviroment or default */ + /* set RC tunables via enviroment or default */ hca_ptr->ib_trans.max_inline_send = - dapl_os_get_env_val ( "DAPL_MAX_INLINE", INLINE_SEND_DEFAULT ); + dapl_os_get_env_val("DAPL_MAX_INLINE", INLINE_SEND_DEFAULT); +#if 0 + hca_ptr->ib_trans.ack_retry = + dapl_os_get_env_val("DAPL_ACK_RETRY", SCM_ACK_RETRY); + hca_ptr->ib_trans.ack_timer = + dapl_os_get_env_val("DAPL_ACK_TIMER", SCM_ACK_TIMER); + hca_ptr->ib_trans.rnr_retry = + dapl_os_get_env_val("DAPL_RNR_RETRY", SCM_RNR_RETRY); + hca_ptr->ib_trans.rnr_timer = + dapl_os_get_env_val("DAPL_RNR_TIMER", SCM_RNR_TIMER); + hca_ptr->ib_trans.global = + dapl_os_get_env_val("DAPL_GLOBAL_ROUTING", SCM_GLOBAL); + hca_ptr->ib_trans.hop_limit = + dapl_os_get_env_val("DAPL_HOP_LIMIT", SCM_HOP_LIMIT); + hca_ptr->ib_trans.tclass = + dapl_os_get_env_val("DAPL_TCLASS", SCM_TCLASS); +#endif /* initialize cr_list lock */ dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); if (dat_status != DAT_SUCCESS) { dapl_dbg_log (DAPL_DBG_TYPE_ERR, - " open_hca: failed to init lock\n"); + "%s() failed to init cr_list lock\n", __FUNCTION__); return DAT_INTERNAL_ERROR; } +#if 0 + /* initialize cq_lock */ + dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); + if 
(dat_status != DAT_SUCCESS) { + dapl_log(DAPL_DBG_TYPE_ERR, + "%s() failed to init cq_lock\n", __FUNCTION__); + return DAT_INTERNAL_ERROR; + } +#endif + /* initialize CM list for listens on this HCA */ dapl_llist_init_head((DAPL_LLIST_HEAD*)&hca_ptr->ib_trans.list); + /* create pipe communication endpoints */ + if (_pipe(g_scm_pipe, 256, O_TEXT)) { + dapl_dbg_log (DAPL_DBG_TYPE_ERR, + "%s() failed to create thread\n", __FUNCTION__); + return DAT_INTERNAL_ERROR; + } + /* create thread to process inbound connect request */ hca_ptr->ib_trans.cr_state = IB_THREAD_INIT; dat_status = dapl_os_thread_create(cr_thread, @@ -199,6 +236,7 @@ DAT_RETURN dapli_close_sock_cm ( IN DAPL_HCA *hca_ptr ) /* destroy cr_thread and lock */ hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; + while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { dapl_dbg_log(DAPL_DBG_TYPE_UTIL, " close_hca: waiting for cr_thread\n"); From arlin.r.davis at intel.com Fri Jan 30 11:15:12 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Fri, 30 Jan 2009 11:15:12 -0800 Subject: [ofa-general] Multiports single HCA uDAPL program problem In-Reply-To: <4982A3D8.5030503@cs.anu.edu.au> References: <20090129200005.20863E61234@openfabrics.org> <4982A3D8.5030503@cs.anu.edu.au> Message-ID: This looks like an ARP issue across your IPoIB interfaces. Please see section 6 of the uDAPL OFED BKM. http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_ofed_testing_bkm.pdf 6. Multi IB port configuration, IPoIB arp reply issues When two interfaces running one interface may reply to an ARP directed to the other interface on the system. The following configuration will cause the interfaces to ignore ARP requests if not specifically for their IP address. 
Add the following lines to /etc/sysctl.conf

net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.ib0.arp_ignore=1
net.ipv4.conf.ib1.arp_ignore=1

or use sysctl:

sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_ignore=1

-arlin

>-----Original Message-----
>From: general-bounces at lists.openfabrics.org
>[mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jie Cai
>Sent: Thursday, January 29, 2009 10:53 PM
>To: general at lists.openfabrics.org
>Subject: [ofa-general] Multiports single HCA uDAPL program problem
>
>Hi All,
>
>I am kind of noob on IB and uDAPL program. Currently, I am trying to
>write a program with multirail that utilizes 2 ports on a
>single Mallenox
>ConnectX HCA on both nodes.
>
>OFED1.3 has been installed on a SUSE 10.3 linux system.
>
>The current problem is that IB connection via uDAPL are very unstable,
>and sometime the connection can't be established.
>Error message is usually like:
>
>20350 Server waiting for connect request on port 45248
> accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2)
>20350 Error dat_cr_accept: DAT_INTERNAL_ERROR
>20350 Error connect_ep: DAT_INTERNAL_ERROR
>
>The status of both port are active:
>hca_id: mlx4_0
> fw_ver: 2.3.000
> node_guid: 0003:ba00:0100:702c
> sys_image_guid: 0003:ba00:0100:702f
> vendor_id: 0x02c9
> vendor_part_id: 25418
> hw_ver: 0xA0
> board_id: SUN0070000001
> phys_port_cnt: 2
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 10
> port_lid: 8
> port_lmc: 0x00
>
> port: 2
> state: PORT_ACTIVE (4)
> max_mtu: 2048 (4)
> active_mtu: 2048 (4)
> sm_lid: 10
> port_lid: 9
> port_lmc: 0x00
>
>
>I haven't done any specific configuration for multi-port. I assume that
>OFED1.3 can do it automatically.
>
>Would please any one help me on this?
>
>Regards,
>Jie
>
>--
>Jie Cai
>
>
>
>_______________________________________________
>general mailing list
>general at lists.openfabrics.org
>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
>To unsubscribe, please visit
>http://openib.org/mailman/listinfo/openib-general
>

From ofedrnicuser at yahoo.com Fri Jan 30 11:16:31 2009
From: ofedrnicuser at yahoo.com (Ofed User)
Date: Fri, 30 Jan 2009 11:16:31 -0800 (PST)
Subject: ***SPAM*** Re: [ofa-general] vendor, device files not created under infiniband_verbs/uverbs0
In-Reply-To: <215301.57111.qm@web111209.mail.gq1.yahoo.com>
Message-ID: <374509.4379.qm@web111212.mail.gq1.yahoo.com>

Looks like the mail got lost as SPAM.

Can anyone please point me which property should be populated to get these files in the uverbs0? I am new to the RDMA stack due to this my user space library is not loaded.

Bill

--- On Fri, 1/30/09, Ofed User wrote:

> From: Ofed User
> Subject: [ofa-general] ***SPAM*** vendor, device files not created under infiniband_verbs/uverbs0
> To: general at lists.openfabrics.org
> Date: Friday, January 30, 2009, 1:03 PM
> Hi,
>
> I have registered a pseudo RNIC device with the stack.
> Stack doesn't create the (a) device/vendor (b)
> device/device files under infinband_verbs/uverbs0 directory.
> It does create rest of the files.
>
> Can someone tell me, which properties should I populate so
> that these files get created properly?
> #ls /sys/class/infiniband_verbs/uverbs0/ shows following
> files.
> abi_version dev ibdev subsystem uevent
>
> Bill
>

From davem at davemloft.net Fri Jan 30 13:51:07 2009
From: davem at davemloft.net (David Miller)
Date: Fri, 30 Jan 2009 13:51:07 -0800 (PST)
Subject: [ofa-general] NetEffect, iw_nes and kernel warning
In-Reply-To:
References: <20090130065721.GA4886@gondor.apana.org.au>
Message-ID: <20090130.135107.116108989.davem@davemloft.net>

From: Roland Dreier
Date: Fri, 30 Jan 2009 09:35:52 -0800

> > > OK, thanks...
what confused me is that several other drivers also do > > > skb_linearize() in their hard_start_xmit method... eg bnx2x, > > > via-velocity, mv643xx_eth. So there are several other lurking bugs to > > > deal with here I guess. > > > > I don't know about the rest but bnx2x is certainly OK since it > > only does so with IRQ enabled. It is legal to call skb_linearize > > as long as you're sure that IRQs are enabled, which is always the > > case for hard_start_xmit upon entry. > > I don't believe this is accurate. Calling skb_linearize() (on a kernel > with CONFIG_HIGHMEM set) can end up calling local_bh_enable() in > kunmap_skb_frag(), which can obviously cause problems if the initial > context relies on having BHs disabled (as hard_start_xmit does). local_bh_{enable,disable}() nests, so this is not a problem From arlin.r.davis at intel.com Fri Jan 30 14:34:38 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Fri, 30 Jan 2009 14:34:38 -0800 Subject: [ofa-general] RE: [ofw] [PATCH 5/5] [DAPL] dapl/ibal-scm: update ibal-scm provider In-Reply-To: References: Message-ID: Applied all 5 patches. Thanks. -arlin >-----Original Message----- >From: ofw-bounces at lists.openfabrics.org >[mailto:ofw-bounces at lists.openfabrics.org] On Behalf Of Sean Hefty >Sent: Friday, January 30, 2009 10:59 AM >To: Hefty, Sean; OpenIB; ofw at lists.openfabrics.org >Subject: [ofw] [PATCH 5/5] [DAPL] dapl/ibal-scm: update >ibal-scm provider > >From: Stan Smith > >Update the dapl.git tree with the latest SVN version of the >ibal-scm provider. > >Signed-off-by: Sean Hefty >--- >Actual codng changes were made by Stan. I'm just submitting the patch >to update the DAPL git repository. 
> > dapl/ibal-scm/dapl_ibal-scm_cm.c | 775 >+++++++++++++++++++++++++----------- > dapl/ibal-scm/dapl_ibal-scm_util.c | 44 ++ > 2 files changed, 584 insertions(+), 235 deletions(-) > >diff --git a/dapl/ibal-scm/dapl_ibal-scm_cm.c >b/dapl/ibal-scm/dapl_ibal-scm_cm.c >index df83008..6a050b8 100644 >--- a/dapl/ibal-scm/dapl_ibal-scm_cm.c >+++ b/dapl/ibal-scm/dapl_ibal-scm_cm.c >@@ -63,6 +63,88 @@ > #include > #include > >+extern int g_scm_pipe[2]; >+ >+extern DAT_RETURN >+dapls_ib_query_gid( IN DAPL_HCA *hca_ptr, >+ IN GID *gid ); >+ >+ >+static struct ib_cm_handle * dapli_cm_create(void) >+{ >+ struct ib_cm_handle *cm_ptr; >+ >+ /* Allocate CM, init lock, and initialize */ >+ if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL) >+ return NULL; >+ >+ if (dapl_os_lock_init(&cm_ptr->lock)) >+ goto bail; >+ >+ (void)dapl_os_memzero(cm_ptr, sizeof(*cm_ptr)); >+ cm_ptr->dst.ver = htons(DSCM_VER); >+ cm_ptr->socket = -1; >+ cm_ptr->l_socket = -1; >+ return cm_ptr; >+bail: >+ dapl_os_free(cm_ptr, sizeof(*cm_ptr)); >+ return NULL; >+} >+ >+ >+/* mark for destroy, remove all references, schedule cleanup */ >+ >+static void dapli_cm_destroy(struct ib_cm_handle *cm_ptr) >+{ >+ dapl_dbg_log(DAPL_DBG_TYPE_CM, >+ " cm_destroy: cm %p ep %p\n", cm_ptr,cm_ptr->ep); >+ >+ /* cleanup, never made it to work queue */ >+ if (cm_ptr->state == SCM_INIT) { >+ if (cm_ptr->socket >= 0) >+ closesocket(cm_ptr->socket); >+ if (cm_ptr->l_socket >= 0) >+ closesocket(cm_ptr->l_socket); >+ dapl_os_free(cm_ptr, sizeof(*cm_ptr)); >+ return; >+ } >+ >+ dapl_os_lock(&cm_ptr->lock); >+ cm_ptr->state = SCM_DESTROY; >+ if (cm_ptr->ep) >+ cm_ptr->ep->cm_handle = IB_INVALID_HANDLE; >+ >+ /* close socket if still active */ >+ if (cm_ptr->socket >= 0) { >+ closesocket(cm_ptr->socket); >+ cm_ptr->socket = -1; >+ } >+ if (cm_ptr->l_socket >= 0) { >+ closesocket(cm_ptr->l_socket); >+ cm_ptr->l_socket = -1; >+ } >+ dapl_os_unlock(&cm_ptr->lock); >+ >+ /* wakeup work thread */ >+ _write(g_scm_pipe[1], "w", sizeof 
"w"); >+} >+ >+ >+/* queue socket for processing CM work */ >+static void dapli_cm_queue(struct ib_cm_handle *cm_ptr) >+{ >+ /* add to work queue for cr thread processing */ >+ dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); >+ dapl_os_lock( &cm_ptr->hca->ib_trans.lock ); >+ >dapl_llist_add_tail((DAPL_LLIST_HEAD*)&cm_ptr->hca->ib_trans.list, >+ (DAPL_LLIST_ENTRY*)&cm_ptr->entry, >(void*)cm_ptr); >+ dapl_os_unlock(&cm_ptr->hca->ib_trans.lock); >+ >+ /* wakeup CM work thread */ >+ _write(g_scm_pipe[1], "w", sizeof "w"); >+} >+ >+ > > static uint16_t > dapli_get_lid(IN DAPL_HCA *hca, IN int port) >@@ -123,6 +205,263 @@ dapli_get_lid(IN DAPL_HCA *hca, IN int port) > > > /* >+ * ACTIVE/PASSIVE: called from CR thread or consumer via ep_disconnect >+ */ >+static DAT_RETURN >+dapli_socket_disconnect(dp_ib_cm_handle_t cm_ptr) >+{ >+ DAPL_EP *ep_ptr = cm_ptr->ep; >+ DAT_UINT32 disc_data = htonl(0xdead); >+ >+ if (ep_ptr == NULL) >+ return DAT_SUCCESS; >+ >+ dapl_os_lock(&cm_ptr->lock); >+ if ((cm_ptr->state == SCM_INIT) || >+ (cm_ptr->state == SCM_DISCONNECTED)) { >+ dapl_os_unlock(&cm_ptr->lock); >+ return DAT_SUCCESS; >+ } else { >+ /* send disc date, close socket, schedule destroy */ >+ if (cm_ptr->socket >= 0) { >+ send(cm_ptr->socket, (const char *)&disc_data, >+ sizeof(disc_data), 0); >+ closesocket(cm_ptr->socket); >+ cm_ptr->socket = -1; >+ } >+ cm_ptr->state = SCM_DISCONNECTED; >+ _write(g_scm_pipe[1], "w", sizeof "w"); >+ } >+ dapl_os_unlock(&cm_ptr->lock); >+ >+ >+ if (ep_ptr->cr_ptr) { >+ dapls_cr_callback(cm_ptr, >+ IB_CME_DISCONNECTED, >+ NULL, >+ ((DAPL_CR *)ep_ptr->cr_ptr)->sp_ptr); >+ } else { >+ dapl_evd_connection_callback(ep_ptr->cm_handle, >+ IB_CME_DISCONNECTED, >+ NULL, >+ ep_ptr); >+ } >+ >+ /* remove reference from endpoint */ >+ ep_ptr->cm_handle = NULL; >+ >+ /* schedule destroy */ >+ >+ >+ return DAT_SUCCESS; >+} >+ >+ >+ >+/* >+ * PASSIVE: consumer accept, send local QP information, private data, >+ * queue on work thread to receive RTU 
information to avoid blocking >+ * user thread. >+ */ >+static DAT_RETURN >+dapli_socket_accept_usr( DAPL_EP *ep_ptr, >+ DAPL_CR *cr_ptr, >+ DAT_COUNT p_size, >+ DAT_PVOID p_data ) >+{ >+ DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; >+ dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; >+ WSABUF iovec[2]; >+ int len, rc; >+ short rtu_data = 0; >+ ib_api_status_t ibs; >+ ib_qp_attr_t qpa; >+ dapl_ibal_port_t *p_port; >+ dapl_ibal_ca_t *p_ca; >+ >+ dapl_dbg_log (DAPL_DBG_TYPE_EP, "%s() p_sz %d sock %d >port 0x%x\n", >+ __FUNCTION__,p_size,cm_ptr->socket, >+ ia_ptr->hca_ptr->port_num); >+ >+ if (p_size > IB_MAX_REP_PDATA_SIZE) >+ return DAT_LENGTH_ERROR; >+ >+ /* must have a accepted socket */ >+ if ( cm_ptr->socket < 0 ) { >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, >+ "%s() Not accepted socket? remote >port=0x%x lid=0x%x" >+ " qpn=0x%x psize=%d\n", >+ cm_ptr->dst.port, cm_ptr->dst.lid, >+ ntohs(cm_ptr->dst.qpn), cm_ptr->dst.p_size); >+ return DAT_INTERNAL_ERROR; >+ } >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, >+ " accept_usr: remote port=0x%x lid=0x%x" >+ " qpn=0x%x psize=%d\n", >+ cm_ptr->dst.port, cm_ptr->dst.lid, >+ ntohs(cm_ptr->dst.qpn), cm_ptr->dst.p_size); >+ >+ /* modify QP to RTR and then to RTS with remote info >already read */ >+ >+ p_ca = (dapl_ibal_ca_t *) ia_ptr->hca_ptr->ib_hca_handle; >+ p_port = dapli_ibal_get_port (p_ca, >(uint8_t)ia_ptr->hca_ptr->port_num); >+ if (!p_port) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "%s() dapli_ibal_get_port() failed @ >line #%d\n", >+ __FUNCTION__,__LINE__); >+ goto bail; >+ } >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, >+ "%s() DST: qpn 0x%x port 0x%x lid %x >psize %d\n", >+ __FUNCTION__, >+ cl_ntoh32(cm_ptr->dst.qpn), >+ cm_ptr->dst.port, >+ cl_ntoh16(cm_ptr->dst.lid), >cm_ptr->dst.p_size); >+ >+ /* modify QP to RTR and then to RTS with remote info */ >+ >+ ibs = dapls_modify_qp_state_to_rtr( ep_ptr->qp_handle, >+ cm_ptr->dst.qpn, >+ cm_ptr->dst.lid, >+ p_port ); >+ if (ibs != IB_SUCCESS) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "%s() 
QP --> RTR failed @ line #%d\n", >+ __FUNCTION__,__LINE__); >+ goto bail; >+ } >+ >+ if ( dapls_modify_qp_state_to_rts( ep_ptr->qp_handle ) ) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "%s() QP --> RTS failed @ line #%d\n", >+ __FUNCTION__,__LINE__); >+ goto bail; >+ } >+ >+ ep_ptr->qp_state = IB_QP_STATE_RTS; >+ >+ /* save remote address information */ >+ dapl_os_memcpy( &ep_ptr->remote_ia_address, >+ &cm_ptr->dst.ia_address, >+ sizeof(ep_ptr->remote_ia_address)); >+ >+ /* determine QP & port numbers */ >+ ibs = ib_query_qp(ep_ptr->qp_handle, &qpa); >+ if (ibs != IB_SUCCESS) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ " ib_query_qp() ERR %s\n", >ib_get_err_str(ibs)); >+ goto bail; >+ } >+ >+ /* Send our QP info, IA address, and private data */ >+ cm_ptr->dst.qpn = qpa.num; /* ib_net32_t */ >+ cm_ptr->dst.port = ia_ptr->hca_ptr->port_num; >+ cm_ptr->dst.lid = dapli_get_lid(ia_ptr->hca_ptr, >ia_ptr->hca_ptr->port_num); >+ /* set gid in network order */ >+ ibs = dapls_ib_query_gid( ia_ptr->hca_ptr, &cm_ptr->dst.gid ); >+ if ( ibs != IB_SUCCESS ) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "%s() dapls_ib_query_gid() returns '%s'\n", >+ __FUNCTION__,ib_get_err_str(ibs)); >+ goto bail; >+ } >+ >+ cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; >+ cm_ptr->dst.p_size = p_size; >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_CM, >+ "%s()\n Tx QP info: qpn %x port 0x%x lid 0x%x >p_sz %d IP %s\n", >+ __FUNCTION__, cl_ntoh32(cm_ptr->dst.qpn), >cm_ptr->dst.port, >+ cl_ntoh16(cm_ptr->dst.lid), cm_ptr->dst.p_size, >+ dapli_get_ip_addr_str(&cm_ptr->dst.ia_address,NULL)); >+ >+ /* network byte-ordering - QPN & LID already are */ >+ cm_ptr->dst.p_size = cl_hton32(cm_ptr->dst.p_size); >+ cm_ptr->dst.port = cl_hton16(cm_ptr->dst.port); >+ >+ iovec[0].buf = (char*)&cm_ptr->dst; >+ iovec[0].len = sizeof(ib_qp_cm_t); >+ if (p_size) { >+ iovec[1].buf = p_data; >+ iovec[1].len = p_size; >+ } >+ rc = WSASend( cm_ptr->socket, iovec, (p_size ? 
2:1), >&len, 0, 0, 0 ); >+ if (rc || len != (p_size + sizeof(ib_qp_cm_t))) { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ " accept_usr: ERR %d, wcnt=%d\n", >+ WSAGetLastError(), len); >+ goto bail; >+ } >+ dapl_dbg_log(DAPL_DBG_TYPE_CM, >+ " accept_usr: local port=0x%x lid=0x%x" >+ " qpn=0x%x psize=%d\n", >+ ntohs(cm_ptr->dst.port), ntohs(cm_ptr->dst.lid), >+ ntohl(cm_ptr->dst.qpn), >ntohl(cm_ptr->dst.p_size)); >+ >+ /* save state and reference to EP, queue for RTU data */ >+ cm_ptr->ep = ep_ptr; >+ cm_ptr->hca = ia_ptr->hca_ptr; >+ cm_ptr->state = SCM_ACCEPTED; >+ >+ /* restore remote address information for query */ >+ dapl_os_memcpy( &cm_ptr->dst.ia_address, >+ &ep_ptr->remote_ia_address, >+ sizeof(cm_ptr->dst.ia_address)); >+ >+ dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: accepted!\n" ); >+ dapli_cm_queue(cm_ptr); >+ >+ return DAT_SUCCESS; >+ >+bail: >+ dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_usr: ERR >!QP_RTR_RTS \n"); >+ dapli_cm_destroy(cm_ptr); >+ dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ >+ >+ return DAT_INTERNAL_ERROR; >+} >+ >+ >+/* >+ * PASSIVE: read RTU from active peer, post CONN event >+ */ >+void >+dapli_socket_accept_rtu(dp_ib_cm_handle_t cm_ptr) >+{ >+ int len; >+ short rtu_data = 0; >+ >+ /* complete handshake after final QP state change */ >+ len = recv(cm_ptr->socket, (char*)&rtu_data, >sizeof(rtu_data), 0); >+ if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ " accept_rtu: ERR %d, rcnt=%d rdata=%x\n", >+ WSAGetLastError(), len, ntohs(rtu_data) ); >+ goto bail; >+ } >+ >+ /* save state and reference to EP, queue for disc event */ >+ cm_ptr->state = SCM_CONNECTED; >+ >+ /* final data exchange if remote QP state is good to go */ >+ dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); >+ dapls_cr_callback(cm_ptr, IB_CME_CONNECTED, NULL, cm_ptr->sp); >+ return; >+bail: >+ dapls_ib_reinit_ep(cm_ptr->ep); /* reset QP state */ >+ dapli_cm_destroy(cm_ptr); >+ dapls_cr_callback(cm_ptr, 
IB_CME_DESTINATION_REJECT, >NULL, cm_ptr->sp); >+} >+ >+ >+/* > * ACTIVE: Create socket, connect, and exchange QP information > */ > static DAT_RETURN >@@ -143,21 +482,16 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > dapl_ibal_port_t *p_port; > dapl_ibal_ca_t *p_ca; > >- dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual >%d\n", r_qual); >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d >psize %d\n", >+ r_qual, p_size); > >- /* >- * Allocate CM and initialize >- */ >- if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL ) { >+ cm_ptr = dapli_cm_create(); >+ if (cm_ptr == NULL) > return DAT_INSUFFICIENT_RESOURCES; >- } >- >- (void) dapl_os_memzero( cm_ptr, sizeof(*cm_ptr) ); >- cm_ptr->socket = -1; > > /* create, connect, sockopt, and exchange QP information */ > if ((cm_ptr->socket = socket(AF_INET,SOCK_STREAM,0)) < 0 ) { >- dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); >+ dapli_cm_destroy(cm_ptr); > return DAT_INSUFFICIENT_RESOURCES; > } > >@@ -166,7 +500,7 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > if (connect(cm_ptr->socket, r_addr, sizeof(*r_addr)) >== SOCKET_ERROR) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, " connect: %d >on r_qual %d\n", > WSAGetLastError(), (unsigned int)r_qual); >- dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); >+ dapli_cm_destroy(cm_ptr); > return DAT_INVALID_ADDRESS; > } > >@@ -175,6 +509,8 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > (const char*)&opt, > sizeof(opt) ); > >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, " socket connected!\n"); >+ > /* determine QP & port numbers */ > ibs = ib_query_qp(ep_ptr->qp_handle, &qpa); > if (ibs != IB_SUCCESS) >@@ -187,7 +523,6 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > /* Send QP info, IA address and private data */ > cm_ptr->dst.qpn = qpa.num; /* ib_net32_t */ > cm_ptr->dst.port = cl_hton16(ia_ptr->hca_ptr->port_num); >- > cm_ptr->dst.lid = dapli_get_lid( ia_ptr->hca_ptr, > ia_ptr->hca_ptr->port_num ); > if (cm_ptr->dst.lid == 0) >@@ -197,6 +532,17 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > __FUNCTION__, 
__LINE__); > goto bail; > } >+ >+ /* set gid in network order */ >+ ibs = dapls_ib_query_gid( ia_ptr->hca_ptr, &cm_ptr->dst.gid ); >+ if ( ibs != IB_SUCCESS ) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "%s() dapls_ib_query_gid() returns '%s'\n", >+ __FUNCTION__,ib_get_err_str(ibs)); >+ goto bail; >+ } >+ > cm_ptr->dst.ia_address = ia_ptr->hca_ptr->hca_address; > cm_ptr->dst.p_size = cl_hton32(p_size); > >@@ -213,6 +559,8 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > iovec[1].buf = p_data; > iovec[1].len = p_size; > } >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, >write QP and private data\n"); > rc = WSASend (cm_ptr->socket, iovec, (p_size ? 2:1), >&len, 0, 0, NULL); > if ( rc || len != (p_size + sizeof(ib_qp_cm_t)) ) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, >@@ -225,17 +573,65 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > cm_ptr->dst.port, cm_ptr->dst.lid, > cm_ptr->dst.qpn, cm_ptr->dst.p_size ); > >+ /* queue up to work thread to avoid blocking consumer */ >+ cm_ptr->state = SCM_CONN_PENDING; >+ cm_ptr->hca = ia_ptr->hca_ptr; >+ cm_ptr->ep = ep_ptr; >+ dapli_cm_queue(cm_ptr); >+ return DAT_SUCCESS; >+bail: >+ /* close socket, free cm structure */ >+ dapli_cm_destroy(cm_ptr); >+ return DAT_INTERNAL_ERROR; >+} >+ >+ >+/* >+ * ACTIVE: exchange QP information, called from CR thread >+ */ >+void >+dapli_socket_connect_rtu(dp_ib_cm_handle_t cm_ptr) >+{ >+ DAPL_EP *ep_ptr = cm_ptr->ep; >+ DAPL_IA *ia_ptr = cm_ptr->ep->header.owner_ia; >+ int len, rc; >+ DWORD ioflags; >+ WSABUF iovec[1]; >+ short rtu_data = htons(0x0E0F); >+ ib_cm_events_t event = IB_CME_DESTINATION_REJECT; >+ ib_api_status_t ibs; >+ dapl_ibal_port_t *p_port; >+ dapl_ibal_ca_t *p_ca; >+ > /* read DST information into cm_ptr, overwrite SRC info */ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: recv peer >QP data\n"); >+ >+ iovec[0].buf = (char*)&cm_ptr->dst; >+ iovec[0].len = sizeof(ib_qp_cm_t); > ioflags = len = 0; > rc = WSARecv (cm_ptr->socket, iovec, 1, &len, &ioflags, 0, 0); >- if ( rc == 
SOCKET_ERROR || len != sizeof(ib_qp_cm_t) ) { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR,"connect read: >ERR %d rcnt=%d\n", >- WSAGetLastError(), len); >+ if ( rc == SOCKET_ERROR || len != sizeof(ib_qp_cm_t) || >+ ntohs(cm_ptr->dst.ver) >!= DSCM_VER ) >+ { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "connect_rtu read: ERR %d rcnt=%d >ver=%d\n", >+ WSAGetLastError(), len, cm_ptr->dst.ver); >+ goto bail; >+ } >+ >+ /* check for consumer reject */ >+ if (cm_ptr->dst.rej) { >+ dapl_dbg_log(DAPL_DBG_TYPE_CM, >+ " connect_rtu read: PEER REJ >reason=0x%x\n", >+ ntohs(cm_ptr->dst.rej)); >+ event = IB_CME_DESTINATION_REJECT_PRIVATE_DATA; > goto bail; > } > >- /* revert back to host byte ordering */ >+ /* convert peer response values to host order */ > cm_ptr->dst.port = cl_ntoh16(cm_ptr->dst.port); >+ cm_ptr->dst.lid = ntohs(cm_ptr->dst.lid); >+ cm_ptr->dst.qpn = cm_ptr->dst.qpn; > cm_ptr->dst.p_size = cl_ntoh32(cm_ptr->dst.p_size); > > dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: Rx DST: qpn >%x port %d " >@@ -245,15 +641,27 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > cm_ptr->dst.p_size, > >dapli_get_ip_addr_str(&cm_ptr->dst.ia_address,NULL)); > >+ /* save remote address information */ >+ dapl_os_memcpy( &ep_ptr->remote_ia_address, >+ &cm_ptr->dst.ia_address, >+ sizeof(ep_ptr->remote_ia_address)); >+ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, >+ " connect_rtu: DST %s port=0x%x lid=0x%x, >qpn=0x%x, psize=%d\n", >+ inet_ntoa(((struct sockaddr_in >*)&cm_ptr->dst.ia_address)->sin_addr), >+ cm_ptr->dst.port, cm_ptr->dst.lid, >+ cm_ptr->dst.qpn, cm_ptr->dst.p_size); >+ > /* validate private data size before reading */ >- if ( cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE ) { >+ if (cm_ptr->dst.p_size > IB_MAX_REP_PDATA_SIZE) { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- " connect read: psize (%d) wrong\n", >+ " connect_rtu read: psize (%d) wrong\n", > cm_ptr->dst.p_size ); > goto bail; > } > > /* read private data into cm_handle if any present */ >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," socket connected, read >private 
data\n"); > if ( cm_ptr->dst.p_size ) { > iovec[0].buf = cm_ptr->p_data; > iovec[0].len = cm_ptr->dst.p_size; >@@ -300,32 +708,29 @@ dapli_socket_connect ( DAPL_EP > *ep_ptr, > > ep_ptr->qp_state = IB_QP_STATE_RTS; > >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," connect_rtu: send RTU\n"); >+ > /* complete handshake after final QP state change */ > send(cm_ptr->socket, (const char *)&rtu_data, >sizeof(rtu_data), 0); > > /* init cm_handle and post the event with private data */ > ep_ptr->cm_handle = cm_ptr; >+ cm_ptr->state = SCM_CONNECTED; > dapl_dbg_log( DAPL_DBG_TYPE_EP," ACTIVE: connected!\n" ); > > dapl_evd_connection_callback( ep_ptr->cm_handle, > IB_CME_CONNECTED, > cm_ptr->p_data, > ep_ptr ); >- return DAT_SUCCESS; >- >+ return; > bail: > /* close socket, free cm structure and post error event */ >- if ( cm_ptr->socket >= 0 ) >- closesocket(cm_ptr->socket); >- >- dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); >- dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ >- >- dapl_evd_connection_callback( ep_ptr->cm_handle, >- IB_CME_LOCAL_FAILURE, >+ dapli_cm_destroy(cm_ptr); >+ dapls_ib_reinit_ep(ep_ptr); /* reset QP state */ >+ dapl_evd_connection_callback( NULL /*ep_ptr->cm_handle*/, >+ event, > NULL, > ep_ptr ); >- return DAT_INTERNAL_ERROR; > } > > >@@ -347,14 +752,12 @@ dapli_socket_listen ( DAPL_IA > *ia_ptr, > ia_ptr, serviceID, sp_ptr); > > /* Allocate CM and initialize */ >- if ((cm_ptr = dapl_os_alloc(sizeof(*cm_ptr))) == NULL) >+ cm_ptr = dapli_cm_create(); >+ if (cm_ptr == NULL) > return DAT_INSUFFICIENT_RESOURCES; > >- (void) dapl_os_memzero( cm_ptr, sizeof( *cm_ptr ) ); >- >- cm_ptr->socket = cm_ptr->l_socket = -1; > cm_ptr->sp = sp_ptr; >- cm_ptr->hca_ptr = ia_ptr->hca_ptr; >+ cm_ptr->hca = ia_ptr->hca_ptr; > > /* bind, listen, set sockopt, accept, exchange data */ > if ((cm_ptr->l_socket = socket(AF_INET, SOCK_STREAM, 0)) < 0) { >@@ -406,12 +809,9 @@ dapli_socket_listen ( DAPL_IA > *ia_ptr, > /* set cm_handle for this service point, save listen socket */ > 
sp_ptr->cm_srvc_handle = cm_ptr; > >- /* add to SP->CR thread list */ >- dapl_llist_init_entry((DAPL_LLIST_ENTRY*)&cm_ptr->entry); >- dapl_os_lock( &cm_ptr->hca_ptr->ib_trans.lock ); >- >dapl_llist_add_tail((DAPL_LLIST_HEAD*)&cm_ptr->hca_ptr->ib_trans.list, >- (DAPL_LLIST_ENTRY*)&cm_ptr->entry, >(void*)cm_ptr); >- dapl_os_unlock(&cm_ptr->hca_ptr->ib_trans.lock); >+ /* queue up listen socket to process inbound CR's */ >+ cm_ptr->state = SCM_LISTEN; >+ dapli_cm_queue(cm_ptr); > > dapl_dbg_log( DAPL_DBG_TYPE_CM, > " listen: qual 0x%x cr %p s_fd %d\n", >@@ -421,10 +821,7 @@ dapli_socket_listen ( DAPL_IA > *ia_ptr, > bail: > dapl_dbg_log( DAPL_DBG_TYPE_CM, > " listen: ERROR on conn_qual >0x%x\n",serviceID); >- if ( cm_ptr->l_socket >= 0 ) >- closesocket( cm_ptr->l_socket ); >- >- dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); >+ dapli_cm_destroy(cm_ptr); > return dat_status; > } > >@@ -441,6 +838,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > int len; > DAT_RETURN dat_status = DAT_SUCCESS; > >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," socket_accept\n"); >+ > /* Allocate accept CM and initialize */ > if ((acm_ptr = dapl_os_alloc(sizeof(*acm_ptr))) == NULL) > return DAT_INSUFFICIENT_RESOURCES; >@@ -448,8 +847,9 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > (void) dapl_os_memzero( acm_ptr, sizeof( *acm_ptr ) ); > > acm_ptr->socket = -1; >+ acm_ptr->l_socket = -1; > acm_ptr->sp = cm_ptr->sp; >- acm_ptr->hca_ptr = cm_ptr->hca_ptr; >+ acm_ptr->hca = cm_ptr->hca; > > len = sizeof(acm_ptr->dst.ia_address); > acm_ptr->socket = accept(cm_ptr->l_socket, >@@ -464,27 +864,32 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > goto bail; > } > >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read >QP data\n"); >+ > /* read in DST QP info, IA address. 
check for private data */ > len = >recv(acm_ptr->socket,(char*)&acm_ptr->dst,sizeof(ib_qp_cm_t),0); >- if ( len != sizeof(ib_qp_cm_t) ) { >+ if ( len != sizeof(ib_qp_cm_t) || >ntohs(acm_ptr->dst.ver) != DSCM_VER ) >+ { > dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- " accept read: ERR %d, rcnt=%d\n", >- WSAGetLastError(), len); >+ " accept read: ERR %d, rcnt=%d ver=%d\n", >+ WSAGetLastError(), len, acm_ptr->dst.ver); > dat_status = DAT_INTERNAL_ERROR; > goto bail; > > } >- /* revert back to host byte ordering */ >+ /* convert accepted values to host byte ordering */ > acm_ptr->dst.port = cl_ntoh16(acm_ptr->dst.port); >+ acm_ptr->dst.lid = ntohs(acm_ptr->dst.lid); >+ acm_ptr->dst.qpn = acm_ptr->dst.qpn; > acm_ptr->dst.p_size = cl_ntoh32(acm_ptr->dst.p_size); > >- dapl_dbg_log(DAPL_DBG_TYPE_EP, " accept: DST >sizeof(ib_cm_t) %d qpn %x " >- "port %d lid 0x%x psize %d IP %s\n", >- sizeof(ib_qp_cm_t), >- cl_ntoh32(acm_ptr->dst.qpn), acm_ptr->dst.port, >+ dapl_dbg_log(DAPL_DBG_TYPE_EP, " accept: DST %s port 0x%x " >+ "lid 0x%x qpn 0x%x psize %d\n", >+ dapli_get_ip_addr_str(&acm_ptr->dst.ia_address,NULL), >+ acm_ptr->dst.port, > cl_ntoh16(acm_ptr->dst.lid), >- acm_ptr->dst.p_size, >- dapli_get_ip_addr_str(&acm_ptr->dst.ia_address,NULL)); >+ cl_ntoh32(acm_ptr->dst.qpn), >+ acm_ptr->dst.p_size); > > /* validate private data size before reading */ > if ( acm_ptr->dst.p_size > IB_MAX_REQ_PDATA_SIZE ) { >@@ -495,6 +900,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > goto bail; > } > >+ dapl_dbg_log(DAPL_DBG_TYPE_EP," socket accepted, read >private data\n"); >+ > /* read private data into cm_handle if any present */ > if ( acm_ptr->dst.p_size ) { > len = recv( acm_ptr->socket, >@@ -514,6 +921,8 @@ dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > p_data = acm_ptr->p_data; > } > >+ acm_ptr->state = SCM_ACCEPTING; >+ > /* trigger CR event and return SUCCESS */ > dapls_cr_callback( acm_ptr, > IB_CME_CONNECTION_REQUEST_PENDING, >@@ -521,153 +930,8 @@ 
dapli_socket_accept(ib_cm_srvc_handle_t cm_ptr) > acm_ptr->sp ); > > return DAT_SUCCESS; >- >-bail: >- if ( acm_ptr->socket >= 0 ) >- closesocket( acm_ptr->socket ); >- >- dapl_os_free( acm_ptr, sizeof( *acm_ptr ) ); >- return DAT_INTERNAL_ERROR; >-} >- >- >-static DAT_RETURN >-dapli_socket_accept_final( DAPL_EP *ep_ptr, >- DAPL_CR *cr_ptr, >- DAT_COUNT p_size, >- DAT_PVOID p_data ) >-{ >- DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; >- dp_ib_cm_handle_t cm_ptr = cr_ptr->ib_cm_handle; >- ib_qp_cm_t qp_cm; >- WSABUF iovec[2]; >- int len, rc; >- short rtu_data = 0; >- ib_api_status_t ibs; >- ib_qp_attr_t qpa; >- dapl_ibal_port_t *p_port; >- dapl_ibal_ca_t *p_ca; >- >- dapl_dbg_log (DAPL_DBG_TYPE_EP, "%s() p_sz %d sock %d >port %d\n", >- __FUNCTION__,p_size,cm_ptr->socket, >- ia_ptr->hca_ptr->port_num); >- >- if (p_size > IB_MAX_REP_PDATA_SIZE) >- return DAT_LENGTH_ERROR; >- >- /* must have a accepted socket */ >- if ( cm_ptr->socket < 0 ) >- return DAT_INTERNAL_ERROR; >- >- /* modify QP to RTR and then to RTS with remote info >already read */ >- >- p_ca = (dapl_ibal_ca_t *) ia_ptr->hca_ptr->ib_hca_handle; >- p_port = dapli_ibal_get_port (p_ca, >(uint8_t)ia_ptr->hca_ptr->port_num); >- if (!p_port) >- { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- "%s() dapli_ibal_get_port() failed @ >line #%d\n", >- __FUNCTION__,__LINE__); >- goto bail; >- } >- >- dapl_dbg_log(DAPL_DBG_TYPE_EP, "%s() DST: qpn %x port >%d lid %x\n", >- __FUNCTION__, >- cl_ntoh32(cm_ptr->dst.qpn), >- cm_ptr->dst.port, >- cl_ntoh16(cm_ptr->dst.lid)); >- >- /* modify QP to RTR and then to RTS with remote info */ >- >- ibs = dapls_modify_qp_state_to_rtr( ep_ptr->qp_handle, >- cm_ptr->dst.qpn, >- cm_ptr->dst.lid, >- p_port ); >- if (ibs != IB_SUCCESS) >- { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- "%s() QP --> RTR failed @ line #%d\n", >- __FUNCTION__,__LINE__); >- goto bail; >- } >- >- if ( dapls_modify_qp_state_to_rts( ep_ptr->qp_handle ) ) >- { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- "%s() QP --> RTS failed @ line 
#%d\n", >- __FUNCTION__,__LINE__); >- goto bail; >- } >- >- ep_ptr->qp_state = IB_QP_STATE_RTS; >- >- /* determine QP & port numbers */ >- ibs = ib_query_qp(ep_ptr->qp_handle, &qpa); >- if (ibs != IB_SUCCESS) >- { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- " ib_query_qp() ERR %s\n", >ib_get_err_str(ibs)); >- goto bail; >- } >- >- /* Send QP info, IA address, and private data */ >- qp_cm.qpn = qpa.num; /* ib_net32_t */ >- qp_cm.port = ia_ptr->hca_ptr->port_num; >- qp_cm.lid = dapli_get_lid( ia_ptr->hca_ptr, >ia_ptr->hca_ptr->port_num ); >- qp_cm.ia_address = ia_ptr->hca_ptr->hca_address; >- qp_cm.p_size = p_size; >- >- dapl_dbg_log(DAPL_DBG_TYPE_CM, >- "%s()\n Tx QP info: qpn %x port %d lid 0x%x >p_sz %d IP %s\n", >- __FUNCTION__, cl_ntoh32(qp_cm.qpn), qp_cm.port, >- cl_ntoh16(qp_cm.lid), qp_cm.p_size, >- dapli_get_ip_addr_str(&qp_cm.ia_address,NULL)); >- >- /* network byte-ordering - QPN & LID already are */ >- qp_cm.p_size = cl_hton32(qp_cm.p_size); >- qp_cm.port = cl_hton16(qp_cm.port); >- >- iovec[0].buf = (char*)&qp_cm; >- iovec[0].len = sizeof(ib_qp_cm_t); >- if (p_size) { >- iovec[1].buf = p_data; >- iovec[1].len = p_size; >- } >- rc = WSASend( cm_ptr->socket, iovec, (p_size ? 
2:1), >&len, 0, 0, 0 ); >- if (rc || len != (p_size + sizeof(ib_qp_cm_t))) { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- " accept_final: ERR %d, wcnt=%d\n", >- WSAGetLastError(), len); >- goto bail; >- } >- dapl_dbg_log(DAPL_DBG_TYPE_EP, >- " accept_final: SRC qpn %x port %d lid >0x%x psize %d\n", >- qp_cm.qpn, qp_cm.port, qp_cm.lid, qp_cm.p_size ); >- >- /* complete handshake after final QP state change */ >- len = recv(cm_ptr->socket, (char*)&rtu_data, >sizeof(rtu_data), 0); >- if ( len != sizeof(rtu_data) || ntohs(rtu_data) != 0x0e0f ) { >- dapl_dbg_log(DAPL_DBG_TYPE_ERR, >- " accept_final: ERR %d, rcnt=%d >rdata=%x\n", >- WSAGetLastError(), len, ntohs(rtu_data) ); >- goto bail; >- } >- >- /* final data exchange if remote QP state is good to go */ >- dapl_dbg_log( DAPL_DBG_TYPE_EP," PASSIVE: connected!\n" ); >- >- dapls_cr_callback ( cm_ptr, IB_CME_CONNECTED, NULL, >cm_ptr->sp ); >- >- return DAT_SUCCESS; >- > bail: >- dapl_dbg_log( DAPL_DBG_TYPE_ERR," accept_final: ERR >!QP_RTR_RTS \n"); >- if ( cm_ptr->socket >= 0 ) >- closesocket( cm_ptr->socket ); >- >- dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); >- dapls_ib_reinit_ep( ep_ptr ); /* reset QP state */ >- >+ dapli_cm_destroy(acm_ptr); > return DAT_INTERNAL_ERROR; > } > >@@ -747,11 +1011,7 @@ dapls_ib_disconnect ( > dapl_dbg_log (DAPL_DBG_TYPE_EP, > "dapls_ib_disconnect(ep_handle %p >....)\n", ep_ptr); > >- if ( cm_ptr->socket >= 0 ) { >- closesocket( cm_ptr->socket ); >- cm_ptr->socket = -1; >- } >- >+#if 0 // XXX > /* disconnect QP ala transition to RESET state */ > ib_status = dapls_modify_qp_state_to_reset (ep_ptr->qp_handle); > >@@ -776,15 +1036,18 @@ dapls_ib_disconnect ( > NULL, > ep_ptr ); > ep_ptr->cm_handle = NULL; >- dapl_os_free( cm_ptr, sizeof( *cm_ptr ) ); > } >- >+#endif > /* modify QP state --> INIT */ > dapls_ib_reinit_ep(ep_ptr); > >+ if (cm_ptr == NULL) > return DAT_SUCCESS; >+ else >+ return dapli_socket_disconnect(cm_ptr); > } > >+ > /* > * dapls_ib_disconnect_clean > * >@@ -874,14 +1137,20 @@ 
dapls_ib_remove_conn_listener ( > if ( cm_ptr != NULL ) { > if ( cm_ptr->l_socket >= 0 ) { > closesocket( cm_ptr->l_socket ); >+ cm_ptr->l_socket = -1; >+ } >+ if ( cm_ptr->socket >= 0 ) { >+ closesocket( cm_ptr->socket ); > cm_ptr->socket = -1; > } > /* cr_thread will free */ > sp_ptr->cm_srvc_handle = NULL; >+ _write(g_scm_pipe[1], "w", sizeof "w"); > } > return DAT_SUCCESS; > } > >+ > /* > * dapls_ib_accept_connection > * >@@ -928,7 +1197,7 @@ dapls_ib_accept_connection ( > return status; > } > >- return ( dapli_socket_accept_final(ep_ptr, cr_ptr, >p_size, p_data) ); >+ return ( dapli_socket_accept_usr(ep_ptr, cr_ptr, >p_size, p_data) ); > } > > >@@ -948,27 +1217,39 @@ dapls_ib_accept_connection ( > * DAT_INTERNAL_ERROR > * > */ >+ > DAT_RETURN > dapls_ib_reject_connection ( >- IN dp_ib_cm_handle_t ib_cm_handle, >+ IN dp_ib_cm_handle_t cm_ptr, > IN int reject_reason, >- IN DAT_COUNT private_data_size, >- IN const DAT_PVOID private_data) >+ IN DAT_COUNT psize, >+ IN const DAT_PVOID pdata) > { >- ib_cm_srvc_handle_t cm_ptr = ib_cm_handle; >+ WSABUF iovec[1]; >+ int len; > > dapl_dbg_log (DAPL_DBG_TYPE_EP, >- "dapls_ib_reject_connection(cm_handle %p >reason %x)\n", >- ib_cm_handle, reject_reason ); >- >- /* just close the socket and return */ >- if ( cm_ptr->socket > 0 ) { >- closesocket( cm_ptr->socket ); >+ " reject(cm %p reason %x pdata %p psize %d)\n", >+ cm_ptr, reject_reason, pdata, psize ); >+ >+ /* write reject data to indicate reject */ >+ if (cm_ptr->socket >= 0) { >+ cm_ptr->dst.rej = (uint16_t)reject_reason; >+ cm_ptr->dst.rej = cl_hton16(cm_ptr->dst.rej); >+ iovec[0].buf = (char*)&cm_ptr->dst; >+ iovec[0].len = sizeof(ib_qp_cm_t); >+ (void) WSASend (cm_ptr->socket, iovec, 1, >&len, 0, 0, NULL); >+ closesocket(cm_ptr->socket); > cm_ptr->socket = -1; > } >+ >+ /* cr_thread will destroy CR */ >+ cm_ptr->state = SCM_REJECTED; >+ _write(g_scm_pipe[1], "w", sizeof "w"); > return DAT_SUCCESS; > } > >+ > /* > * dapls_ib_cm_remote_addr > * >@@ -1157,7 +1438,7 
@@ dapls_ib_get_dat_event ( > > > /* >- * dapls_ib_get_dat_event >+ * dapls_ib_get_cm_event > * > * Return a DAT connection event given a provider CM event. > * >@@ -1189,12 +1470,16 @@ dapls_ib_get_cm_event ( > } > #endif /* NOT_USED */ > >-/* async CR processing thread to avoid blocking applications */ >+/* outbound/inbound CR processing thread to avoid blocking >applications */ >+ >+#define SCM_MAX_CONN (8 * sizeof(fd_set)) >+ > void cr_thread(void *arg) > { > struct dapl_hca *hca_ptr = arg; > ib_cm_srvc_handle_t cr, next_cr; > int max_fd, rc; >+ char rbuf[2]; > fd_set rfd, rfds; > struct timeval to; > >@@ -1202,10 +1487,12 @@ void cr_thread(void *arg) > > dapl_os_lock( &hca_ptr->ib_trans.lock ); > hca_ptr->ib_trans.cr_state = IB_THREAD_RUN; >+ > while (hca_ptr->ib_trans.cr_state == IB_THREAD_RUN) { > > FD_ZERO( &rfds ); >- max_fd = -1; >+ FD_SET(g_scm_pipe[0], &rfds); >+ max_fd = g_scm_pipe[0]; > > if >(!dapl_llist_is_empty((DAPL_LLIST_HEAD*)&hca_ptr->ib_trans.list)) > next_cr = dapl_llist_peek_head((DAPL_LLIST_HEAD*) >@@ -1230,32 +1517,46 @@ void cr_thread(void *arg) > continue; > } > >- FD_SET( cr->l_socket, &rfds ); /* add to select set */ >- if ( cr->l_socket > max_fd ) >+ if (cr->socket > SCM_MAX_CONN-1) { >+ dapl_dbg_log(DAPL_DBG_TYPE_ERR, >+ "SCM ERR: cr->socket(%d) exceeded >FD_SETSIZE %d\n", >+ cr->socket,SCM_MAX_CONN-1); >+ continue; >+ } >+ FD_SET( cr->socket, &rfds ); /* add to select SET */ >+ if ( cr->socket > max_fd ) > max_fd = cr->l_socket; > > /* individual select poll to check for work */ > FD_ZERO(&rfd); >- FD_SET(cr->l_socket, &rfd); >+ FD_SET(cr->socket, &rfd); > dapl_os_unlock(&hca_ptr->ib_trans.lock); > > to.tv_sec = 0; > to.tv_usec = 0; /* wakeup and check destroy */ > > /* block waiting for Rx data */ >- if (select(cr->l_socket+1,&rfd,NULL,NULL,&to) == >SOCKET_ERROR) { >+ if (select(cr->socket+1,&rfd,NULL,NULL,&to) == >SOCKET_ERROR) { > rc = WSAGetLastError(); > if ( rc != SOCKET_ERROR /*WSAENOTSOCK*/ ) > { > dapl_dbg_log 
(DAPL_DBG_TYPE_ERR/*CM*/, > " thread: select(sock %d) ERR >%d on cr %p\n", >- cr->l_socket, rc, cr); >+ cr->socket, rc, cr); >+ } >+ closesocket(cr->socket); >+ cr->socket = -1; >+ } else if (FD_ISSET(cr->socket,&rfd)) { >+ if (cr->socket > 0) { >+ if (cr->state == SCM_LISTEN) >+ dapli_socket_accept(cr); >+ else if (cr->state == SCM_ACCEPTED) >+ dapli_socket_accept_rtu(cr); >+ else if (cr->state == SCM_CONN_PENDING) >+ dapli_socket_connect_rtu(cr); >+ else if (cr->state == SCM_CONNECTED) >+ dapli_socket_disconnect(cr); > } >- closesocket(cr->l_socket); >- cr->l_socket = -1; >- } else if (FD_ISSET(cr->l_socket,&rfd) && >dapli_socket_accept(cr)) { >- closesocket(cr->l_socket); >- cr->l_socket = -1; > } > dapl_os_lock( &hca_ptr->ib_trans.lock ); > next_cr = dapl_llist_next_entry((DAPL_LLIST_HEAD*) >@@ -1263,9 +1564,19 @@ void cr_thread(void *arg) > >(DAPL_LLIST_ENTRY*)&cr->entry ); > } > dapl_os_unlock( &hca_ptr->ib_trans.lock ); >+ > to.tv_sec = 0; > to.tv_usec = 100000; /* wakeup and check destroy */ >+ > (void) select(max_fd+1, &rfds, NULL, NULL, &to); >+ >+ /* if pipe data consume - used to wake this thread up */ >+ if (FD_ISSET(g_scm_pipe[0],&rfds)) { >+ dapl_dbg_log(DAPL_DBG_TYPE_CM," cr_thread() >read pipe data\n"); >+printf(" cr_thread() read pipe data\n"); >+ _read(g_scm_pipe[0], rbuf, 2); >+printf(" cr_thread() Finished read pipe data\n"); >+ } > dapl_os_lock( &hca_ptr->ib_trans.lock ); > } > dapl_os_unlock( &hca_ptr->ib_trans.lock ); >diff --git a/dapl/ibal-scm/dapl_ibal-scm_util.c >b/dapl/ibal-scm/dapl_ibal-scm_util.c >index 8e5f8ac..06bc704 100644 >--- a/dapl/ibal-scm/dapl_ibal-scm_util.c >+++ b/dapl/ibal-scm/dapl_ibal-scm_util.c >@@ -52,6 +52,7 @@ static const char rcsid[] = "$Id: $"; > #include "dapl.h" > #include "dapl_adapter_util.h" > #include "dapl_ibal_util.h" >+#include "dapl_ibal_name_service.h" > > #include > #include >@@ -61,9 +62,12 @@ static const char rcsid[] = "$Id: $"; > #include > #include > #include >+#include > >+extern void 
cr_thread(void *arg); > > int g_dapl_loopback_connection = 0; >+int g_scm_pipe[2]; > > #ifdef NOT_USED > >@@ -132,22 +136,55 @@ DAT_RETURN dapli_init_sock_cm ( IN >DAPL_HCA *hca_ptr ) > > dapl_dbg_log (DAPL_DBG_TYPE_UTIL, " %s(): >%p\n",__FUNCTION__,hca_ptr ); > >- /* set inline max with enviroment or default */ >+ /* set RC tunables via enviroment or default */ > hca_ptr->ib_trans.max_inline_send = >- dapl_os_get_env_val ( "DAPL_MAX_INLINE", >INLINE_SEND_DEFAULT ); >+ dapl_os_get_env_val("DAPL_MAX_INLINE", >INLINE_SEND_DEFAULT); >+#if 0 >+ hca_ptr->ib_trans.ack_retry = >+ dapl_os_get_env_val("DAPL_ACK_RETRY", SCM_ACK_RETRY); >+ hca_ptr->ib_trans.ack_timer = >+ dapl_os_get_env_val("DAPL_ACK_TIMER", SCM_ACK_TIMER); >+ hca_ptr->ib_trans.rnr_retry = >+ dapl_os_get_env_val("DAPL_RNR_RETRY", SCM_RNR_RETRY); >+ hca_ptr->ib_trans.rnr_timer = >+ dapl_os_get_env_val("DAPL_RNR_TIMER", SCM_RNR_TIMER); >+ hca_ptr->ib_trans.global = >+ dapl_os_get_env_val("DAPL_GLOBAL_ROUTING", SCM_GLOBAL); >+ hca_ptr->ib_trans.hop_limit = >+ dapl_os_get_env_val("DAPL_HOP_LIMIT", SCM_HOP_LIMIT); >+ hca_ptr->ib_trans.tclass = >+ dapl_os_get_env_val("DAPL_TCLASS", SCM_TCLASS); >+#endif > > /* initialize cr_list lock */ > dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.lock); > if (dat_status != DAT_SUCCESS) > { > dapl_dbg_log (DAPL_DBG_TYPE_ERR, >- " open_hca: failed to init lock\n"); >+ "%s() failed to init cr_list lock\n", >__FUNCTION__); > return DAT_INTERNAL_ERROR; > } > >+#if 0 >+ /* initialize cq_lock */ >+ dat_status = dapl_os_lock_init(&hca_ptr->ib_trans.cq_lock); >+ if (dat_status != DAT_SUCCESS) { >+ dapl_log(DAPL_DBG_TYPE_ERR, >+ "%s() failed to init cq_lock\n", >__FUNCTION__); >+ return DAT_INTERNAL_ERROR; >+ } >+#endif >+ > /* initialize CM list for listens on this HCA */ > >dapl_llist_init_head((DAPL_LLIST_HEAD*)&hca_ptr->ib_trans.list); > >+ /* create pipe communication endpoints */ >+ if (_pipe(g_scm_pipe, 256, O_TEXT)) { >+ dapl_dbg_log (DAPL_DBG_TYPE_ERR, >+ "%s() failed to 
create thread\n", >__FUNCTION__); >+ return DAT_INTERNAL_ERROR; >+ } >+ > /* create thread to process inbound connect request */ > hca_ptr->ib_trans.cr_state = IB_THREAD_INIT; > dat_status = dapl_os_thread_create(cr_thread, >@@ -199,6 +236,7 @@ DAT_RETURN dapli_close_sock_cm ( IN >DAPL_HCA *hca_ptr ) > > /* destroy cr_thread and lock */ > hca_ptr->ib_trans.cr_state = IB_THREAD_CANCEL; >+ > while (hca_ptr->ib_trans.cr_state != IB_THREAD_EXIT) { > dapl_dbg_log(DAPL_DBG_TYPE_UTIL, > " close_hca: waiting for cr_thread\n"); > > > >_______________________________________________ >ofw mailing list >ofw at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw > From rdreier at cisco.com Fri Jan 30 19:54:12 2009 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 30 Jan 2009 19:54:12 -0800 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090130.135107.116108989.davem@davemloft.net> (David Miller's message of "Fri, 30 Jan 2009 13:51:07 -0800 (PST)") References: <20090130065721.GA4886@gondor.apana.org.au> <20090130.135107.116108989.davem@davemloft.net> Message-ID: > > I don't believe this is accurate. Calling skb_linearize() (on a kernel > > with CONFIG_HIGHMEM set) can end up calling local_bh_enable() in > > kunmap_skb_frag(), which can obviously cause problems if the initial > > context relies on having BHs disabled (as hard_start_xmit does). > local_bh_{enable,disable}() nests, so this is not a problem Duh. OK, then the only bugs seem to be that iw_nes does skb_linearize with irqs off (due to being an LLTX driver), and mv643xx_eth leaks an skb on its error path if skb_linearize fails. - R. 
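The nesting point made in the reply above can be pictured with a toy userspace model. This is only a sketch under stated assumptions: `struct bh_model` and the `model_*` helpers are hypothetical names invented for illustration, not kernel symbols, and the kernel's real bookkeeping is more involved. It shows just one thing: an inner disable/enable pair leaves the outer section still disabled, and deferred work is held back until the outermost enable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only (hypothetical names, not kernel code):
 * a nesting depth plus a flag for work raised while disabled. */
struct bh_model {
    unsigned int depth;  /* outstanding disable calls */
    bool pending;        /* deferred work waiting to run */
    unsigned int runs;   /* times deferred work actually ran */
};

/* Enter a "bottom halves disabled" section; may be nested. */
static void model_bh_disable(struct bh_model *m)
{
    m->depth++;
}

/* Mark deferred work as pending; it does not run here. */
static void model_raise(struct bh_model *m)
{
    m->pending = true;
}

/* Leave one level of the section. Pending work runs only when
 * the outermost level is left, i.e. when depth returns to 0. */
static void model_bh_enable(struct bh_model *m)
{
    assert(m->depth > 0);  /* an unbalanced enable is a bug */
    m->depth--;
    if (m->depth == 0 && m->pending) {
        m->pending = false;
        m->runs++;
    }
}
```

In this model, calling `model_bh_disable()` twice, raising work, and then calling `model_bh_enable()` once leaves `depth` at 1 and `runs` at 0; only the second enable runs the work. The model says nothing about calling such a helper with hardware interrupts off, which is the separate iw_nes issue noted above.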
From Jie.Cai at cs.anu.edu.au Fri Jan 30 20:47:45 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Sat, 31 Jan 2009 15:47:45 +1100
Subject: [ofa-general] Multiports single HCA uDAPL program problem
In-Reply-To:
References: <20090129200005.20863E61234@openfabrics.org> <4982A3D8.5030503@cs.anu.edu.au>
Message-ID: <4983D7F1.9050002@cs.anu.edu.au>

It does fix the problem. Thanks a lot.

Jie

--
Mr. Jie Cai
Department of Computer Science
Faculty of Engineering and Information Technology
College of Engineering & Computer Science
CSIT Building (108), North Road
The Australian National University
Canberra ACT 0200 Australia
Email: Jie.Cai at cs.anu.edu.au
Tel: +61-2-61251451
Fax: +61-2-61250010
Web: http://cs.anu.edu.au/~Jie.Cai
Mobile: 0433992958

Davis, Arlin R wrote:
> This looks like an ARP issue across your IPoIB interfaces.
>
> Please see section 6 of the uDAPL OFED BKM.
>
> http://www.openfabrics.org/downloads/dapl/documentation/uDAPL_ofed_testing_bkm.pdf
>
> 6. Multi IB port configuration, IPoIB arp reply issues
>
> When two interfaces running one interface may reply to an ARP
> directed to the other interface on the system. The following
> configuration will cause the interfaces to ignore ARP requests if
> not specifically for their IP address.
>
> Add the following lines to /etc/sysctl.conf
> net.ipv4.conf.all.arp_ignore=1
> net.ipv4.conf.ib0.arp_ignore=1
> net.ipv4.conf.ib1.arp_ignore=1
>
> or use sysctl:
> sysctl -w net.ipv4.conf.all.arp_ignore=1
> sysctl -w net.ipv4.conf.ib0.arp_ignore=1
> sysctl -w net.ipv4.conf.ib1.arp_ignore=1
>
> -arlin
>
>
>> -----Original Message-----
>> From: general-bounces at lists.openfabrics.org
>> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Jie Cai
>> Sent: Thursday, January 29, 2009 10:53 PM
>> To: general at lists.openfabrics.org
>> Subject: [ofa-general] Multiports single HCA uDAPL program problem
>>
>> Hi All,
>>
>> I am kind of noob on IB and uDAPL program.
Currently, I am trying to
>> write a program with multirail that utilizes 2 ports on a
>> single Mallenox
>> ConnectX HCA on both nodes.
>>
>> OFED1.3 has been installed on a SUSE 10.3 linux system.
>>
>> The current problem is that IB connection via uDAPL are very unstable,
>> and sometime the connection can't be established.
>> Error message is usually like:
>>
>> 20350 Server waiting for connect request on port 45248
>> accept: ERR dev(0x61d0e0!=0x61d0e0) or port mismatch(1!=2)
>> 20350 Error dat_cr_accept: DAT_INTERNAL_ERROR
>> 20350 Error connect_ep: DAT_INTERNAL_ERROR
>>
>> The status of both port are active:
>> hca_id: mlx4_0
>>     fw_ver: 2.3.000
>>     node_guid: 0003:ba00:0100:702c
>>     sys_image_guid: 0003:ba00:0100:702f
>>     vendor_id: 0x02c9
>>     vendor_part_id: 25418
>>     hw_ver: 0xA0
>>     board_id: SUN0070000001
>>     phys_port_cnt: 2
>>         port: 1
>>             state: PORT_ACTIVE (4)
>>             max_mtu: 2048 (4)
>>             active_mtu: 2048 (4)
>>             sm_lid: 10
>>             port_lid: 8
>>             port_lmc: 0x00
>>
>>         port: 2
>>             state: PORT_ACTIVE (4)
>>             max_mtu: 2048 (4)
>>             active_mtu: 2048 (4)
>>             sm_lid: 10
>>             port_lid: 9
>>             port_lmc: 0x00
>>
>>
>> I haven't done any specific configuration for multi-port. I assume that
>> OFED1.3 can do it automatically.
>>
>> Would please any one help me on this?
>>
>> Regards,
>> Jie
>>
>> --
>> Jie Cai
>>
>>
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>
>

From vlad at lists.openfabrics.org Sat Jan 31 03:12:49 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 31 Jan 2009 03:12:49 -0800 (PST)
Subject: [ofa-general] ofa_1_4_kernel 20090131-0200 daily build status
Message-ID: <20090131111249.D1947E61084@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed: